Data science is an exciting field with a wide range of applications and many skills to acquire. If you are reading this, I am sure you have already gone through long lists of resources, courses and books. I am trying to do something different here: lay out the minimum viable knowledge you need to get into data science. Let's call it MVK for data science. To be fair and realistic, I have to give you the not-so-good news first: data science means different things to different people. At some companies it is extensive Excel work, whereas at others it is Spark jobs and parallel processing. Between the two extremes lie most companies, with requirements that vary in how technical they are.
This MVK guide is meant for mainstream data science practice. It's hard to say what's mainstream today; by it I mean the skills and knowledge that are core, without any advanced or specialised requirements. This curated list of skills and resources is my own, so it represents my experience and personal opinion. Almost every course here is something I have taken personally or recommended to more than one of my mentees.
Minimum Viable Knowledge (MVK) for data science to help you start solid
These are core data science skills, and you should be exposed to at least the main concepts. I do think that having an overview is good enough; however, some accumulated depth of knowledge will take you further in your journey. I don't claim that you need to know everything in this guide to land an internship or an entry-level position. Nonetheless, the more you know, the higher your chances are, because you put yourself near the top of the list in terms of potential.
There is no training data set in real life; that pristine CSV doesn't exist. There could be a data warehouse or multiple data storage systems with hundreds of tables where the organization stores its data. It's your responsibility to dive in and build a dataset for the use case at hand. Since almost every data store has SQL or an SQL-like interface for interacting with data, it's fundamental to know how to SQL your way into these systems.
I would go further and say that you will spend three quarters of your time interacting with data. Pre- and post-modelling work requires you to perform a variety of tasks, from exploring to analysing data. Being fluent in SQL is essential, so you should build good SQL practices as you grow.
- Introduction to SQL from Kaggle is highly recommended
- Advanced SQL builds on the intro course and strengthens your SQL foundations
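To make "building a dataset" a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The in-memory database stands in for a real warehouse, and the `customers` and `orders` tables, their columns and their values are all made up for illustration. The idea is simply to join two tables and aggregate them into one row per customer.

```python
import sqlite3

# In-memory database standing in for a warehouse; all tables, columns and
# values are hypothetical and exist only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'SD'), (2, 'KE'), (3, 'EG');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.0), (12, 2, 15.0);
""")

# One row per customer: join the tables and aggregate the orders.
# LEFT JOIN keeps customers who never ordered; COALESCE turns their NULL into 0.
query = """
    SELECT c.customer_id,
           c.country,
           COUNT(o.order_id)          AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_spent
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.country
    ORDER BY c.customer_id
"""
dataset = conn.execute(query).fetchall()
for row in dataset:
    print(row)
conn.close()
```

Customer 3 has no orders but still appears in the result with zero orders, which is the kind of detail (inner versus left join) that matters when the result becomes a modelling dataset.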
There are more courses and books for learning Python than I could count. So here is what I think the best book or course is: the one that works for you! Myself, I loved the book "Python for Informatics" by Charles Severance and was delighted when I found the course "Python for Everybody" on Coursera. The book and course were made for beginners, with curated examples and problems that were interesting enough to keep me engaged. Note that it is a general Python course, so it builds your skills in Python as a programming language; it is not the course that teaches pandas and NumPy.
The book is now mostly known as Python for Everybody and is available as a course on most platforms. It's also available for free online, along with its exercises, code and a PDF version.
- Book PDF:
- Book Online:
- YouTube playlist:
- GitHub source code:
Now that the course on Python as a programming language is out of the way, you may ask: what about a data science-oriented course? You can learn this from different resources as well; however, I would recommend the first two courses from the Coursera specialization "Applied Data Science with Python" by the University of Michigan. The learning curve steepens somewhat as you move from the first course to the second.
- Applied Data Science with Python:
The best course in machine learning is the one on Coursera by Andrew Ng. Since its launch in 2012, millions have taken that course, including me. The second-best course is the new version of the same course by Andrew Ng, launched this year (2022). It's a remake of the first course that switches to teaching in Python. I cannot do this course justice no matter what I say. There is more to the course than just the topics; the impeccable delivery of concepts is simply unmatched.
- Machine learning specialization by Andrew Ng:
Machine learning is the new statistics
Regardless of how true the statement is, what I want you to recognise here is the connection it points to between the two fields. A solid foundation in statistics will therefore strengthen your ability to understand much of the inner workings of machine learning algorithms. Here are a few courses that I found interesting:
Introduction to Statistics by Stanford (Free course):
Statistics concepts explained:
I believe that having a grasp of the concepts is essential to make sense of various topics and new concepts in data science. It also helps build good intuition. Knowing the deeper details and how to do the calculations is important too, but you can pick those up gradually as you are exposed to more cases and problems.
Some skills make you a better candidate or set you up for success in data science. They are not talked about a lot, but they are expected of anyone working in the data field, or even in tech generally. Here are the most important ones you should pay attention to.
Data scientists do various tasks in various organizations, but one task is common to all of them: they constantly need to present results and insights. This single skill sets data scientists apart. In data science, finesse is not only about building the 98% accurate model, but about rallying everyone to ship a data product into production. You need to be a great communicator to drive projects and to get people in your organization to carry your insights through to implementation.
Communication to me has the following aspects:
- Understanding the matter at hand. You are the subject matter expert and you should understand the matter better than anyone in the room.
- Content structure. Your ideas need a flow that connects them into a cohesive story. The results have to be organized so that they are smooth and linked.
- Delivery. Here lies the make-or-break of your quest: the ability to deliver results or insights with as little friction as possible. Success here could take the form of people's positive reactions or questions about how to move to the next step.
There are no courses to recommend; it's more of a process of continuous learning. However, here are a few tips that helped me along the way.
- Learn in public: tweet, start a blog on Hashnode and document the topics you learn. This will connect you to more people and build your network.
- Get involved in the tech community in your town or online via Discord and similar platforms. Step up at times and explain what you learned over the weekend in an hour-long virtual call. The more you speak at meetups, the better you communicate.
- Try to speak to people from non-tech backgrounds and perhaps try to explain AI to them in simple terms. This helps you learn to abstract away complexity, and people will like you for it.
Bonus video from MIT which I found to be interesting and helpful:
Git and version control
Version control systems are nothing new, and I considered leaving them out. However, after seeing how many beginners struggle to make sense of version control during internships, I decided to bring it up. You must have the basic idea and know your way around GitHub. You need to know three basic operations: creating a repository, cloning it, and pushing changes.
Git and GitHub crash course:
Working with the command line is a good skill to add. Since most production servers are Linux-based, familiarity with the command line will help you do more.
Hands-on Introduction to Linux Commands and Shell Scripting:
Keep learning, and work out which style of learning suits you best. In my journey I found case-based learning to be the best. For example, say my model reaches 100% accuracy in training but drops to 60% in testing. I can start from there and work out everything needed to solve the problem. That is when most of the concepts I learned in theory make more sense and become solid.
If you are wondering where the math, calculus and algebra courses are: well, don't worry about it.
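A minimal sketch of that kind of case, using only the Python standard library. Everything in it is an assumption chosen for illustration: the synthetic one-dimensional data, the 30% label noise, and the hand-written 1-nearest-neighbour classifier, which simply memorises its training set. It is meant to reproduce the symptom (perfect training accuracy, lower test accuracy), not any particular real project.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(n, noise=0.3):
    """Synthetic points in [0, 1); the true label is 1 when x > 0.5,
    but a fraction `noise` of the labels is flipped at random."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Share of points in `data` whose label the 1-NN model gets right."""
    hits = sum(predict_1nn(train, x) == label for x, label in data)
    return hits / len(data)

train = make_data(200)
test = make_data(200)

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(train, test)    # fresh points expose the memorised noise
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Starting from a gap like this, you can dig into why it happens (the model has memorised noisy labels) and what helps, such as a simpler model, more neighbours, or more data, and the theory around overfitting starts to stick.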
In my personal opinion:
- You need math when you feel your math background isn't enough to understand a concept you are currently learning. At that point, go to YouTube and learn that concept.
- You don't need to build algorithms from scratch; it's more important to understand the intuition behind them. Coding algorithms from scratch can be a hobby, but don't expect people to be building algorithms from scratch in their day-to-day jobs.
I hope this short list is minimal enough and yet meaningful to help you get started or perhaps brush up on some skills. If you have taken any of the courses listed here feel free to comment and share your thoughts.