Photo by Clark Tibbs on Unsplash
Learning what to learn: a guide for intermediate data scientists
The skills you need to ascend with your data science practice
Learning technical skills is a challenging quest. The journey is not always smooth, and learners often go through phases. This story is the sequel to my first part, Learning How to Learn: A Guide for Aspiring Data Scientists, and here I want to extend the map of learning for you. You may ask what an intermediate data scientist is and how we can quantify these levels. Frankly, I have no data-driven definition in mind right now, so if you agree, let's adopt some heuristics for the sake of simplicity.
At a certain point in your growth, curated courses will fall short of fulfilling your learning objectives. This stage of learning is often characterized by your ability to consume primary sources of knowledge, such as documentation, to build your understanding and intuition.
Hypothetically, if the data science learning journey is similar to learning to code, then we can assume that an intermediate level of competence manifests as the ability to translate a business problem into data product requirements and a solution, the ability to validate and analyse model performance, and the ability to build a successful proof of concept. If you have been working through these stages on any data science product, you can safely assume you are at the stage where this story matters for you.
There are several ways in which data scientists can demo their project ideas. In my experience, two frameworks really helped me do it effectively: Dash and Streamlit. They were my go-to tools, and I championed their adoption in every team I joined. If you are still at a stage where creating a proof of concept (POC) with a nice, simple UI is not required, this story can still hold value in showing you what to learn next.
FastAPI to build APIs
Learning to build an API is a critical skill for a data scientist thinking of getting her hands on production work. This skill will not only be your selling point when talking to small teams that are hiring but will also be a competitive advantage for you. It allows you to maximize your impact and drive more projects to that production line. There are many challenges in putting a model into production, but API development skills should not be one of them.
Although there are several frameworks you can use, I am biased toward a great one: FastAPI. I have written a post to help you understand the features and advantages of FastAPI for data scientists. The FastAPI documentation is the best course out there; it is almost the gold standard of good documentation.
Docker and containerised applications
There was a time when the world of tech was very different from what it is today. Virtualisation has simplified many of our needs and improved how we build systems. The value of learning about containerisation, and especially Docker, is that it brings ease and robustness to your products. A model API built with FastAPI is commonly deployed in a Docker container, and you will often come across tools that are easily deployed as containers. The use cases are many, and today Docker containers are standard practice, so knowing them makes you more agile.
There are a few things you need to understand gradually as you learn about Docker:
- What is the use-case it solves?
- The difference between virtual machines and containerization.
- Building a Docker image.
- Running a Docker container from an image.
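The last two steps in the list above can be sketched from Python by shelling out to the Docker CLI. This is only an illustration under assumptions: it presumes Docker is installed, that a `Dockerfile` exists in the build context, and the image tag `model-api:latest` and the ports are hypothetical. The helper names are invented for this example.

```python
import subprocess
from typing import List


def docker_build_command(image_tag: str, context_dir: str = ".") -> List[str]:
    """Command that builds an image from the Dockerfile in context_dir."""
    return ["docker", "build", "-t", image_tag, context_dir]


def docker_run_command(image_tag: str, host_port: int = 8000,
                       container_port: int = 8000) -> List[str]:
    """Command that runs a container and maps a host port to a container port."""
    return ["docker", "run", "--rm", "-p",
            f"{host_port}:{container_port}", image_tag]


def execute(command: List[str]) -> None:
    """Run the command and raise if Docker exits with an error."""
    subprocess.run(command, check=True)


# Example usage (requires a running Docker daemon, so it is left commented out):
#   execute(docker_build_command("model-api:latest"))
#   execute(docker_run_command("model-api:latest", host_port=8080))
```

Keeping the command construction separate from execution makes it easy to print and inspect the exact `docker build` and `docker run` invocations before running them.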
The best course for this is from the amazing freeCodeCamp:
If that two-hour course is too much, you may check this article by one of the Hashnode community members:
The cloud is where all the action takes place. Building a functional knowledge of cloud services and concepts is an operational requirement, especially if you aspire to take ownership of the complete lifecycle. This is my personal opinion, informed by my working experience in the field of data. There are not many cloud vendors, and three are the most common: Amazon, Google, and Microsoft with Azure. There are other vendors too, such as DigitalOcean and Linode.
Amazon Web Services (AWS) is the most common and widely adopted. Learning AWS will help you learn and understand any other cloud, while ensuring that you know the most widely used vendor. There are a lot of resources for learning AWS, but if you are completely new to the concept of cloud, you can start with this AWS-made video:
There are a few main services that you need to understand and get solid hands-on experience building on. These services are:
- Elastic Compute Cloud (EC2), which is a virtual server: understanding how to create an instance, access it using SSH, and configure security groups.
- Simple Storage Service (S3): creating a bucket, pushing a file to it, and reading a file from it using the Python boto3 library and the CLI.
- Relational Database Service (RDS): creating an instance, connecting to the database, and reading and writing data in a table.
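As an illustration of the S3 item above, here is a small sketch of pushing a file to a bucket and reading it back with boto3. It assumes `boto3` is installed and AWS credentials are configured. The bucket name, key, and helper function names are hypothetical. The helpers take the client as an argument so they can be tried with a stub before touching a real account.

```python
from typing import Any


def upload_file_to_s3(s3_client: Any, local_path: str,
                      bucket: str, key: str) -> None:
    """Push a local file to s3://bucket/key using the client's upload_file."""
    s3_client.upload_file(local_path, bucket, key)


def read_text_from_s3(s3_client: Any, bucket: str, key: str) -> str:
    """Fetch an object and decode its body as UTF-8 text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode("utf-8")


# Example usage with a real client (names below are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_file_to_s3(s3, "data.csv", "my-hypothetical-bucket", "raw/data.csv")
#   print(read_text_from_s3(s3, "my-hypothetical-bucket", "raw/data.csv"))
```

The rough CLI equivalent of the upload is `aws s3 cp data.csv s3://my-hypothetical-bucket/raw/data.csv`, again with a placeholder bucket name.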
The next services that you can learn about are:
- IAM: Identity and Access Management. Learning your way around access management will improve the security of your practice.
- ECR: Elastic Container Registry is your gateway to deploying containerised applications in the cloud, either onto an EC2 instance or into an ECS cluster.
- DynamoDB: adding this NoSQL storage will give your solutions an added advantage in more ways than one. I am starting to think you should learn it first in this list.
As you go on to learn more advanced concepts, the types of learning resources narrow down until you are left with the primary sources. The documentation of the various projects is your first choice; the second source is your own experimentation and hands-on journey.
The knowledge you gain from learning is compounded by an order of magnitude when you apply it in the next project. You can learn many concepts and new technologies, but that learning remains limited until you start working on problems to solve. In my experience, there is always a set of concepts that stretches across many use cases, and each time you apply the same skills to a different problem. The beauty of this is that your skills and experience are enriched each time you work on a new use case.
I hope this list of recommended learning guides you through. May your accuracy be high and may your models not overfit!