Photo by Hector Bermudez on Unsplash
How to Rock Data Science One Project at a Time
The actionable steps to build and communicate data science products
Background
Once upon a time, I opened my Jupyter notebook to show a C-suite manager my work on a project. He never asked me to show my work again. If you don't see what happened here, imagine the following: a passionate fresh grad is asked by a manager to show his work. The young and broke fresh grad, full of excitement, presents his Python code. Nobody had acknowledged my existence in this big organisation, and finally it was my moment to show my magic; or so I thought. Presenting my code was the most spectacular mistake I have made so far in my career. Since then, I have become attached to showing my models working via prototypes instead.
Why was this a mistake? Well, I am glad you asked. The manager didn't understand Python and couldn't care less about code in general. He was asking for reports, insights, or anything else that could represent my findings. This was a manager, not a groupmate to whom you can just show the output of a Jupyter notebook.
The essential question is: how do you show your work and progress as a data scientist? Here I want to share what I have practised over years of working in the field and found to work remarkably well.
Documentation
I hope this does not come as a surprise to you, but documents are by far the best ambassadors of your thoughts. I believe that writing is not just about explaining our ideas to others; first and foremost, writing is thinking clearly. As you write, you build connections among your thoughts. The final output is more than a document; it is also clarity.
What to write? There are several forms of writing or documentation one can do, but I will limit this to 3 types.
One page
This document is very simple and straightforward. It is one page that describes what you want to do. You should answer three questions:
- What is this project about? You can describe the problem and give context.
- Why are we doing it? Objectives that you want to achieve.
- How are you going to do it? A high-level description with as little technical jargon as possible.
This document is your first page, and it is the one you will write most often. My rule of thumb is to start one every time an idea lights up above my head. Often this is what managers will look at to get the gist of what you are working on.
Where to write this? Literally anywhere that allows you to share documents or collaborate. It can be as simple as Google Docs. I have used Dropbox Paper and later switched to Notion. Your company might already have a platform for this, or you can simply start one.
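If you like keeping things in code, the three questions can also live in a tiny template. The snippet below is only a sketch of that idea: the `OnePager` class, its field names, and the churn project are all made up for illustration, and a blank page in any shared editor works just as well.

```python
from dataclasses import dataclass


@dataclass
class OnePager:
    """The three answers a one-page project brief should contain."""
    title: str
    what: str  # the problem and its context
    why: str   # the objectives
    how: str   # the approach, in plain language

    def render(self) -> str:
        """Return the brief as plain text, ready to paste into a shared doc."""
        sections = [
            ("What is this project about?", self.what),
            ("Why are we doing it?", self.why),
            ("How are we going to do it?", self.how),
        ]
        lines = [self.title, ""]
        for question, answer in sections:
            lines += [question, answer, ""]
        return "\n".join(lines).rstrip() + "\n"


# Hypothetical project, invented purely for illustration.
brief = OnePager(
    title="Churn early-warning report",
    what="Customers cancel without warning and the retention team finds out too late.",
    why="Flag at-risk accounts early enough for the team to reach out.",
    how="Score each account weekly from its usage history and share a ranked list.",
)
print(brief.render())
```

The point of the structure is that none of the three fields can be skipped: if you cannot fill in the "why", the idea is probably not ready for a manager yet.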
Performance analysis
Start thinking about this document as soon as you make progress on your exploratory data analysis (EDA). More importantly, once you have built a machine learning model, this kind of analysis document is how you keep your stakeholders informed. The medium could be a document page or even a slide deck if you fancy it.
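To make this concrete, here is a rough sketch of how model results could be turned into sentences a stakeholder can read. It assumes a binary classifier where 1 is the positive class; the function names and the toy labels are invented for this example, and in a real project you would likely compute the metrics with your usual library.

```python
def classification_summary(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def plain_language_report(metrics):
    """Turn the metric dictionary into sentences a non-technical reader can scan."""
    return "\n".join([
        f"Overall, {metrics['accuracy']:.0%} of predictions were correct.",
        f"When the model raised a flag, it was right {metrics['precision']:.0%} of the time.",
        f"It caught {metrics['recall']:.0%} of the cases it should have flagged.",
    ])


# Toy labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_summary(y_true, y_pred)
print(plain_language_report(metrics))
```

The numbers stay the same as in your notebook; only the wording changes, and that wording is what ends up in the document or slide.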
Technical documentation
This document is different in nature. The intent here is to describe the technicalities of the project. In this instance, the target audience is fellow data scientists and colleagues. Among the points that you should make clear are:
- The flow of the system or the workflow, preferably drawn as a diagram so that the picture does the explaining.
- The implementation and considerations.
- Links to parts of the code or repository.
Prototyping
Under the hood, a data science project can be many things: machine learning models, preprocessing steps, and even the data pipelines needed to keep the product running. Nonetheless, your work has to be ultra simple and impactful in the beginning. That means thinking of a minimum viable functionality and building it first. Building data products is way easier today than ever. There are several tools out there, but I will mention two:
Dash
Dash is a framework for building web-based data applications. It is written on top of Plotly.js, React.js, and Flask, which makes it ideal for building the web interface of your machine learning models. Dash abstracts away the full stack, which empowers data scientists to build without having to understand or manage the complexity behind the scenes. We call products built on Dash "Dash apps" or "Dash applications".
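To give a feel for it, here is a minimal sketch of a Dash app with one slider and one output. It assumes Dash 2.x is installed. The `predict_score` function is a made-up stand-in for a real model, and the component ids `hours` and `result` are arbitrary names chosen for this example.

```python
def predict_score(hours_active: float) -> float:
    """Placeholder for a trained model; returns a made-up risk score in [0, 1]."""
    return round(max(0.0, min(1.0, 1.0 - hours_active / 40.0)), 2)


def build_app():
    """Assemble the Dash app: a slider as input and a text line as output."""
    # Imported here so predict_score can be tried without Dash installed.
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        html.H3("Churn risk demo"),
        dcc.Slider(id="hours", min=0, max=40, step=1, value=10),
        html.Div(id="result"),
    ])

    # Dash calls this function again whenever the slider value changes.
    @app.callback(Output("result", "children"), Input("hours", "value"))
    def update_result(hours):
        return f"Estimated churn risk: {predict_score(hours):.0%}"

    return app
```

To try it locally, call `build_app().run(debug=True)` (older Dash releases use `run_server` instead) and open the address printed in the terminal. Swapping the placeholder for a real model only means changing the body of `predict_score`.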
Streamlit
Streamlit lets a data scientist build full applications with crisp, clean web interfaces from plain Python scripts. Its power comes from a huge set of pre-built widgets that you can customize, so you can be up and running in a few minutes.
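For comparison, here is roughly what a small Streamlit app could look like. This is a sketch that assumes a recent Streamlit release is installed. The `estimate_risk` function is a hypothetical placeholder for a real model, and the labels are invented for the example.

```python
def estimate_risk(hours_active: float) -> float:
    """Placeholder for a trained model; returns a made-up risk score in [0, 1]."""
    return round(max(0.0, min(1.0, 1.0 - hours_active / 40.0)), 2)


def main():
    """Draw the page: a title, one slider, and one headline number."""
    # Imported here so estimate_risk can be tried without Streamlit installed.
    import streamlit as st

    st.title("Churn risk demo")
    hours = st.slider("Weekly hours active", min_value=0, max_value=40, value=10)
    # Streamlit reruns the script on every widget change, so this stays in sync.
    st.metric("Estimated churn risk", f"{estimate_risk(hours):.0%}")


# To launch the app, save this as app.py, add a final line that calls main(),
# and run: streamlit run app.py
```

There are no callbacks to wire up here: the script simply runs from top to bottom each time the slider moves, which is a large part of why a first prototype takes only minutes.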
How to do it?
Now that you know the most important tools, I think in a simple scenario things will go along the lines of the following flowchart.
Conclusion
Real-life situations might differ, but I hope this paints the picture for you. One step omitted from the flowchart above is the technical documentation, which can be written once the product is deployed or iteratively while development is still ongoing.
Through this text I hope you have learnt about:
- Documentation: types of documents and to whom we write them.
- Prototyping: Python frameworks that are used to build an interactive app.
- MVP: Thinking of a minimum viable product for your project.
- Workflow: First steps to take when you have a new idea or concept.
Remember the following quote by Eric Ries:
The only way to win is to learn faster than anyone else.