<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Farisology: Exploring Data Science and AI Insights for Modern Innovations]]></title><description><![CDATA[Unveiling AI, LLMs, and Data Science Insights. Expert perspectives from a Lead Data Scientist. Explore innovation and practical expertise on Farisology.]]></description><link>https://farisology.com</link><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 20:40:39 GMT</lastBuildDate><atom:link href="https://farisology.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[RAG vs Fine-tuning: Which is Better for Your LLM Strategy?]]></title><description><![CDATA[I. This is real!!
Imagine this scenario: a lawyer uses ChatGPT to assist with legal research for a high-stakes case. He trusts the AI’s capabilities, expecting it to streamline his workflow. However, instead of easing his burden, the AI inadvertently...]]></description><link>https://farisology.com/rag-vs-fine-tuning-which-is-better-for-your-llm-strategy</link><guid isPermaLink="true">https://farisology.com/rag-vs-fine-tuning-which-is-better-for-your-llm-strategy</guid><category><![CDATA[llm]]></category><category><![CDATA[AWS]]></category><category><![CDATA[JavaScript]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[generative ai]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Mon, 29 Apr 2024 18:00:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/nGoCBxiaRO0/upload/b142a4a99704e245ef426b831bd8966b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-i-this-is-real">I. This is real!!</h2>
<p>Imagine this scenario: a lawyer uses ChatGPT to assist with legal research for a high-stakes case. He trusts the AI’s capabilities, expecting it to streamline his workflow. However, instead of easing his burden, the AI inadvertently creates chaos. It generates and suggests completely fictitious legal cases and citations, which the lawyer, unaware of the inaccuracies, includes in his official court documents. This leads to a bewildering situation in court, undermining his credibility and affecting the case outcome.</p>
<p>This story isn't just a cautionary tale; it's a reality we face as businesses and professionals increasingly depend on advanced AI language models without fully understanding their limitations. These tools, while sophisticated, can produce erroneous ‘hallucinated’ information that seems entirely plausible.</p>
<p>In this article, I will help you understand the abilities and challenges of current AI technologies and show how we can move towards more dependable systems. We will explore the details of Retrieval-Augmented Generation (RAG) and fine-tuning, assisting you in determining the most suitable approach for your requirements.</p>
<h2 id="heading-ii-how-llms-are-trained">II. How LLMs are Trained</h2>
<p>Let's start with a quick primer on how these powerful language models are trained. Think of them as digital sponges, soaking up vast amounts of textual data from the internet and books. Through a process called self-supervised learning, they learn to predict missing words and understand the context of sentences. It's like solving a massive fill-in-the-blanks puzzle, training the models to understand and generate human-like language.</p>
<p>Because large language models are trained on such a huge corpus of textual data, they gained revolutionary capabilities compared to their predecessors, but they also manifested new challenges. Today we know that the latest model from Meta, Llama 3, was trained on over 15T tokens of data, seven times the training set of its previous model, Llama 2.</p>
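<p>To make the fill-in-the-blanks idea concrete, here is a minimal sketch using the Hugging Face <code>transformers</code> library. The model choice and example sentence are my own illustrations, not taken from any particular training setup:</p>
<pre><code class="lang-python"># A minimal sketch of the "fill-in-the-blanks" objective using a masked
# language model; the model predicts plausible words for the [MASK] slot,
# which is exactly the self-supervised signal such models are trained on.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
</code></pre>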
<h2 id="heading-iii-challenges-faced-by-llms">III. Challenges Faced by LLMs</h2>
<p>General models like GPT-3.5 and Llama 3 are useful for various tasks. However, they come with challenges whose severity varies with the specific use case. What are these challenges?</p>
<ul>
<li><p>Limited access to up-to-date information</p>
</li>
<li><p>Lack of expertise in specific domains</p>
</li>
<li><p>Lack of factualness and accuracy</p>
</li>
<li><p>Hallucinations</p>
</li>
</ul>
<p>You might not notice this clearly if you ask ChatGPT to write you a bio in Star Wars Jedi style. However, if you ask it to help you answer some law-related questions about the state of California, you might encounter laws that do not exist or references to cases that never happened.</p>
<h2 id="heading-iv-generative-ai-approaches">IV. Generative AI Approaches</h2>
<p>There are at least two core factors we can use to map out these approaches.</p>
<ul>
<li><p>External Data: Dependency on external information is widespread. Organizations often hold data that is unique or private to them and not in the public domain. Gauge how strongly your GenAI product depends on this data.</p>
</li>
<li><p>Capability &amp; Domain Understanding: If the model cannot perform the tasks you expect, or shows a lack of domain understanding, your use case has a higher dependency on this factor.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714411233505/9593edd9-a8f4-4ce2-a18d-6d2c6e142eba.png" alt class="image--center mx-auto" /></p>
<p>The matrix above illustrates a progression from low dependency on both factors (external data and domain understanding), representing use cases that can be solved with prompt engineering alone; you can test this approach with more advanced prompts during evaluation. From there you may end up in the RAG or fine-tuning approaches, which are the main focus of this post. As for fine-tuning+RAG, that is the area where both factors have a high dependency in your use case.</p>
<h3 id="heading-retrieval-augmented-generation-rag">Retrieval-Augmented Generation (RAG)</h3>
<p>RAG is a technique that combines external information retrieval with text generation. In RAG systems, information is retrieved from external sources such as databases or web content and then incorporated into the text generation process. This approach enhances the generated content by grounding it in real-time or domain-specific data, resulting in more accurate and contextually relevant responses.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1714407974870/2c2d3551-6476-4019-bfea-6ab8a8741aaa.png" alt class="image--center mx-auto" /></p>
<p>RAG combines two components: a retriever and a generator. The retriever acts like a smart librarian, scouring external knowledge sources (think Wikipedia, web pages, or specialized databases) to find relevant information for a given input or query. The generator, our trusty language model, then takes that retrieved knowledge and crafts a final output, weaving the facts seamlessly into its response.</p>
<p>Imagine asking an AI assistant powered by RAG, "What are the key events that led to the American Revolution?" The retriever would scour its knowledge base, fetching relevant passages about the Boston Tea Party, the Stamp Act, and other historical events. The generator would then use this retrieved information to construct a well-researched, factual answer, providing a comprehensive overview of the revolutionary events.</p>
<p>You can see here that we have anchored our AI model's answers in facts and information that are highly relevant. This is what makes RAG one of the most sought-after approaches today.</p>
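<p>To make the retriever-generator split concrete, here is a minimal, self-contained sketch of the RAG pattern. The toy corpus, the word-overlap retriever, and the stubbed <code>generate</code> function are all illustrative; a real system would use an embedding model, a vector database, and an LLM API:</p>
<pre><code class="lang-python"># Toy RAG loop: retrieve the most relevant passages, then prepend them to
# the prompt so the generator answers grounded in retrieved facts.
CORPUS = [  # stand-in for Wikipedia, web pages, or specialized databases
    "The Boston Tea Party (1773) was a protest against British taxation.",
    "The Stamp Act of 1765 imposed a direct tax on the American colonies.",
    "The Battle of Hastings took place in 1066 in England.",
]

def score(query, passage):
    # Crude lexical-overlap retriever; real retrievers use embeddings.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q.intersection(p))

def retrieve(query, k=2):
    return sorted(CORPUS, key=lambda passage: score(query, passage), reverse=True)[:k]

def generate(prompt):
    # Placeholder for any LLM completion call.
    return f"[LLM answer grounded in:]\n{prompt}"

def answer(query):
    context = "\n".join(retrieve(query))
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")

print(answer("What events led to the American Revolution?"))
</code></pre>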
<h3 id="heading-fine-tuning-for-domain-or-task-adaptation-and-personalization">Fine-tuning for Domain or Task Adaptation and Personalization</h3>
<p>But what if you want an AI model tailored to a specific domain or task? That's where fine-tuning comes into play. Just like a talented actor preparing for a new role, fine-tuning involves adapting a pre-trained language model to excel in a particular area. It's like giving the model personalized training sessions using task-specific data or carefully crafted prompts.</p>
<p>For instance, let's say you're a legal firm looking to generate error-free contracts and briefs. You could take a general language model and fine-tune it on a vast corpus of legal documents, teaching it the nuances and terminology of the legal domain. Examples of fine-tuning methods (a minimal training sketch follows the list):</p>
<ul>
<li><p><strong>Task Specific</strong>: Fine-tuning often starts with task-specific datasets, where the model is exposed to examples and labelled data relevant to the target task.</p>
</li>
<li><p><strong>Domain Adaptation</strong>: Fine-tuning can be domain-specific, where the model is adapted to perform well in a particular industry or field. For instance, fine-tuning an LLM on medical literature to generate medical reports.</p>
</li>
<li><p><strong>Style Transfer</strong>: Models can be fine-tuned to mimic a specific writing style or tone. For example, training an LLM to generate content in the style of a famous author.</p>
</li>
</ul>
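<p>As a rough illustration of the mechanics, here is a minimal domain-adaptation sketch using Hugging Face Transformers. The base model, the two-sentence "legal corpus", and all hyperparameters are placeholders; real fine-tuning needs a substantial, curated dataset:</p>
<pre><code class="lang-python"># Minimal causal-LM fine-tuning sketch: adapt a small base model to a
# (toy) legal-domain corpus.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # small base model, placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy "legal domain" corpus; in practice, thousands of documents.
corpus = Dataset.from_dict({"text": [
    "This agreement is governed by the laws of the State of California.",
    "The party of the first part shall indemnify the party of the second part.",
]})
tokenized = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True),
                       remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-llm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
</code></pre>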
<h2 id="heading-v-choosing-the-right-technique-rag-or-fine-tuning">V. Choosing the Right Technique: RAG or Fine-tuning?</h2>
<p>So, which technique should you choose: RAG or fine-tuning? The answer depends on your specific needs and resources. Remember the matrix we started this article with.</p>
<blockquote>
<p>RAG correlates with knowledge and expands it for the model, whereas fine-tuning correlates with the skills and capabilities you want the model to acquire or perform better.</p>
</blockquote>
<p>If you're tackling a knowledge-intensive task like open-domain question answering or generating content across various topics, RAG might be your best bet. By tapping into vast external knowledge sources, RAG can provide well-researched, factual outputs on a wide range of subjects.</p>
<p>On the other hand, if you're working on a domain-specific task like medical dialogue systems or technical writing, fine-tuning could be the way to go. By training the model on task-specific data, you can create a highly specialized AI assistant tailored to your particular domain's intricacies.</p>
<p>And for those seeking a truly personalized AI experience, you could combine both techniques. Fine-tune a RAG model on your specific domain data and preferences, unlocking an AI assistant that's not only knowledgeable but also perfectly aligned with your unique needs.</p>
<h2 id="heading-vi-conclusion">VI. Conclusion</h2>
<p>As we delve deeper into the immense capabilities of generative AI, methods such as RAG and fine-tuning are paving the way for new horizons. Advanced iterations of RAG-based systems are being envisioned to enhance performance and address remaining challenges, while LoRA techniques for fine-tuning are instrumental in constructing compact yet potent models. These methods will keep progressing, offering a multitude of opportunities. I am optimistic that humanity can leverage these advancements to enhance livelihoods. Achieving this goal will demand substantial effort, but for now, democratization can help demonstrate the worth and feasibility of these emerging technologies.</p>
<p>So, what's your AI vision? Whether you're an entrepreneur seeking to revolutionize customer service, a researcher pushing the boundaries of natural language processing, or simply someone who loves to tinker with emerging technologies, the time is ripe to dive into the world of RAG and fine-tuning. Unleash the full potential of generative AI and let your imagination soar!</p>
<hr />
<p><a target="_blank" href="https://docs.pinecone.io/examples/notebooks"><em>Pinecone</em></a> <em>documentation can help you take your first baby steps into building a RAG.</em> <a target="_blank" href="https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/gpt-4-langchain-docs.ipynb"><em>Building RAG tutorial using docs</em></a> <em>(Colab notebook).</em></p>
]]></content:encoded></item><item><title><![CDATA[Rethinking AI Training: Lessons from the 'Textbooks is All You Need' Study]]></title><description><![CDATA[Introduction
In an era where magnitude is frequently equated with mastery, the realm of artificial intelligence (AI) has been fervently chasing the creation of ever-larger and more intricate models. The prevailing belief among tech giants seems to be...]]></description><link>https://farisology.com/rethinking-ai-training-lessons-from-the-textbooks-is-all-you-need-study</link><guid isPermaLink="true">https://farisology.com/rethinking-ai-training-lessons-from-the-textbooks-is-all-you-need-study</guid><category><![CDATA[Microsoft]]></category><category><![CDATA[large language models]]></category><category><![CDATA[Python]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[natural language processing]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Tue, 26 Sep 2023 10:41:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/ZPOoDQc8yMw/upload/620326b660d9a02aa7c12befc9eec999.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p>In an era where magnitude is frequently equated with mastery, the realm of artificial intelligence (AI) has been fervently chasing the creation of ever-larger and more intricate models. The prevailing belief among tech giants seems to be that grandeur in AI is synonymous with greatness. Yet, what if our perspective has been skewed? Drawing inspiration from the insightful findings of Microsoft Research in <a target="_blank" href="https://arxiv.org/pdf/2306.11644.pdf">"Textbooks Are All You Need"</a>, I invite you on a journey to explore an alternative viewpoint. This is a brief summary of the team's most important findings on performance and comparisons.</p>
<h2 id="heading-the-importance-of-high-quality-data"><strong>The Importance of High-Quality Data</strong></h2>
<p>In the rapidly evolving landscape of artificial intelligence, data stands as the foundation upon which models are built. The age-old adage "garbage in, garbage out" has never been more relevant than in the context of training AI models. Imagine feeding a child a diet of misinformation. Over time, this child will develop a skewed perception of the world. In much the same way, when we feed our AI models "dirty data," they develop flawed understandings.</p>
<blockquote>
<p>The quality of data determines not just the efficacy of the model, but also its efficiency, ethical implications, and real-world applicability.</p>
</blockquote>
<p>The findings from Microsoft Research's "Textbooks Are All You Need" only serve to emphasize this cardinal principle: like humans, AI thrives when it learns from clean, high-quality data.</p>
<p>The paper fundamentally demonstrates that having "textbook quality" data can be a game-changer for AI training. Instead of using vast datasets that often contain noise, redundancies, or even errors, the researchers showed that curating a high-quality dataset — akin to a well-structured textbook — dramatically improves the learning efficiency of language models for code.</p>
<h3 id="heading-efficient-learning"><strong>Efficient Learning</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695716312485/9e4c373b-02ba-4910-b66c-54695d2c5d37.png" alt class="image--center mx-auto" /></p>
<p>The research showcased that their model, phi-1, trained on high-quality data, surpassed the capabilities of most open-source models on coding benchmarks like HumanEval and MBPP. Remarkably, this was achieved even though phi-1 is smaller in terms of model size and was trained on a dataset that's 100x smaller than what many other models utilize.</p>
<p>Taking a closer look at model performances, GPT-3.5's massive size of 175B parameters stands out when compared to phi-1's smaller 1.3B. However, it's interesting to note that, when tested on the HumanEval benchmark, phi-1 still manages to outpace GPT-3.5, proving that bigger isn't always better.</p>
<h3 id="heading-mimicking-human-learning"><strong>Mimicking Human Learning</strong></h3>
<p>The most effective human learners seek out high-quality educational resources—well-written textbooks, knowledgeable teachers, and clear lessons. AI models, in essence, are no different from these learners. They are digital students. For them, “textbook quality” data, as showcased in the recent research, should become the benchmark.</p>
<p>Here's my own summary of their approaches (a toy sketch of the filtering step follows the list):</p>
<ol>
<li><p><strong>Filtering with GPT-4's Aid</strong>: Before diving into large-scale model training, they used GPT-4 to annotate the quality of a small subset of code samples. It's akin to having an expert review a chapter before publishing a textbook, ensuring what gets included is of high instructional value.</p>
</li>
<li><p><strong>Synthetic Textbook-Quality Datasets</strong>: Emulating the essence of textbooks, they didn't just rely on organic code samples. They employed GPT-3.5 to generate synthetic Python textbooks. This method served a dual purpose:</p>
<ul>
<li><p>Provides a rich source of explanatory text combined with relevant code snippets, similar to textbook lessons and examples.</p>
</li>
<li><p>Introduces diversity by setting constraints, ensuring the model encounters varied coding scenarios and doesn’t just memorize a few patterns.</p>
</li>
</ul>
</li>
<li><p><strong>Focus on Basic Algorithmic Skills</strong>: The content specifically targeted promoting reasoning and fundamental coding skills, much like a beginner’s textbook.</p>
</li>
<li><p><strong>Emphasis on Function Completion</strong>: With the CodeExercises dataset, the model was trained to take a function description and generate corresponding code, akin to problem-solving exercises in textbooks. This not only tested the model's understanding but also its application skills.</p>
</li>
<li><p><strong>Careful Decontamination</strong>: Just like how textbooks are revised to remove errors, the team was vigilant in ensuring that their training data didn't contain problems that the model would directly encounter during evaluations, avoiding any undue advantage.</p>
</li>
</ol>
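<p>To illustrate the first point, here is a hypothetical sketch of LLM-assisted quality filtering. The prompt wording, the threshold, and the use of the OpenAI chat API are my own assumptions, not the paper's actual setup:</p>
<pre><code class="lang-python"># Hypothetical quality filter: ask a strong LLM to rate the educational
# value of each code sample, and keep only the high-scoring ones.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def educational_value(code_sample):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Rate the educational value of this code sample for "
                       "a student learning basic coding skills. Reply with "
                       f"a single integer from 0 to 10:\n\n{code_sample}",
        }],
    )
    return int(response.choices[0].message.content.strip())

samples = ["def add(a, b):\n    return a + b"]   # placeholder corpus
keep = [s for s in samples if educational_value(s) &gt;= 7]  # assumed cutoff
</code></pre>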
<h3 id="heading-impact-and-performance-details"><strong>Impact and Performance Details</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1695721847148/26ca084b-bce3-4b14-b339-97735fc0b04c.png" alt class="image--center mx-auto" /></p>
<p>These bars are grouped together to illustrate distinct aspects:</p>
<p><strong>How Hard the Model Works</strong>: Think of it as varying study durations – from a quick skim to an in-depth study session. It ranges from sifting through 26B tokens of training data to a whopping 76B.</p>
<p><strong>Size of the Model</strong>: Models come in different sizes, and these bars also represent models of varying capacity. The spectrum extends from a moderate 350M parameters to a more substantial 1.3B.</p>
<p>Within each of these categories, three unique columns emerge, representing distinct training sets:</p>
<p><strong>The Standard Source (Orange Bars)</strong>: This represents how models fared when trained using regular Python files, commonly sourced from platforms like StackOverflow. Picture this as the outcome of learning from widely available study materials.</p>
<p><strong>Their Unique Mix (Light Green Bars)</strong>: The next column represents models trained on 'CodeTextbook', their tailored dataset. Imagine a curriculum customized for optimal learning, with richer examples and context. (<em>the new dataset curated by the researchers</em>)</p>
<p><strong>Exercises Boost (Dark Green Bars)</strong>: The last set signifies how models did after some extra practice sessions using 'CodeExercises'. Think of it as reinforcing knowledge with exercises after a lesson.</p>
<blockquote>
<p>When the researchers augmented their model with 'CodeExercises', its score skyrocketed to 51% on the identical test. Furthermore, it astounded them with its remarkable coding capabilities. Hence, it's evident that armed with optimal training materials and a touch of practice, smaller models can indeed rival, if not surpass, their larger counterparts!</p>
</blockquote>
<h3 id="heading-overcoming-the-shortcomings-of-dirty-data"><strong>Overcoming the Shortcomings of "Dirty Data"</strong></h3>
<p>The Microsoft Research team noted several challenges with commonly used datasets. Picture this: attempting to learn a complex subject from a textbook filled with fragmented explanations, missing sections, and occasionally, even misinformation.</p>
<p>Many of the coding snippets in these standard datasets weren't ideal for teaching the intricate nuances of algorithmic reasoning. They frequently lacked the necessary context, often presenting oversimplified or trivial examples. It's akin to trying to understand a deep philosophical concept from only a single quote. Moreover, there were instances where these snippets were accompanied by insufficient documentation, leaving the AI model to 'guess' its way forward.</p>
<p>The implications of such "dirty" data are vast. Models trained on flawed datasets might produce unreliable or incorrect outputs. They may struggle to generalize beyond their training, faltering when presented with real-world challenges. In essence, their foundation is shaky, potentially leading to inefficient or even incorrect decision-making.</p>
<h2 id="heading-the-power-of-smaller-finely-tuned-models"><strong>The Power of Smaller, Finely-Tuned Models</strong></h2>
<p>In our quest for AI supremacy, we've often assumed that larger models are the answer. However, there are pitfalls: they require more resources, have a larger carbon footprint, and can sometimes be like using a sledgehammer to crack a nut. The success of the phi-1 model from "Textbooks Are All You Need" serves as a testament to the prowess of smaller, well-tuned models.</p>
<p>I invite you to read the <a target="_blank" href="https://arxiv.org/pdf/2306.11644.pdf">full paper</a> if you want more granular details beyond the summary I shared here.</p>
<h2 id="heading-crafting-a-conscious-ai-future"><strong>Crafting a Conscious AI Future</strong></h2>
<p>The rapid growth of AI brings forth ethical challenges, from accountability to potential biases. Imagine AI as a student. We wouldn't want this student to cram from every source blindly; rather, we'd prefer it to learn from the most refined, "textbook quality" materials. It's not just about getting smarter but about learning responsibly.</p>
<p>We're in an era where 'bigger is better' is often the mantra. But with the findings from the Microsoft Research team, there's a twist to the tale. Size isn't the only metric of success. Smaller models, like their newly introduced phi-1, can achieve remarkable feats without devouring our planet's precious resources. This matters all the more as AI takes on more roles in our daily lives and we come to depend on it more and more.</p>
<p>By ensuring our AI gets its knowledge from the best sources, we're taking a step towards responsible, ethical, and, most importantly, beneficial AI for all. And if adopting smaller models that don't drain huge computational resources can get us this far, then perhaps we should make "smaller is smarter" our new mantra.</p>
]]></content:encoded></item><item><title><![CDATA[How We Built Casia: A Deep Dive into Our AI Plant Care Assistant]]></title><description><![CDATA[I've always had a taste for fresh, leafy greens in my salads. Yet, as I embarked on my urban farming journey, I often found myself disappointed by the limited variety of fresh produce available in my local market. The store-bought options either lack...]]></description><link>https://farisology.com/how-we-built-casia-a-deep-dive-into-our-ai-plant-care-assistant</link><guid isPermaLink="true">https://farisology.com/how-we-built-casia-a-deep-dive-into-our-ai-plant-care-assistant</guid><category><![CDATA[llm]]></category><category><![CDATA[openai]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Python]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sat, 19 Aug 2023 12:52:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/zcVArTF8Frs/upload/315d6bc6d38f78a006eac9c62c249ba5.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've always had a taste for fresh, leafy greens in my salads. Yet, as I embarked on my urban farming journey, I often found myself disappointed by the limited variety of fresh produce available in my local market. The store-bought options either lacked the freshness I craved or were just unavailable. It was on one of these fruitless shopping trips that I realized growing my own vegetables at home was the only way to enjoy the robust and varied salads I had in mind.</p>
<p>As I started cultivating my own urban garden, I found myself facing a steep learning curve. There were times when I struggled to identify what my plants needed, and my inexperience made it difficult to find effective solutions. It was during these moments of uncertainty that I realized the value of having an accessible, reliable source of information. My mind drifted back to the remarkable capabilities of large language models, and an idea began to take root.</p>
<h2 id="heading-the-spark">The Spark</h2>
<p>What if I could create an AI plant care assistant that could make the expertise and knowledge required for successful gardening available to anyone? Inspired by the challenges I faced in my own garden and the potential of AI to revolutionize plant care, I set out to create Casia. I imagined a world where the power of AI could democratize gardening, making it more accessible and enjoyable for everyone. My personal struggles with finding fresh vegetables for my salad were replaced by a sense of purpose, knowing that I was building a tool that could help countless urban farmers and plant enthusiasts nurture their green spaces with confidence.</p>
<h2 id="heading-the-hackathon">The hackathon</h2>
<p>When I heard about the Pinecone Hackathon, a spark of excitement ignited within me. I realized that this competition could be the perfect platform to build an AI-powered tool that is genuinely useful. One of the requirements was to build a web or phone application, so I went to talk to a couple of friends, and they loved the idea of an AI-powered plant care assistant. This gave me the spirit to register my newly founded team: 100% remote and just wanting to build something real.</p>
<p>The Pinecone Hackathon offered me not only the opportunity to create my AI assistant but also the resources to do so. It provided access to AI tools like OpenAI, Cohere, and even Hugging Face, plus AWS credits and a supportive community of like-minded individuals. This environment was a stark contrast to my solitary journey pondering LLM-powered solutions. As part of the hackathon, I could collaborate with others who shared my enthusiasm for AI and my commitment to sustainability. This was confirmed when a few individuals reached out to me after the hackathon expressing their love for Casia.</p>
<h2 id="heading-building-casia">Building Casia</h2>
<p>Building Casia required multiple skills and several stages to make this vision a reality. I will explain here what we built and exactly how we built it. The core concept on which Casia was built is called <a target="_blank" href="https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html">Retrieval Augmented Generation</a>, or RAG. This type of system expands the knowledge of the LLM and enhances it with specific types of data. In our case, the expectation is that we reduce confabulations and increase accuracy by providing additional knowledge to the model. This data is structured and sourced from various databases, forums, and other resources on urban gardening and related topics.</p>
<h3 id="heading-retrieval-augmented-generationhttpsdocsawsamazoncomsagemakerlatestdgjumpstart-foundation-models-customize-raghtml-rag"><a target="_blank" href="https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html">Retrieval Augmented Generation</a> (RAG)</h3>
<p>Foundational models like GPT-3/4 and others are trained offline, and they are agnostic to any data or domain created after the model was trained. If you want the model to be robust for a specific problem domain, you have two options. Fine-tuning, however, is expensive, requires expertise, and is not the right solution here; fine-tuning is the way to go when you want the model to acquire a new skill. In our case, we want the model to have more knowledge, and that can be done by building a RAG system.</p>
<p>I gathered data from various resources and did my best to ensure it provided valuable plant care context, along with information on plant names in case a plant is called by different names (like a scientific name and a common name). These chunks of data contain the following pieces about each plant (a sketch of such a record follows the list):</p>
<ol>
<li><p>Description: names and varieties of the plant if any</p>
</li>
<li><p>Suitable weather: the origin of the plant, what weather and temps for it to thrive</p>
</li>
<li><p>Disease: any known diseases and how to prevent them</p>
</li>
<li><p>Care: watering and light instructions to grow a healthy plant</p>
</li>
</ol>
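<p>As an illustration, a single plant record could look like this before being flattened into one text chunk for embedding (the field names and values are hypothetical):</p>
<pre><code class="lang-python"># Hypothetical plant record; flatten_record turns it into a single text
# chunk that can be embedded and stored in the vector database.
record = {
    "description": "Aloe vera, also known as true aloe; a succulent species.",
    "suitable_weather": "Native to arid climates; thrives between 13 and 27 C.",
    "disease": "Prone to root rot; prevent by avoiding overwatering.",
    "care": "Water deeply but infrequently; bright, indirect light.",
}

def flatten_record(rec):
    return "\n".join(f"{field}: {text}" for field, text in rec.items())

print(flatten_record(record))
</code></pre>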
<p>Different platforms present their data in different ways, and we tried our best to make these four points present so that our model shares sound advice later on.</p>
<h3 id="heading-creating-the-knowledge-base">Creating the knowledge base</h3>
<p>To create a RAG system you need two components: an embedding model and a vector database. Let's explain embedding first. At the core of the RAG system we employ a technique called semantic search, and this is what requires the embedding.</p>
<p>Embedding is a process in which our text data is converted into a vector, a mathematical representation of the piece of text. Modern representations are robust enough to give us insight into the semantics of the text, a level above the language syntax. That is to say, having the vector gives us the ability to understand the meaning of the text at a contextual level rather than as a literal reading of the words. This is the power of semantic search.</p>
<p><img src="https://cdn.sanity.io/images/vr8gru94/production/e016bbd4d7d57ff27e261adf1e254d2d3c609aac-2447x849.png" alt="embedding example diagram where piece of text when passed to the model will result in an output in the form of a vector." class="image--center mx-auto" /></p>
<p>How is this going to help in our application?</p>
<p>Imagine someone searching for care tips about a plant that belongs to the genus Aloe (the genus that Aloe vera belongs to). With text matching you might not be able to infer systematically that the user is inquiring about some Aloe vera cousin. With semantic search, however, we can hopefully infer this relationship (and others, if they exist) systematically, and present all of this information to our model before it answers the user's question.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1692419624601/19b226c7-c7d8-415c-8daf-6b9151b54cfb.png" alt class="image--center mx-auto" /></p>
<p>This diagram represents our processing, from collecting the plant information until we have a knowledge base created and stored in the vector database. This hackathon is by <a target="_blank" href="https://www.pinecone.io">Pinecone</a>, which is the vector database we had to use as the core technology.</p>
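<p>A condensed sketch of this indexing step, using the current OpenAI and Pinecone Python clients (the index name, embedding model, and record IDs are illustrative, and these client APIs have changed since the hackathon):</p>
<pre><code class="lang-python"># Embed each plant chunk and upsert it into the Pinecone index.
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                       # assumes OPENAI_API_KEY is set
pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder key
index = pc.Index("casia-plants")        # illustrative, pre-created index

chunks = {
    "aloe-vera": "Aloe vera. Care: water deeply but infrequently; "
                 "bright, indirect light. Prone to root rot.",
}

def embed(text):
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return result.data[0].embedding

index.upsert(vectors=[(chunk_id, embed(text), {"text": text})
                      for chunk_id, text in chunks.items()])
</code></pre>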
<h3 id="heading-retrieval">Retrieval</h3>
<p>The R in RAG stands for retrieval, and I am sure you want to know how this works at a conceptual level. Earlier we learned that the core of the RAG system is the semantic search technique. In operation, we approach question answering as a search problem: when a user asks a question, we search the vector DB for information that is semantically relevant to the question. This information is then fed to the GPT-4 model along with the user's question to generate the answer, hence the name retrieval-augmented generation. Here is a diagram of this process:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1692421064910/75e8af86-f676-453a-8cea-21687dd17b8b.png" alt class="image--center mx-auto" /></p>
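<p>The same flow in code, continuing the indexing sketch above (the prompt wording and the <code>top_k</code> value of 3 are assumptions):</p>
<pre><code class="lang-python"># Retrieve the top matches for the user question, then let the LLM answer
# using only the retrieved context.
def ask_casia(question):
    results = index.query(vector=embed(question), top_k=3, include_metadata=True)
    context = "\n---\n".join(m.metadata["text"] for m in results.matches)
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a plant care assistant. "
                                          "Answer using the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

print(ask_casia("How often should I water my aloe?"))
</code></pre>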
<h3 id="heading-application">Application</h3>
<p>Our mobile application was built by my friend <a target="_blank" href="https://github.com/AbdulmalekAlshugaa"><strong>ABDULMALEK AL-SHUGAA</strong></a> using <a target="_blank" href="https://expo.dev/client">Expo Go</a>, with <a target="_blank" href="https://github.com/AbdulmalekAlshugaa">Aref Asaket</a> on the backend. All the core components related to the RAG system are built into endpoints using <a target="_blank" href="https://fastapi.tiangolo.com">FastAPI</a>, which make up our AI microservice that is integrated into the backend.</p>
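<p>For instance, the answering function from the retrieval sketch could be exposed as an endpoint roughly like this (the route name and request schema are illustrative):</p>
<pre><code class="lang-python"># Minimal FastAPI wrapper around the RAG answering function.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Casia AI microservice")

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(question: Question):
    return {"answer": ask_casia(question.text)}  # ask_casia from the sketch above

# Run locally with: uvicorn main:app --reload
</code></pre>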
<h2 id="heading-challenges">Challenges</h2>
<p>Building Casia wasn't always smooth sailing. Like any significant endeavour, it came with its fair share of struggles and obstacles. From technical hurdles to moments of doubt, the journey was full of unexpected challenges.</p>
<p>One of the most significant technical challenges I encountered was ensuring that the <strong>AI model could accurately recognize and classify a wide range of plant species</strong>. Differentiating between various plants required a substantial amount of data, which wasn't always easy to come by. As mentioned in the RAG system part, I had to devise creative methods for data collection, including crowdsourcing and web scraping. But all of this data might not come in the right format or shape, or include the information that is needed.</p>
<p><strong>Evaluating the system performance</strong> was another issue that is hard to solve, at least with the tools and the problem we had at hand. Merely looking at how the system answers 3 random questions doesn't constitute statistical significance. However, we had to take our system at face value by observing how it performs on a few questions.</p>
<p>Apart from the technical challenges, I also experienced <strong>moments of doubt</strong> during the hackathon. The time constraints and pressure of the competition made it challenging to stay focused on the end goal. There were moments when I wondered if the effort was worth it, or if I was pursuing the right idea.</p>
<p><strong>Cost and token size were additional concerns</strong>. Every time a user sent even a "Hi" message, our system would go into full gear doing the RAG thing and semantic searching. This was funny and completely unnecessary; we shouldn't burden our system to RAG the hell out of every message. We solved this by using a separate Cohere model dedicated to answering short messages like this. It gave us a simple rule: only unleash the RAG power when the question is more than a few tokens in size. Not a perfect approach, but it works (a sketch of this gate follows below).</p>
<p>As for the token size, we had to shorten our chunks of data when building the knowledge base. In addition, we limited the use of the retrieved data to the top 3 results only. If you guessed we used the 16k-token window of the OpenAI GPT-3.5 model, you are right; we used that version of the model after spending a lot of time unaware of its existence.</p>
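<p>The gate itself can be as simple as a length check; the threshold and the <code>small_model_reply</code> helper here are hypothetical:</p>
<pre><code class="lang-python"># Route trivial messages to a cheap model; run full RAG only for real questions.
SHORT_MESSAGE_THRESHOLD = 6  # assumed cutoff, in whitespace-split tokens

def handle_message(message):
    if len(message.split()) &lt; SHORT_MESSAGE_THRESHOLD:
        return small_model_reply(message)  # hypothetical cheap-model call
    return ask_casia(message)              # full RAG pipeline from above
</code></pre>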
<h2 id="heading-conclusion">Conclusion</h2>
<p>I'm tremendously excited about the potential of projects like Casia to revolutionize the world of plant care and other domains. By democratizing knowledge and harnessing the power of AI, we can make plant care accessible to everyone, regardless of their background or experience. Imagine a world where anyone can easily grow their own food, create beautiful gardens, or simply enjoy the company of healthy houseplants without the steep learning curve that often accompanies these endeavours.</p>
<p>The potential of AI to bring positive change goes beyond plant care. In education, healthcare, and other critical sectors, AI can democratize access to information and resources, enabling more people to lead fulfilling lives. With projects like Casia, we're taking steps toward a more equitable, sustainable, and connected world where technology serves as a tool for empowerment and progress.</p>
<p>As for the future of Casia, unfortunately, I have not planned anything beyond the hackathon. I am open to any initiatives that would adopt my simple system, take it to serve more audiences, and perhaps grow it beyond my initial vision.</p>
<hr />
<p>Thank you for taking the time to read; I really appreciate you spending your time on this. Feel free to share the post with your friends.</p>
<p><em>If you want to know more feel free to comment and ask away. Also here is the repository if you want to look into our</em> <a target="_blank" href="https://github.com/farisology/Casia"><em>code for the RAG system</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Getting the Most Out of MLOps with ZenML: 4]]></title><description><![CDATA[Recap
ZenML Steps:

ZenML steps are the building blocks of ML workflows.

Each step represents a specific task or transformation in the ML pipeline.

Steps can include tasks like importing data, preprocessing data, training models, or evaluating mode...]]></description><link>https://farisology.com/getting-the-most-out-of-mlops-with-zenml-4</link><guid isPermaLink="true">https://farisology.com/getting-the-most-out-of-mlops-with-zenml-4</guid><category><![CDATA[mlops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sat, 17 Jun 2023 16:21:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/KnK98ScsZbU/upload/3973d981068e3febe66eaed7b627c4f2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-recap">Recap</h2>
<p><strong>ZenML Steps:</strong></p>
<ul>
<li><p>ZenML steps are the building blocks of ML workflows.</p>
</li>
<li><p>Each step represents a specific task or transformation in the ML pipeline.</p>
</li>
<li><p>Steps can include tasks like importing data, preprocessing data, training models, or evaluating models.</p>
</li>
<li><p>ZenML provides a set of built-in steps, and you can also create custom steps tailored to your specific needs.</p>
</li>
</ul>
<p><strong>ZenML Pipelines:</strong></p>
<ul>
<li><p>ZenML pipelines define the sequence and dependencies of steps in an ML workflow.</p>
</li>
<li><p>Pipelines connect and orchestrate the steps, creating a cohesive end-to-end process.</p>
</li>
<li><p>Pipelines ensure that the steps are executed in the correct order, with inputs and outputs adequately connected.</p>
</li>
<li><p>ZenML pipelines make creating scalable, reproducible, and maintainable ML workflows easy.</p>
</li>
</ul>
<p><strong>Building ML Workflows with ZenML:</strong></p>
<ul>
<li><p>To construct an ML workflow with ZenML, you define and configure the steps required for your specific task. (see <a target="_blank" href="https://farisology.com/getting-the-most-out-of-mlops-with-zenml-3">article 3</a> in this series)</p>
</li>
<li><p>Each step can be customized with specific parameters and configurations.</p>
</li>
<li><p>Steps can be added, modified, or removed to adapt the workflow as needed.</p>
</li>
<li><p>The sequence of steps and their dependencies are defined in the pipeline, ensuring the proper execution order.</p>
</li>
<li><p>ZenML provides a streamlined way to create, manage, and execute ML workflows, promoting best practices and standardization.</p>
</li>
</ul>
<h2 id="heading-stacks">Stacks</h2>
<p>ZenML stacks are, intuitively, MLOps stacks, which you can think of as the underlying infrastructure you operate your pipeline on. People were building machine learning models into production before the term MLOps existed, and if we look at what they needed to do that, we can think minimally of the following components:</p>
<ul>
<li><p>Orchestrator (Airflow, Kubeflow)</p>
</li>
<li><p>Artefacts store (S3, GCS)</p>
</li>
<li><p>Tracking (MLflow)</p>
</li>
</ul>
<p>That was the minimal setup we used to operate our ML pipelines. In the ZenML context, these three components together make up what you can call a stack. Of course, different use cases might require more components, which you can add to your stack.</p>
<blockquote>
<p>ZenML allows you to create different stacks which empower you to have separate environments for your ML pipelines like local, staging, and prod. This also means you can have stacks that are based on different cloud providers too. The possibilities with ZenML are only limited by your imagination.</p>
</blockquote>
<h2 id="heading-creating-zenml-stack">Creating ZenML Stack</h2>
<p>In the <a target="_blank" href="https://farisology.com/getting-the-most-out-of-mlops-with-zenml-3">previous article</a>, we executed our <em>run.py</em> file, which contains the non-fatty liver pipeline. That execution was operated by the default stack, which ZenML creates for you automatically. The default stack is local and all of its components are local too: the orchestrator and the artefacts store are both local, and if we look at the runs record in our ZenML dashboard we can see that each artefact is stored in a local directory.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685758604506/9168a382-8ac8-4513-bc80-021ed7fd4b22.png" alt class="image--center mx-auto" /></p>
<p>If we click on any of the database symbols and look at the details view, we see that the URI is a local destination. If we used a remote artefacts store like S3, all exported artefacts would be versioned and stored in the S3 bucket.</p>
<p>Let's simulate creating a production stack with the following components (<em>this is more of a simulation, actual prod will require more components depending on your requirements</em>):</p>
<ul>
<li><p>Orchestrator: Airflow</p>
</li>
<li><p>Artefacts store: S3 bucket</p>
</li>
<li><p>Container registry: ECR</p>
</li>
<li><p>Image builder: Local docker</p>
</li>
</ul>
<h3 id="heading-connect-to-server">Connect to server</h3>
<p>Connecting to the server is the first step that we should perform so that all our work is recorded and can be viewed from the ZenML dashboard. We have deployed our ZenML server in the <a target="_blank" href="https://farisology.com/getting-the-most-out-of-mlops-with-zenml-2">second article</a> of this series using a docker-compose file which makes it easy to replicate.</p>
<blockquote>
<p>I prefer that you deploy your ZenML server (to the cloud; you can get <a target="_blank" href="https://m.do.co/c/610f6dd6088b">$200 from digital ocean upon signup</a>) to keep your server and ZenML client on at least two different machines.</p>
</blockquote>
<p>To connect we use the following command (username is admin as per our deployment file, password will be requested):</p>
<pre><code class="lang-bash">zenml connect --url=zenml_deployment:80/ --username=admin
</code></pre>
<h3 id="heading-installing-zenml-integrations">Installing ZenML Integrations</h3>
<p>To create our stack components we need to install some integrations. First, you may want to observe what integrations are pre-installed. Use this command:</p>
<pre><code class="lang-bash">zenml integration list
</code></pre>
<p>This will give you a table with all integrations and an indicator if the integration is installed. To install our required integrations we use the following commands:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># to use S3 as our artefacts store later</span>
zenml integration install s3 -y

<span class="hljs-comment"># to use airflow as our orchestrator</span>
zenml integration install airflow
</code></pre>
<p>So far we have installed the integrations we need, but this doesn't mean ZenML can use them yet. We have to register them as ZenML components.</p>
<h3 id="heading-create-zenml-components">Create ZenML components</h3>
<p>The components we need are the orchestrator, artefacts store, and container registry. We go about the registration using the simple commands as follows:</p>
<pre><code class="lang-bash">zenml orchestrator register &lt;ORCHESTRATOR_NAME&gt; --flavor=airflow --local=False --share
</code></pre>
<p><strong>My orchestrator</strong> is registered as <em>rairflow</em>, that is to say, a remote Airflow. Observe how the local flag is False, and note the <em>share</em> flag, which makes this orchestrator available to my team so they can use it in their own stacks:</p>
<pre><code class="lang-bash">zenml orchestrator register rairflow --flavor=airflow --<span class="hljs-built_in">local</span>=False --share
</code></pre>
<p><strong>My artefacts store</strong> is registered like this:</p>
<pre><code class="lang-bash">zenml artifact-store register s3_store -f s3 --path=s3://zenml-mlops --share
</code></pre>
<p><strong>My container registry</strong> is registered like this:</p>
<pre><code class="lang-bash">zenml container-registry register aws_ecr --flavor=aws --uri=&lt;REGISTRY_URI&gt; --share
</code></pre>
<blockquote>
<p><em>You should name your ECR repository zenml; that just makes things easier for you while following our steps here.</em></p>
</blockquote>
<p><strong>Local image builder</strong> is registered like this:</p>
<pre><code class="lang-bash">zenml image-builder register local_imb --flavor=<span class="hljs-built_in">local</span> --share
</code></pre>
<h3 id="heading-register-zenml-stack">Register ZenML stack</h3>
<p>Now we have all the components and integrations we need to create our stack. The process is simple: we need a name for the stack, and the rest should be familiar to you by now:</p>
<pre><code class="lang-bash">zenml stack register remote_flow_stack --orchestrator rairflow \
        --artifact-store s3_store \
        -c aws_ecr \
        -i local_imb --share
</code></pre>
<p>That registers our stack, and we can see it added to our ZenML server dashboard.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685769488631/faaca2f4-3701-46ee-911c-5f5150de14a8.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-activating-zenml-stack">Activating ZenML Stack</h2>
<p>Our stack is registered, but it's not yet the active stack, and there are two more details to sort out to get this stack working.</p>
<p><strong>Login for AWS ECR:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Fill your REGISTRY_URI and REGION in the placeholders in the following command.</span>
<span class="hljs-comment"># You can find the REGION as part of your REGISTRY_URI: `&lt;ACCOUNT_ID&gt;.dkr.ecr.&lt;REGION&gt;.amazonaws.com`</span>
aws ecr get-login-password --region &lt;REGION&gt; | docker login --username AWS --password-stdin &lt;REGISTRY_URI&gt;
</code></pre>
<p><strong>Activation of our stack</strong> This step is important so our ZenML client can push the image of our pipeline into ECR. Airflow will then run a DAG based on that image. Now let's activate our stack:</p>
<pre><code class="lang-bash">zenml stack <span class="hljs-built_in">set</span> remote_flow_stack
</code></pre>
<p><strong>Describe the stack</strong> Maybe you want to look at the configurations of a certain stack. This can be done in the ZenML client simply by using the describe command:</p>
<pre><code class="lang-bash">zenml stack describe remote_flow_stack
</code></pre>
<p>For our current stack, it looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1687016751168/6f20631c-3767-4708-b2b0-1c470b0cd856.png" alt class="image--center mx-auto" /></p>
<p>That's pretty much it. You can see what stacks are in your environment using the following command:</p>
<pre><code class="lang-bash">zenml stack list
</code></pre>
<hr />
<h2 id="heading-running-our-pipeline-in-the-remote-stack">Running our pipeline in the remote stack</h2>
<p>Our Python file that operates the pipeline looks like this (refer to the code in our <a target="_blank" href="https://farisology.com/getting-the-most-out-of-mlops-with-zenml-3">third article</a>):</p>
<pre><code class="lang-python">from zenml.config.schedule import Schedule
from pipelines.nfld_pipeline import training_nfld_model
from steps.nfld_steps import (import_data, preprocess_data,
                              training_SVC, training_dct
                              )


def main():
    # init and run the nfdl classifier training pipeline
    run_nfdl_training = training_nfld_model(
        import_data=import_data(),
        preprocess_data=preprocess_data(),
        training_SVC=training_SVC(),
        training_dct=training_dct()
    )

    run_nfdl_training.run()


if __name__ == "__main__":
    main()
</code></pre>
<p>We need to modify it a bit to suit the remote stack this pipeline should run on. The modifications are minor and the new file (<em>I create a new Python module just to keep them clean</em>) looks like this:</p>
<pre><code class="lang-python">import os
from zenml.config.schedule import Schedule
from pipelines.nfld_pipeline import training_nfld_model
from zenml.integrations.airflow.flavors.airflow_orchestrator_flavor import AirflowOrchestratorSettings
from steps.nfld_steps import (import_data, preprocess_data,
                              training_SVC, training_dct
                              )

airflow_settings = AirflowOrchestratorSettings(
    operator="docker",  # or "kubernetes_pod"
    dag_output_dir=f"{os.getcwd()}/zipped_pipelines",  # use your own path to the zipped_pipelines directory
    dag_id="non_fatty_liver_training",
    dag_tags=["zenml", "MLOPS", "Training"],
)


def main():
    # init and run the nfdl classifier training pipeline
    schedule = Schedule(cron_expression="5-15 * * * *")
    run_nfdl_training = training_nfld_model(
        import_data=import_data(),
        preprocess_data=preprocess_data(),
        training_SVC=training_SVC(),
        training_dct=training_dct()
    )

    run_nfdl_training.run(settings={"orchestrator.airflow": airflow_settings}, schedule=schedule)


if __name__ == "__main__":
    main()
</code></pre>
<p>The <em>airflow_settings</em> object helps us set multiple parameters for our Airflow DAG, and we added a cron expression to set the running interval. Now, if you run this file, the output will not be an execution of the code. The output of running this file - <em>with our remote stack being the active stack</em> - will be an image in our AWS ECR plus a zip file in our zipped_pipelines directory.</p>
<blockquote>
<p>Using GitOps, you could run the file in a CI pipeline and transfer this zip file to the remote Airflow of your deployment. That is how it should be done, and you have the freedom to orchestrate this with your preferred method.</p>
</blockquote>
<hr />
<h2 id="heading-running-our-pipeline-in-a-local-stack">Running our pipeline in a local stack</h2>
<p>To run locally we need to create a local stack, and this should be simple by now; we just need to do a few things differently. The steps, along with the commands, are as follows:</p>
<p><strong>Install the docker operator for airflow:</strong></p>
<pre><code class="lang-bash">pip install apache-airflow-providers-docker
</code></pre>
<p>We need to <strong>register a local airflow orchestrator</strong> because we want to run this locally:</p>
<pre><code class="lang-bash">zenml orchestrator register lairflow --flavor=airflow --<span class="hljs-built_in">local</span>=True --share
</code></pre>
<p><strong>Register a stack</strong> using our local airflow orchestrator and the same components we used before for the artefacts store, image registry, etc.:</p>
<pre><code class="lang-bash">zenml stack register local_flow_stack --orchestrator lairflow --artifact-store s3_store -c aws_ecr -i local_imb --share
</code></pre>
<p>Now that this stack is registered, you should <strong>set it as the active stack</strong>:</p>
<pre><code class="lang-bash">zenml stack <span class="hljs-built_in">set</span> local_flow_stack
</code></pre>
<p>Now you can <strong>provision the stack</strong> which will spin up a local airflow for you:</p>
<pre><code class="lang-bash">zenml stack up
</code></pre>
<p>The output of executing this command will give you credentials to log in to the Airflow UI. Use the generated credentials to log in, and observe how your pipeline is executed locally. This is helpful when you are debugging or testing your pipeline before deploying it to run on the production stack.</p>
<blockquote>
<p>This local Airflow stack provisioning will be deprecated in the future according to a new ZenML update. Be aware that in the near future, running a local Airflow orchestrator will be different.</p>
</blockquote>
<p><strong>Running the pipeline locally</strong> is an easy process (similar to running on the remote stack), with minor changes to the run script. The output will be a DAG file stored in your machine's local Airflow dags directory. You can then trigger your DAG from the Airflow UI. The run script differs as follows:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> zenml.config.schedule <span class="hljs-keyword">import</span> Schedule
<span class="hljs-keyword">from</span> pipelines.nfld_pipeline <span class="hljs-keyword">import</span> training_nfld_model
<span class="hljs-keyword">from</span> zenml.integrations.airflow.flavors.airflow_orchestrator_flavor <span class="hljs-keyword">import</span> AirflowOrchestratorSettings
<span class="hljs-keyword">from</span> steps.nfld_steps <span class="hljs-keyword">import</span> (import_data, preprocess_data,
                              training_SVC, training_dct
                              )

airflow_settings = AirflowOrchestratorSettings(
    operator=<span class="hljs-string">"docker"</span>,  <span class="hljs-comment"># or "kubernetes_pod"</span>
    dag_id=<span class="hljs-string">"non_fatty_liver_training"</span>,
    dag_tags=[<span class="hljs-string">"zenml"</span>, <span class="hljs-string">"MLOPS"</span>, <span class="hljs-string">"Training"</span>],
)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-comment"># init and run the nfdl classifier training pipeline</span>
    schedule = Schedule(cron_expression=<span class="hljs-string">"5-15 * * * *"</span>)
    run_nfdl_training = training_nfld_model(
        import_data=import_data(),
        preprocess_data=preprocess_data(),
        training_SVC=training_SVC(),
        training_dct=training_dct()
    )

    run_nfdl_training.run(settings={<span class="hljs-string">"orchestrator.airflow"</span>: airflow_settings}, schedule=schedule)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<blockquote>
<p>The only difference in this script is that the line inside the AirflowOrchestratorSettings that specifies the DAG output directory has been removed. It's not helpful for our local Airflow stack, and we want the DAG to be generated into its respective directory.</p>
</blockquote>
<p><strong>Bring stack down</strong></p>
<p>At the end of testing or debugging our stack, we want to take it down. This should be done in a way that removes everything that was created to provision the stack to avoid issues when you come to test a pipeline locally in the future. You can bring it down with the following command:</p>
<pre><code class="lang-bash">zenml stack down -f
</code></pre>
<h2 id="heading-frequent-errors">Frequent errors</h2>
<ol>
<li><p>AWS ECR login might cause you issues; the authorization token expires periodically, so be prepared to re-authenticate (for example with <code>aws ecr get-login-password</code>) when pushes or pulls start failing.</p>
</li>
<li><p>Server connection: disconnect from the server when you are done and reconnect when you want to start working with ZenML again (<code>zenml disconnect</code> and <code>zenml connect</code>). Leaving the connection active might cause some issues.</p>
</li>
</ol>
<h2 id="heading-repository">Repository</h2>
<p>All the <a target="_blank" href="https://github.com/farisology/MLOps-ZenML">source code</a> for this series is made publicly available here.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>ZenML proves to be a game-changer in managing the MLOps stack, offering a seamless experience with its local and remote stack creation capabilities. Although the stacks I created lean towards familiar components like Airflow for orchestration and the AWS cloud for resources, it's important to note that ZenML offers the flexibility to tailor the stack to individual preferences. This article serves as a demonstration of ZenML's potential through a demo project, but it's worth highlighting that ZenML is a dynamic framework, continuously evolving with frequent updates and improvements.</p>
]]></content:encoded></item><item><title><![CDATA[Getting the Most Out of MLOps with ZenML: 3]]></title><description><![CDATA[Intro

In the previous post, we have gone through deploying the ZenML server. The ZenML server is where we can view the dashboard and monitor our pipeline and previous runs with a nice visual graph representing the pipeline. Then we looked into using...]]></description><link>https://farisology.com/getting-the-most-out-of-mlops-with-zenml-3</link><guid isPermaLink="true">https://farisology.com/getting-the-most-out-of-mlops-with-zenml-3</guid><category><![CDATA[Devops]]></category><category><![CDATA[mlops]]></category><category><![CDATA[Python]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sun, 21 May 2023 14:29:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/rzqjQjGvOBQ/upload/7ee7a062cc4da7025e5b9562697f6389.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro">Intro</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1684558114198/a89e6b48-2d62-474d-8896-04c41e9f41ba.avif" alt class="image--center mx-auto" /></p>
<p>In the <a target="_blank" href="https://farisology.com/zenml-server-and-client-cli">previous post</a>, we went through deploying the ZenML server. The ZenML server is where we can view the dashboard and monitor our pipelines and previous runs, with a nice visual graph representing each pipeline. Then we looked into using the CLI to connect to the server. In this post, we will create a simple pipeline and learn how to build pipelines in ZenML.</p>
<h2 id="heading-pipelines">Pipelines</h2>
<p>This is an intuitive concept for any data scientist working with ML. We sometimes call them workflows or DAGs if we adopt the Airflow terminology. In ZenML a pipeline is your ML workflow and is built from smaller blocks called steps.</p>
<h3 id="heading-step">Step</h3>
<p>A step is the equivalent of a task: a function that does one thing. This approach allows you to build decoupled and modular components (read: tasks) for your ML workflow (read: pipeline). You are then empowered to write portable, modular code once, and moving from experimentation to production becomes frictionless.</p>
<blockquote>
<p>"I choose ZenML to build beyond mere pipelines. With ZenML's modular approach, standardizing ML practices for your team becomes effortless. Enjoy simplified maintenance, streamlined development, and embrace software engineering concepts like test-driven development, portable code, and SOLID principles." -Fares Hasan</p>
</blockquote>
<h2 id="heading-fatty-liver-classification-use-case">Fatty liver classification use-case</h2>
<p>So let's build a use case to go through all the concepts and make them clear. If we want to build such a classifier, we can think of typical steps that go like this:</p>
<ul>
<li><p>Importing data</p>
</li>
<li><p>Preprocessing (handling duplicates, and missing data)</p>
</li>
<li><p>Train a model</p>
</li>
</ul>
<p>These are steps, and all of them together make up our pipeline, which we can build in ZenML. We are keeping this simple, but you get the idea. In some use cases, you might call this the training or modelling pipeline because all it does is train a model at the end. To implement and follow along with this part, make sure you have the following:</p>
<ul>
<li><p>create a repository or directory to store the code</p>
</li>
<li><p>create a virtual environment (Python version 3.7 or above)</p>
</li>
<li><p>install a ZenML client with a version identical to the ZenML server version (version 0.35.1 is used throughout this series of articles)</p>
</li>
<li><pre><code class="lang-python">  <span class="hljs-comment"># connect to zenml server:</span>
  zenml connect --url=http://my-server-address:<span class="hljs-number">8080</span>/ --username=default
</code></pre>
</li>
</ul>
<h3 id="heading-dataset">Dataset</h3>
<p>You can download the dataset from Kaggle <a target="_blank" href="https://www.kaggle.com/datasets/utkarshx27/non-alcohol-fatty-liver-disease">here</a> and we will use the nafld1 file which contains 17.5k rows and 10 columns. Here is a brief catalogue explaining the columns:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Column</td><td>Definition</td></tr>
</thead>
<tbody>
<tr>
<td>id</td><td>subject identifier</td></tr>
<tr>
<td>age</td><td>age at entry to the study</td></tr>
<tr>
<td>male</td><td>0=female, 1=male</td></tr>
<tr>
<td>weight</td><td>weight in kg</td></tr>
<tr>
<td>height</td><td>height in cm</td></tr>
<tr>
<td>bmi</td><td>body mass index</td></tr>
<tr>
<td>case.id</td><td>the id of the NAFLD case to whom this subject is matched</td></tr>
<tr>
<td>futime</td><td>time to death or last follow-up</td></tr>
<tr>
<td>status</td><td>0= alive at last follow-up, 1=dead</td></tr>
</tbody>
</table>
</div><h3 id="heading-steps">Steps</h3>
<p>We outlined above a plan of three simple steps to build a classifier for fatty liver detection. In the first step, we want to import the data; in this case, our data is just the CSV file we downloaded from Kaggle.</p>
<p><strong>Importing data</strong></p>
<p>This will be just a simple step that uses pandas to read the CSV file and return a dataframe. The typical code will look like this:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">import_data</span>():</span>
     df = pd.read_csv(<span class="hljs-string">"relative_path_to_file"</span>)
     <span class="hljs-keyword">return</span> df
</code></pre>
<p>Now that's a simple Python function; to turn it into a ZenML step we can use the <code>@step</code> decorator, so it looks more like this:</p>
<pre><code class="lang-python"><span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">import_data</span>():</span>
     df = pd.read_csv(<span class="hljs-string">"relative_path_to_file"</span>)
     <span class="hljs-keyword">return</span> df
</code></pre>
<p>But that is not good enough. We can improve it by using type annotations for the inputs and outputs, and we will have a cleaner function that looks better:</p>
<pre><code class="lang-python"><span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">import_data</span>() -&gt; Output(dataset = pd.DataFrame):</span>
     df = pd.read_csv(<span class="hljs-string">"relative_path_to_file"</span>)
     <span class="hljs-keyword">return</span> df
</code></pre>
<p>Now that you have seen how to write these methods nicely, I hope you keep it up with the rest of the steps we will code together. One thing to keep in mind is that when we code the steps, we treat them as separate Python functions with annotated inputs and outputs. In a later phase, we stitch them back together in a pipeline.</p>
<p>Another thought that might come to your mind is how to arrange the steps. You can go about this in any way you want; in my implementation here, I keep all the steps in one Python module, calling it, for example, <em>nfld_steps</em>, but that is more of a personal preference. You can see my project file structure <a target="_blank" href="https://github.com/farisology/MLOps-ZenML/tree/main/fatty_liver">here</a>. My approach in this small demo project goes like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">fatty_liver:</span>
    <span class="hljs-attr">pipelines:</span>
        <span class="hljs-string">__init__.py</span>
        <span class="hljs-string">nfld_pipeline.py</span>
    <span class="hljs-attr">steps:</span>
        <span class="hljs-string">__init__.py</span>
        <span class="hljs-string">nfld_steps.py</span>

    <span class="hljs-string">nafdl1.csv</span>
    <span class="hljs-string">nlfd_run.py</span>
</code></pre>
<p>Notice where I placed the CSV file for ease of use; in a real project, you should read from a proper data source.</p>
<p><strong>Preprocessing data</strong></p>
<p>There are a few things that we can do to improve this dataset to simulate real-life scenarios. Here are the steps we will perform in the preprocessing phase:</p>
<ol>
<li><p>Remove redundant columns, such as the subject and case identifier columns</p>
</li>
<li><p>Impute the missing values using KNNImputer</p>
</li>
<li><p>Return the labels and features</p>
</li>
</ol>
<p>Our preprocessing step should look like this:</p>
<pre><code class="lang-python"><span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_data</span>(<span class="hljs-params">dataset: pd.DataFrame</span>) -&gt; Output(features=np.ndarray, labels=pd.core.series.Series):</span>
    <span class="hljs-comment"># remove the redundant columns</span>
    data = dataset.drop([<span class="hljs-string">'id'</span>, <span class="hljs-string">'Unnamed: 0'</span>, <span class="hljs-string">'case.id'</span>], axis=<span class="hljs-number">1</span>)
    labels = data.pop(<span class="hljs-string">'status'</span>)

    <span class="hljs-comment"># impute the missing values using the KNNImputer</span>
    imputer = KNNImputer(n_neighbors=<span class="hljs-number">2</span>, weights=<span class="hljs-string">"uniform"</span>)
    imputer = imputer.fit(data)
    features = imputer.transform(data)

    <span class="hljs-keyword">return</span> features, labels
</code></pre>
<p><strong>Training models</strong></p>
<p>In real-life scenarios, data scientists run multiple experiments, but in our little example here we will just train two classifiers, with a training step taking care of each model as follows:</p>
<p><strong>Support Vector Classifier</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_SVC</span>(<span class="hljs-params">features: np.ndarray, labels: pd.core.series.Series</span>) -&gt; SVC:</span>
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=<span class="hljs-number">0.2</span>)
    svc_model = SVC()
    svc_model.fit(x_train, y_train)

    y_pred = svc_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    <span class="hljs-keyword">return</span> svc_model
</code></pre>
<p><strong>Decision Tree Classifier</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_dct</span>(<span class="hljs-params">features: np.ndarray, labels: pd.core.series.Series</span>) -&gt; DecisionTreeClassifier:</span>
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=<span class="hljs-number">0.2</span>)
    dct_model = DecisionTreeClassifier()
    dct_model.fit(x_train, y_train)

    y_pred = dct_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    <span class="hljs-keyword">return</span> dct_model
</code></pre>
<p>Great, now you should have all the steps for this fatty liver use-case in one module if you followed my approach. The <em>nfld_steps.py</em> file should be as follows:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVC
<span class="hljs-keyword">from</span> zenml.steps <span class="hljs-keyword">import</span> step, Output
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> KNNImputer
<span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeClassifier
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> classification_report
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split


<span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">import_data</span>() -&gt; Output(dataset=pd.DataFrame):</span>
    df = pd.read_csv(<span class="hljs-string">"nafld1.csv"</span>)
    <span class="hljs-keyword">return</span> df


<span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_data</span>(<span class="hljs-params">dataset: pd.DataFrame</span>) -&gt; Output(features=np.ndarray, labels=pd.core.series.Series):</span>
    <span class="hljs-comment"># remove the redundant columns</span>
    data = dataset.drop([<span class="hljs-string">'id'</span>, <span class="hljs-string">'Unnamed: 0'</span>, <span class="hljs-string">'case.id'</span>], axis=<span class="hljs-number">1</span>)
    labels = data.pop(<span class="hljs-string">'status'</span>)

    <span class="hljs-comment"># impute the missing values using the KNNImputer</span>
    imputer = KNNImputer(n_neighbors=<span class="hljs-number">2</span>, weights=<span class="hljs-string">"uniform"</span>)
    imputer = imputer.fit(data)
    features = imputer.transform(data)

    <span class="hljs-keyword">return</span> features, labels


<span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_SVC</span>(<span class="hljs-params">features: np.ndarray, labels: pd.core.series.Series</span>) -&gt; SVC:</span>
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=<span class="hljs-number">0.2</span>)
    svc_model = SVC()
    svc_model.fit(x_train, y_train)

    y_pred = svc_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    <span class="hljs-keyword">return</span> svc_model


<span class="hljs-meta">@step</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_dct</span>(<span class="hljs-params">features: np.ndarray, labels: pd.core.series.Series</span>) -&gt; DecisionTreeClassifier:</span>
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=<span class="hljs-number">0.2</span>)
    dct_model = DecisionTreeClassifier()
    dct_model.fit(x_train, y_train)

    y_pred = dct_model.predict(x_test)
    print(classification_report(y_test, y_pred))
    <span class="hljs-keyword">return</span> dct_model
</code></pre>
<h3 id="heading-pipeline">Pipeline</h3>
<p>This is the easy part: we have multiple steps, and a pipeline is just code that expresses which step is executed first and manages the dependencies between steps, for example when step A depends on the output of a previous step. Our pipeline code is as simple as this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> zenml.pipelines <span class="hljs-keyword">import</span> pipeline


<span class="hljs-meta">@pipeline</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_nfld_model</span>(<span class="hljs-params">import_data, preprocess_data, training_SVC, training_dct</span>):</span>
    <span class="hljs-string">"""Training non-fatty liver classifier"""</span>
    alldata = import_data()
    x, y = preprocess_data(alldata)
    svc = training_SVC(x, y)
    dct = training_dct(x, y)
</code></pre>
<p>In this pipeline, you see the execution flow take shape, and our project is almost complete. The next action is to create a simple run script (<em>nlfd_run.py</em> in this project) just to execute the pipeline. It looks like Python code you are familiar with, but it comes with more ZenML power:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pipelines.nfld_pipeline <span class="hljs-keyword">import</span> training_nfld_model
<span class="hljs-keyword">from</span> steps.nfld_steps <span class="hljs-keyword">import</span> (import_data, preprocess_data,
                              training_SVC, training_dct
                              )


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-comment"># init and run the nfdl classifier training pipeline</span>
    run_nfdl_training = training_nfld_model(
        import_data=import_data(),
        preprocess_data=preprocess_data(),
        training_SVC=training_SVC(),
        training_dct=training_dct()
    )

    run_nfdl_training.run()


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>In simple words, this code is running a pipeline for training a classifier model using ZenML. Here's a breakdown of what's happening:</p>
<ul>
<li><p>The code imports the necessary pipeline and step functions from the respective files.</p>
</li>
<li><p>The <code>main()</code> function is defined, which will be the starting point of the program.</p>
</li>
<li><p>Inside the <code>main()</code> function:</p>
<ul>
<li><p>The <code>run_nfdl_training</code> variable is assigned the result of calling the <code>training_nfld_model</code> pipeline function.</p>
</li>
<li><p>The pipeline function takes several steps as arguments, such as importing data, preprocessing data, training Support Vector Classifier (SVC), and training Decision Tree Classifier (dct).</p>
</li>
<li><p>Each step is called as a function to create an instance of that step.</p>
</li>
<li><p>The pipeline function is then executed using the <code>run()</code> method of the <code>run_nfdl_training</code> variable.</p>
</li>
</ul>
</li>
<li><p>Finally, the <code>main()</code> function is called when the program is run.</p>
</li>
</ul>
<h3 id="heading-run">Run</h3>
<p>Running this pipeline works like running any Python module, by simply using this command in the terminal:</p>
<pre><code class="lang-python">python run.py
</code></pre>
<p>The output should look something like this:</p>
<pre><code class="lang-less"><span class="hljs-selector-tag">Registered</span> <span class="hljs-selector-tag">pipeline</span> <span class="hljs-selector-tag">training_nfld_model</span> (version <span class="hljs-number">5</span>).
<span class="hljs-selector-tag">Running</span> <span class="hljs-selector-tag">pipeline</span> <span class="hljs-selector-tag">training_nfld_model</span> <span class="hljs-selector-tag">on</span> <span class="hljs-selector-tag">stack</span> <span class="hljs-selector-tag">default</span> (caching enabled)
<span class="hljs-selector-tag">Step</span> <span class="hljs-selector-tag">import_data</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">started</span>.
<span class="hljs-selector-tag">Using</span> <span class="hljs-selector-tag">cached</span> <span class="hljs-selector-tag">version</span> <span class="hljs-selector-tag">of</span> <span class="hljs-selector-tag">import_data</span>.
<span class="hljs-selector-tag">Step</span> <span class="hljs-selector-tag">preprocess_data</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">started</span>.
<span class="hljs-selector-tag">Using</span> <span class="hljs-selector-tag">cached</span> <span class="hljs-selector-tag">version</span> <span class="hljs-selector-tag">of</span> <span class="hljs-selector-tag">preprocess_data</span>.
<span class="hljs-selector-tag">Step</span> <span class="hljs-selector-tag">training_SVC</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">started</span>.
<span class="hljs-selector-tag">By</span> <span class="hljs-selector-tag">default</span>, <span class="hljs-selector-tag">the</span> <span class="hljs-selector-tag">PandasMaterializer</span> <span class="hljs-selector-tag">stores</span> <span class="hljs-selector-tag">data</span> <span class="hljs-selector-tag">as</span> <span class="hljs-selector-tag">a</span> <span class="hljs-selector-class">.csv</span> <span class="hljs-selector-tag">file</span>. <span class="hljs-selector-tag">If</span> <span class="hljs-selector-tag">you</span> <span class="hljs-selector-tag">want</span> <span class="hljs-selector-tag">to</span> <span class="hljs-selector-tag">store</span> <span class="hljs-selector-tag">data</span> <span class="hljs-selector-tag">more</span> <span class="hljs-selector-tag">efficiently</span>, <span class="hljs-selector-tag">you</span> <span class="hljs-selector-tag">can</span> <span class="hljs-selector-tag">install</span> <span class="hljs-selector-tag">pyarrow</span> <span class="hljs-selector-tag">by</span> <span class="hljs-selector-tag">running</span> '<span class="hljs-selector-tag">pip</span> <span class="hljs-selector-tag">install</span> <span class="hljs-selector-tag">pyarrow</span>'. <span class="hljs-selector-tag">This</span> <span class="hljs-selector-tag">will</span> <span class="hljs-selector-tag">allow</span> <span class="hljs-selector-tag">PandasMaterializer</span> <span class="hljs-selector-tag">to</span> <span class="hljs-selector-tag">automatically</span> <span class="hljs-selector-tag">store</span> <span class="hljs-selector-tag">the</span> <span class="hljs-selector-tag">data</span> <span class="hljs-selector-tag">as</span> <span class="hljs-selector-tag">a</span> <span class="hljs-selector-class">.parquet</span> <span class="hljs-selector-tag">file</span> <span class="hljs-selector-tag">instead</span>.....
<span class="hljs-selector-tag">Step</span> <span class="hljs-selector-tag">training_SVC</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">finished</span> <span class="hljs-selector-tag">in</span> <span class="hljs-selector-tag">0</span><span class="hljs-selector-class">.799s</span>.
<span class="hljs-selector-tag">Step</span> <span class="hljs-selector-tag">training_dct</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">started</span>.
<span class="hljs-selector-tag">By</span> <span class="hljs-selector-tag">default</span>, <span class="hljs-selector-tag">the</span> <span class="hljs-selector-tag">PandasMaterializer</span> <span class="hljs-selector-tag">stores</span> <span class="hljs-selector-tag">data</span> <span class="hljs-selector-tag">as</span> <span class="hljs-selector-tag">a</span> <span class="hljs-selector-class">.csv</span> <span class="hljs-selector-tag">file</span>. <span class="hljs-selector-tag">If</span> <span class="hljs-selector-tag">you</span> <span class="hljs-selector-tag">want</span> <span class="hljs-selector-tag">to</span> <span class="hljs-selector-tag">store</span> <span class="hljs-selector-tag">data</span> <span class="hljs-selector-tag">more</span> <span class="hljs-selector-tag">efficiently</span>, <span class="hljs-selector-tag">you</span> <span class="hljs-selector-tag">can</span> <span class="hljs-selector-tag">install</span> <span class="hljs-selector-tag">pyarrow</span> <span class="hljs-selector-tag">by</span> <span class="hljs-selector-tag">running</span> '<span class="hljs-selector-tag">pip</span> <span class="hljs-selector-tag">install</span> <span class="hljs-selector-tag">pyarrow</span>'. <span class="hljs-selector-tag">This</span> <span class="hljs-selector-tag">will</span> <span class="hljs-selector-tag">allow</span> <span class="hljs-selector-tag">PandasMaterializer</span> <span class="hljs-selector-tag">to</span> <span class="hljs-selector-tag">automatically</span> <span class="hljs-selector-tag">store</span> <span class="hljs-selector-tag">the</span> <span class="hljs-selector-tag">data</span> <span class="hljs-selector-tag">as</span> <span class="hljs-selector-tag">a</span> <span class="hljs-selector-class">.parquet</span> <span class="hljs-selector-tag">file</span> <span class="hljs-selector-tag">instead</span>......
<span class="hljs-selector-tag">Step</span> <span class="hljs-selector-tag">training_dct</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">finished</span> <span class="hljs-selector-tag">in</span> <span class="hljs-selector-tag">0</span><span class="hljs-selector-class">.342s</span>.
<span class="hljs-selector-tag">Pipeline</span> <span class="hljs-selector-tag">run</span> <span class="hljs-selector-tag">training_nfld_model-2023_05_21-13_53_59_355523</span> <span class="hljs-selector-tag">has</span> <span class="hljs-selector-tag">finished</span> <span class="hljs-selector-tag">in</span> <span class="hljs-selector-tag">4</span><span class="hljs-selector-class">.035s</span>......
</code></pre>
<h2 id="heading-summary">Summary</h2>
<p>In summary, ZenML steps are the individual components that perform specific tasks, while ZenML pipelines connect and orchestrate these steps to create ML workflows. By leveraging ZenML's step-based approach and pipeline structure, you can easily build and manage end-to-end ML workflows that are scalable, reproducible, and efficient. In the next post, we will look into the concept of Stack and how we create, operate and orchestrate our ML pipeline in an MLOps Stack.</p>
]]></content:encoded></item><item><title><![CDATA[Getting the Most Out of MLOps with ZenML: 2]]></title><description><![CDATA[Intro

To work with ZenML there are two sides we need to create. If you are an ML engineer the ZenML server deployment would be your task and you will have to do it once. For users like data scientists who will use ZenML and or ML engineers who are b...]]></description><link>https://farisology.com/getting-the-most-out-of-mlops-with-zenml-2</link><guid isPermaLink="true">https://farisology.com/getting-the-most-out-of-mlops-with-zenml-2</guid><category><![CDATA[mlops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Wed, 03 May 2023 17:19:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/UeBFFAPwuj8/upload/ec4d3934d1c9e28a6ad6f9497a5b410e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro">Intro</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682236775045/9bcc27ec-f502-427f-8f89-4432adb78912.png" alt class="image--center mx-auto" /></p>
<p>To work with ZenML, there are two sides we need to set up. If you are an ML engineer, the ZenML server deployment is your task, and you will have to do it once. Users like data scientists, and ML engineers who are building ML workflows, will use the ZenML client. In this article, we will focus first on the ZenML server and how you can deploy it. The diagram above by ZenML illustrates the architecture nicely.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682236762403/f59111b6-146e-420f-b56a-7be6497c9e01.webp" alt class="image--center mx-auto" /></p>
<p>The ZenML server has a dashboard to visualize your entire MLOps ecosystem. There you can see reports on pipeline runs and failures, the registered stacks and components, and information about who on your team owns each pipeline.</p>
<h2 id="heading-how-to-install-it">How to install it?</h2>
<p>You can simply spin up a ZenML server by running the following command:</p>
<pre><code class="lang-bash">docker run -it -d -p 80:80 --name zenml zenmldocker/zenml-server
</code></pre>
<p>This will run a Docker container with a local SQLite database. This is fine to experiment with, but the ZenML team recommends using MySQL for production. For a production-grade setup, you should also improve the command by pinning the ZenML server image version, and it's highly recommended to use the same version for the server and the ZenML client.</p>
<h3 id="heading-docker-deployment">Docker deployment</h3>
<pre><code class="lang-bash">docker run -it -d -p 8080:8080 --name zenml \
    --env ZENML_STORE_URL=mysql://username:password@host/zenmldb \
    zenmldocker/zenml-server:0.35.1
</code></pre>
<h3 id="heading-deployment-with-docker-compose">Deployment with docker compose</h3>
<p>You can run both the ZenML server and MySQL database containers using Docker Compose. Create a file named <mark>docker-compose.yml</mark> and paste the following into it:</p>
<pre><code class="lang-bash">version: <span class="hljs-string">"3.9"</span>

services:
  mysql:
    container_name: mysql_zenml
    restart: always
    image: mysql:8.0
    ports:
      - 3306:3306
    environment:
      - MYSQL_ROOT_PASSWORD=password
    volumes:
      - <span class="hljs-string">"<span class="hljs-variable">$PWD</span>/mysql-data:/var/lib/mysql"</span>
  zenml:
    container_name: zenml_server
    image: zenmldocker/zenml-server:0.35.1
    ports:
      - <span class="hljs-string">"8080:8080"</span> <span class="hljs-comment">#zenml dashboard</span>
    environment:
      - ZENML_STORE_URL=mysql://root:password@host.docker.internal/zenml
      - ZENML_DEFAULT_USER_NAME=admin
      - ZENML_DEFAULT_USER_PASSWORD=zenml
    links:
      - mysql
    depends_on:
      - mysql
    extra_hosts:
      - <span class="hljs-string">"host.docker.internal:host-gateway"</span>
    restart: on-failure
</code></pre>
<p>For other deployment options, follow the <a target="_blank" href="https://docs.zenml.io/getting-started/deploying-zenml">Deploying ZenML documentation</a>.</p>
<p>You can simply launch this with the command:</p>
<pre><code class="lang-bash">docker-compose up -d
</code></pre>
<p>This will start your ZenML server with a MySQL database. You can open localhost in the browser, or the IP address of your EC2 instance, to see the dashboard. Remember the username and password credentials, and change them to something more secure. If you used the plain Docker deployment (my current implementation), the username will be <mark>default</mark> and the password is left empty.</p>
<p>Yup, that's it, your ZenML server deployment is now live!</p>
<hr />
<h2 id="heading-zenml-client-cli">ZenML Client CLI</h2>
<p>The ZenML client is our main way of communicating with the server (the ZenML server and dashboard). This client will help us create MLOps stacks and share them with our team, and execute certain pipelines locally before pushing them to production.</p>
<h3 id="heading-first-install">First: Install</h3>
<p>The first thing we should do, as best practice, is install a ZenML client version that matches our ZenML server. If you have been following closely, you will see that our server is using version <mark>0.35.1</mark>, so we should stick to that version when it comes to using the client. Don't forget to create a virtualenv for this project.</p>
<pre><code class="lang-bash">pip install zenml==0.35.1
</code></pre>
<h3 id="heading-second-connect">Second: Connect</h3>
<p>Once you have installed the ZenML client, it's time to connect to the ZenML server.</p>
<pre><code class="lang-bash">zenml connect --url=http://my-server-address:8080/ --username=default
</code></pre>
<p>Be aware that my username is default because I used the Docker deployment above. If you used docker-compose, you will use a different username (admin, as set in the compose file above).</p>
<h3 id="heading-third-well-maybe-check-the-status">Third: well maybe check the status</h3>
<p><strong>Status</strong>: <code>zenml status</code> will show you the current status, with information you may want to know, and it looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682613757523/852a309b-dacb-4f72-843b-ae100a493b45.png" alt class="image--center mx-auto" /></p>
<p>As you can see, my client is connected to the ZenML server we deployed earlier, and there is no stack other than the default stack, which is currently active.</p>
<h2 id="heading-summary">Summary</h2>
<p>That was a step-by-step guide on how to install and deploy the ZenML server and use the ZenML client CLI. The ZenML server comes with a dashboard that visualizes the entire MLOps ecosystem and reports on pipeline runs and failures, while the ZenML client allows for the creation of MLOps stacks, execution of pipelines and sharing of stacks with a team. ZenML's extensibility makes it easy to scale ML models efficiently and effectively. Try it out for yourself by following the simple installation and deployment steps. In the next article we will build a simple use case to demo execution.</p>
<h3 id="heading-reading-more">Reading more</h3>
<p>I think ZenML has amazing documentation, and I encourage you to read more there about aspects that are not covered here:</p>
<ul>
<li><p>Deploying ZenML: <a target="_blank" href="https://docs.zenml.io/getting-started/deploying-zenml">https://docs.zenml.io/getting-started/deploying-zenml</a></p>
</li>
<li><p>ZenML dashboard and connection <a target="_blank" href="https://docs.zenml.io/v/0.35.1/starter-guide/pipelines/dashboard">https://docs.zenml.io/v/0.35.1/starter-guide/pipelines/dashboard</a></p>
</li>
<li><p>Troubleshooting: <a target="_blank" href="https://docs.zenml.io/getting-started/deploying-zenml/troubleshooting">https://docs.zenml.io/getting-started/deploying-zenml/troubleshooting</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Getting the Most Out of MLOps with ZenML: 1]]></title><description><![CDATA[TLDR

This series of articles focuses on getting the most out of MLOps using ZenML, an open-source framework that unifies the entire machine learning stack.

MLOps is critical to operationalizing and scaling ML models, leading to better decision-maki...]]></description><link>https://farisology.com/getting-the-most-out-of-mlops-with-zenml-1</link><guid isPermaLink="true">https://farisology.com/getting-the-most-out-of-mlops-with-zenml-1</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[mlops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Thu, 27 Apr 2023 07:52:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/ObpCE_X3j6U/upload/bd60b25a6c51592a3611d43a5d1e69a2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-tldr">TLDR</h2>
<ul>
<li><p><em>This series of articles focuses on getting the most out of MLOps using ZenML, an open-source framework that unifies the entire machine learning stack.</em></p>
</li>
<li><p><em>MLOps is critical to operationalizing and scaling ML models, leading to better decision-making and increased efficiency.</em></p>
</li>
<li><p><em>ZenML offers features like automated caching, data versioning, and metadata tracking that simplify the entire ML pipeline.</em></p>
</li>
<li><p><em>ZenML provides Continuous Training and Deployment (CT/CD) capabilities, a pipeline workflow architecture, and a ZenML stack that brings flexibility to handle pipelines.</em></p>
</li>
<li><p><em>By providing a unified and automated solution for ML workflows, ZenML empowers organizations to scale their ML models efficiently and effectively.</em></p>
</li>
</ul>
<h2 id="heading-whats-mlops">What's MLOps?</h2>
<p>Databricks defines MLOps as a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them.</p>
<h2 id="heading-why-mlops">Why MLOps?</h2>
<p>MLOps is important because it enables organizations to effectively operationalize and scale their machine learning models, leading to better decision-making, increased efficiency, and ultimately, greater business success.</p>
<p>Without MLOps, data scientists and engineers may struggle to deploy, monitor, and maintain models, leading to suboptimal performance, wasted resources, and an increased risk of errors.</p>
<p>For example, imagine a financial institution that wants to use machine learning to predict fraudulent transactions. If they don't have proper MLOps processes in place, they may struggle to deploy their model to production, monitor it for performance and accuracy, and update it as necessary. This could lead to missed fraud detection, false positives, and ultimately, financial losses.</p>
<p>On the other hand, with MLOps and a tool like ZenML, the financial institution can streamline the entire ML lifecycle, from data preparation to model training to deployment and beyond. They can easily monitor the performance of their model, detect and address drift, and continuously improve their predictions over time. This results in more accurate fraud detection, better decision-making, and ultimately, a stronger bottom line.</p>
<h2 id="heading-how-does-zenml-help-with-mlops">How does ZenML help with MLOps?</h2>
<p>ZenML helps with MLOps by providing a comprehensive open-source framework that unifies the entire machine learning stack, making it easy to develop, train, and deploy models with a seamless transition from development to deployment.</p>
<p>With features such as local development with Python, automated caching, versioning of data, and automated metadata tracking, ZenML streamlines ML workflows and saves time and resources. Additionally, it allows for collaboration among teams and the ability to visualize ML workflows for improved design.</p>
<blockquote>
<p>ZenML is more than just an MLOps framework - it's a holistic solution that seamlessly integrates the entire machine learning stack, empowering teams to develop, train, and deploy models with ease and efficiency. - ChadGPT</p>
</blockquote>
<p>One of the key benefits of ZenML is its extensibility, as it can be tailored to meet specific needs and workflows. ZenML also provides a single place to link up and manage different MLOps tools, with support for popular infrastructure like Kubeflow, AWS Sagemaker, Azure ML, and Vertex AI GCP.</p>
<p>The new thing in town is Continuous Training and Deployment (CT/CD) capabilities, allowing for end-to-end ML workflows that can deploy models in local or production-grade environments with integrations like MLflow and Seldon Core. By providing a unified and automated solution for ML workflows, ZenML empowers organizations to scale their ML models efficiently and effectively.</p>
<h2 id="heading-a-brief-overview-of-the-zenml-workflow">A brief overview of the ZenML workflow</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682235040091/fefbe9d3-acf4-4a2f-b73f-55fb3806e756.png" alt="Image by ZenML Core Consepts: https://docs.zenml.io/getting-started/core-concepts" class="image--center mx-auto" /></p>
<h3 id="heading-steps-andamp-pipelines">Steps &amp; Pipelines</h3>
<p>ZenML is built around the pipeline workflow architecture. The simplest unit of the workflow is a step, which you can look at as a single process or Python function. You can create your steps in any way that aligns with your use case; for example, you may have an importer step that imports data, then preprocessing steps that transform your data, and so on. Steps can have dependencies among them, where one step depends on the output of another. A collection of steps forms a pipeline.</p>
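<p>To make this concrete, here is a minimal sketch of the step-and-pipeline pattern using the decorator API from the 0.35.x releases used later in this series; the step bodies are illustrative placeholders:</p>
<pre><code class="lang-python">import pandas as pd
from zenml.pipelines import pipeline
from zenml.steps import step, Output


@step
def importer() -&gt; Output(dataset=pd.DataFrame):
    # illustrative: a real importer would read from an actual data source
    return pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})


@step
def trainer(dataset: pd.DataFrame) -&gt; Output(n_rows=int):
    # stand-in for a real training step
    return len(dataset)


@pipeline
def demo_pipeline(importer, trainer):
    # the pipeline only wires step outputs to step inputs
    data = importer()
    trainer(data)


if __name__ == "__main__":
    demo_pipeline(importer=importer(), trainer=trainer()).run()
</code></pre>
<p>Each step runs and caches independently, while the pipeline definition does nothing but express the dependencies between them.</p>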
<h3 id="heading-stacks">Stacks</h3>
<p>To handle your pipeline, you can use a ZenML stack, which is an abstraction layer built for this purpose. For example, you may have a stack with components that handle artefact storage, orchestration and experiment tracking. Upon running the pipeline, this stack handles the process. This brings flexibility in the sense that you may have a local stack for testing a pipeline or project, and a remote stack for production.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1682235097681/14704e28-4e23-45d3-bf5d-e8af7d543b14.png" alt="Image by ZenML Core Consepts: https://docs.zenml.io/getting-started/core-concepts" class="image--center mx-auto" /></p>
<h2 id="heading-summary">Summary</h2>
<p>MLOps is an essential part of Machine Learning engineering that involves streamlining the process of taking ML models to production and maintaining and monitoring them. Without MLOps, organizations risk suboptimal performance, wasted resources, and increased error risks. ZenML provides an open-source solution that simplifies the entire ML pipeline and provides features like automated caching, data versioning, and automated metadata tracking. It seamlessly integrates the entire machine learning stack and supports popular infrastructure like Kubeflow, AWS Sagemaker, Azure ML, and Vertex AI GCP. ZenML also offers Continuous Training and Deployment (CT/CD) capabilities for end-to-end ML workflows that deploy models in local or production-grade environments. The ZenML workflow is built around pipeline architecture with a collection of steps that form a pipeline. ZenML stack is an abstract layer built to handle these pipelines, bringing flexibility in the sense of having a local stack for a pipeline, a project for testing, and a remote stack for production.</p>
]]></content:encoded></item><item><title><![CDATA[MLOps is not DevOps but why?]]></title><description><![CDATA[After a few months as an ML operations engineer, I discovered some of the many misconceptions surrounding the field. These misconceptions came from various directions and the online space is one of them. One of the most prevalent is the belief that M...]]></description><link>https://farisology.com/mlops-is-not-devops-but-why</link><guid isPermaLink="true">https://farisology.com/mlops-is-not-devops-but-why</guid><category><![CDATA[mlops]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Python]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sat, 21 Jan 2023 17:11:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/f1c1154cacc1a32e1fdda3c50c1347e8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After a few months as an ML operations engineer, I discovered some of the many misconceptions surrounding the field. These misconceptions came from various directions and the online space is one of them. One of the most prevalent is the belief that MLOps is simply the application of DevOps principles to machine learning. While there are certainly similarities, the reality is that MLOps is a unique and complex discipline, requiring a deep understanding of both the technical and organizational aspects of machine learning. In this article, I will delve deeper into the distinctions between MLOps and DevOps, and provide a more comprehensive understanding of the role and responsibilities of an MLOps engineer.</p>
<h3 id="heading-definition-of-mlops">Definition of MLOps:</h3>
<p>Nvidia defines Machine learning operations, as "<em>best practices for businesses to run AI successfully with help from an expanding smorgasbord of software products and cloud services</em>."</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1670134798108/df0f32a6-2cde-4e83-8b7e-cfc7d6444f5e.png" alt="MLOPS-process-Gartner-1280.png" /></p>
<p>Databricks' definition brings in more context as it defines MLOps as a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them.</p>
<p>From both definitions, we can agree that our piece of software is a machine-learning model created during training. Maintaining such a model in production implies monitoring or tracking its performance. The model changes if we train it with a different set of data, and this brings the need for data and model versioning.</p>
<h3 id="heading-definition-of-devops">Definition of DevOps</h3>
<p>"DevOps is a set of practices that combines software development and IT operations. It aims to shorten the systems development life cycle and provide continuous delivery with high software quality" -<em>Wikipedia</em>.</p>
<p>DevOps has been around for at least a decade now, and it has grown and matured over time. I often remember it with this infinity diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1670122359765/716d332b-8b7f-4e50-b745-b57ff4e3a478.png" alt="DevOps.png" /></p>
<p>--photo by agapeconsultinggroup</p>
<p>A few concepts stand out; although some are shared across the ML domain, they are not necessarily applied in the same manner.</p>
<h3 id="heading-overview-of-the-differences-between-mlops-and-devops">Overview of the differences between MLOps and DevOps</h3>
<p>Version control is a famous piece of the software world, but when looked at in the light of MLOps, we see that it's not just about the code. We need to version our data and models too, but until recently there was no easy way to do it. Version control was built for code, tailored to software engineering needs, not machine learning engineering needs. In simple terms, this means that even though DevOps and MLOps both do version control, they apply it differently, with different tools.</p>
<blockquote>
<p>MLOps is not DevOps because it is specifically designed to automate and streamline the process of building, training, and deploying machine learning models. It focuses on the unique challenges of managing machine learning models, such as data versioning, model versioning, and model monitoring. --<strong><em>AI coauthored answer to the question in the title</em></strong></p>
</blockquote>
<h2 id="heading-why-mlops-is-different-from-devops">Why MLOps is Different from DevOps</h2>
<h3 id="heading-mlops-focuses-on-machine-learning">MLOps Focuses on Machine Learning</h3>
<p>MLOps is specifically designed to streamline the process of building, training, and deploying machine learning models. Building these models and deploying them comes with unique challenges that the MLOps engineer has to solve. Data versioning, model versioning, and model tracking are essential challenges unique to machine learning that aren't addressed by DevOps. In addition, there are other aspects of ML system design that don't exist in the software engineering paradigm, for example feature generation, provisioning, and processing. Generating features from data is a unique component of machine learning systems and comes with various levels of operational complexity depending on whether the system is online or offline. The necessity for all this operational infrastructure increases as governance rules grow, and reproducibility becomes more than a nice-to-have feature.</p>
<h3 id="heading-mlops-requires-different-tools">MLOps Requires Different Tools</h3>
<p>Code version control has been one of the most remarkable artefacts of the tech industry. A tool like Git grew and inspired different projects like GitHub, GitLab and all the other Git-somethings out there. The current tooling was built with code in mind, and as machine learning became an integral part of different systems, the need arose to track models, data and performance.</p>
<p>To address the need for tracking and versioning there are tools to be added to the stack. Among these tools:</p>
<ul>
<li><p><strong>DVC</strong> is a tool for data tracking that is compatible with git. DVC helps you version your data and track any changes as if it's a piece of code.</p>
</li>
<li><p><strong>MLflow</strong> brings more to your stack to streamline and facilitate model versioning via its registry, plus experiment tracking with complete logging capabilities for most machine learning frameworks (a minimal sketch follows this list).</p>
</li>
</ul>
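<p>To make the MLflow side concrete, here is a minimal sketch of experiment tracking with its Python API; the toy dataset and parameter choices are illustrative:</p>
<pre><code class="lang-python">import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy data standing in for a real training set
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run(run_name="dct-baseline"):
    mlflow.log_param("max_depth", 5)
    model = DecisionTreeClassifier(max_depth=5).fit(x_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(x_test))
    mlflow.log_metric("accuracy", accuracy)
    # store the trained model as a versioned artifact
    mlflow.sklearn.log_model(model, "model")
</code></pre>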
<p>The required tools for MLOps are different, and they are not simply alternatives to Git, but full-fledged capabilities that need to be added to the operations stack. DVC and MLflow are just two simple examples, and such tools will keep evolving.</p>
<p>The revolution continues with pre-trained models and LLMs. GPT and the like will bring about a new set of tooling that facilitates working with foundational models. These new tools for prompt engineering and more will naturally add to the MLOps stack, unless time proves a new age of LLMOps.</p>
<h3 id="heading-mlops-requires-different-skills">MLOps Requires Different Skills</h3>
<p>MLOps requires a different set of skills than DevOps, as the field is focused on the specific challenges and complexities of machine learning. Some of the key skills that an MLOps engineer should possess include:</p>
<ol>
<li><p><strong>Machine learning expertise</strong>: An understanding of the various types of machine learning algorithms, their strengths and weaknesses, and how to implement them in a production environment.</p>
</li>
<li><p><strong>Data science knowledge:</strong> Familiarity with data preprocessing, feature engineering, and model evaluation is essential for MLOps engineers to be able to work effectively with data scientists.</p>
</li>
<li><p><strong>Monitoring and troubleshooting:</strong> Understanding how to monitor machine learning models in a production environment and troubleshoot issues that arise is crucial for maintaining the performance and accuracy of the models.</p>
</li>
<li><p>Understanding of ML Governance, ML Security, ML Privacy, ML Explainability, and ML Fairness.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>MLOps is a unique and complex field that requires a specialized set of tools, skills, and knowledge to manage the specific challenges and complexities of machine learning models and data. It's important to understand that MLOps is not simply the application of DevOps principles to machine learning, and that it requires a deep understanding of both the technical and organizational aspects of machine learning. As the field of AI continues to evolve, the distinctions between the two fields will become even more pronounced. Machine learning leaders and practitioners need to recognize these differences and not be held back by assumptions about similarities between the two fields. By recognizing the nuances of MLOps and approaching them with a clear understanding, organizations can more effectively deploy and manage their machine learning models in production.</p>
]]></content:encoded></item><item><title><![CDATA[Learning what to learn: a guide for intermediate data scientists]]></title><description><![CDATA[Learning technical skills is a challenging quest. The journey is not always smooth and often learners go through phases. This story is the sequel to my first part Learning How to Learn: A Guide for Aspiring Data Scientists and I want to extend to you...]]></description><link>https://farisology.com/learning-what-to-learn-a-guide-for-intermediate-data-scientists</link><guid isPermaLink="true">https://farisology.com/learning-what-to-learn-a-guide-for-intermediate-data-scientists</guid><category><![CDATA[Data Science]]></category><category><![CDATA[4articles4weeks]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Tue, 20 Sep 2022 08:25:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/oqStl2L5oxI/upload/v1663235454054/cLA2KW76s.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Learning technical skills is a challenging quest. The journey is not always smooth and often learners go through phases. This story is the sequel to my first part <a target="_blank" href="https://farisology.com/learning-how-to-learn-a-guide-for-aspiring-data-scientists">Learning How to Learn: A Guide for Aspiring Data Scientists</a> and I want to extend to you here the map of learning. You may ask what an intermediate data scientist is and how can we quantify these levels. Frankly, there is no data-driven definition I have in mind right now, so if you would agree with me let's adopt some heuristics for the sake of simplicity.</p>
<blockquote>
<p>At a certain level of your growth, curated courses will fall short of fulfilling your learning objectives. This stage of learning is often characterized by your ability to consume primary sources of knowledge, like documentation, to build your understanding and intuition.</p>
</blockquote>
<p>Hypothetically, if the data science learning journey is at all similar to learning to code, then we can assume that an intermediate level of competence manifests as the ability to formulate a business problem into data product requirements and a solution, the ability to validate and analyse model performance, and the ability to build a successful proof of concept. If you have been tapping into these stages of any data science product, then you can safely assume you are at the stage where this story is important for you.</p>
<p>There are several ways in which data scientists can demo their project ideas. In my experience, two main frameworks really helped me do it effectively: <a target="_blank" href="https://dash.plotly.com/introduction">Dash</a> and <a target="_blank" href="https://streamlit.io/">Streamlit</a> were my go-to tools, and I championed their adoption in every team I joined. If you are still at a stage where creating a POC with a nice simple UI is not required, this story can still hold value in showing you what to learn next.</p>
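<p>To give a flavour of how little code such a demo takes, here is a minimal Streamlit sketch; the widgets and column handling are illustrative, not a full POC:</p>
<pre><code class="lang-python">import pandas as pd
import streamlit as st

st.title("Model demo")

# let a stakeholder upload their own data to explore
uploaded = st.file_uploader("Upload a CSV of features", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write("Preview of your data:", df.head())

# interactive widgets make the demo feel alive
threshold = st.slider("Decision threshold", 0.0, 1.0, 0.5)
st.write(f"Current threshold: {threshold:.2f}")
</code></pre>
<p>You run it with <code>streamlit run app.py</code> and get a shareable web UI without writing any frontend code.</p>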
<h2 id="heading-fastapi-to-build-apis">Fastapi to build APIs</h2>
<p>Learning to build an API is a critical skill for a data scientist thinking of getting her hands on production. This skill will not only be a selling point when talking to small teams that are hiring, but will also be a competitive advantage for you. It allows you to maximize your impact and drive more projects to that production line. There are many challenges in putting a model into production, but API development skills should not be one of them.</p>
<p>Although there are several frameworks that can be used, I am biased toward a great one: <a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI</a>. I have written a post to help you understand the <a target="_blank" href="https://farisology.com/fastapi-for-everyone">features and advantages of FastAPI</a> for data scientists. The FastAPI documentation is the best course out there; it's almost the gold standard of good documentation.</p>
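<p>As a taste of why I like it, here is a minimal prediction-endpoint sketch; the <code>model.joblib</code> file and the feature fields are hypothetical stand-ins for your own trained model:</p>
<pre><code class="lang-python">import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to a trained model


class Features(BaseModel):
    age: float
    weight: float
    bmi: float


@app.post("/predict")
def predict(features: Features):
    # pydantic validates the payload before it ever reaches the model
    row = [[features.age, features.weight, features.bmi]]
    return {"prediction": int(model.predict(row)[0])}
</code></pre>
<p>Serve it with <code>uvicorn main:app --reload</code> and FastAPI gives you interactive documentation at <code>/docs</code> for free.</p>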
<h2 id="heading-docker-and-containerised-applications">Docker and containerised applications</h2>
<p>There was a time when the world of tech was certainly different from what it is today. Virtualisation has simplified many of our needs and boosted how we build systems. The value of learning about containerisation, and especially Docker, is that it brings so much ease and robustness to your products. Deploying your model API built with FastAPI is typically done via Docker containers, and you will often come across tools that are easily deployed as containers. The use cases are many, and today Docker containers are standard practice, where your knowledge of them brings you agility.</p>
<p>There are a few things you need to understand gradually as you learn about docker:</p>
<ul>
<li>What is the use-case it solves?</li>
<li>The difference between virtual machines and containerization.</li>
<li>Building a Docker image.</li>
<li>Running a Docker container from an image.</li>
</ul>
<p>The best course for this is from the amazing freeCodeCamp:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/fqMOX6JJhGo">https://youtu.be/fqMOX6JJhGo</a></div>
<p>If that two-hour course is a lot, you may check this article by one of the Hashnode community members:</p>
<p><a target="_blank" href="https://atharvbobade.hashnode.dev/introduction-to-docker-part-1">Introduction to Docker</a></p>
<h2 id="heading-cloud-services">Cloud services</h2>
<p>The cloud is where all the action takes place. Building functional knowledge of cloud services and concepts is an operational requirement, especially if you aspire to take ownership of the complete lifecycle. This is my personal opinion, informed by my working experience in the data field. There are not many cloud vendors, and three are the most common: Amazon, Google, and Microsoft's Azure are the top three, but there are other vendors like DigitalOcean and Linode.</p>
<p>Amazon Web Services (AWS) is the most common and widely adopted. Learning AWS will help you learn and understand any other cloud, while ensuring you have knowledge of the most widely used vendor. There are a lot of resources to learn AWS, but if you are completely new to the concept of the cloud you can start with this AWS-made video:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/gv4m1fjuthU">https://youtu.be/gv4m1fjuthU</a></div>
<p>There are a few main services that you need to understand and get solid hands-on experience building on. These services are:</p>
<ul>
<li><a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html">Elastic Compute Engine</a> which is a virtual server (EC2): understanding how to create an instance, accessing using SSH, and configure security groups.</li>
<li><a target="_blank" href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html">Simple Storage Service</a> (S3): Creating a bucket, pushing a file and reading a file from a bucket using python boto3 library + via the CLI.</li>
<li><a target="_blank" href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html">Relational DataBase Service</a> (RDS): Creating an instance, connecting to the database, reading and writing data into a table.</li>
</ul>
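<p>To give you a taste of the hands-on part, here is a minimal boto3 sketch covering the S3 tasks above. It assumes your AWS credentials are already configured (for example via the aws configure command), and the bucket and file names are hypothetical:</p>
<pre><code>import boto3

# The client picks up credentials and region from your AWS configuration
s3 = boto3.client("s3")

bucket = "my-example-bucket-12345"  # hypothetical; S3 bucket names are globally unique
s3.create_bucket(Bucket=bucket)     # outside us-east-1, a CreateBucketConfiguration is also required

# Push a local file to the bucket, then read it back
s3.upload_file("report.csv", bucket, "reports/report.csv")
body = s3.get_object(Bucket=bucket, Key="reports/report.csv")["Body"].read()
print(body[:100])
</code></pre>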
<p>The next services that you can learn about are:</p>
<ul>
<li><a target="_blank" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html">IAM</a>: identity and access management to understand and learn your way around access management which will improve the security of your practice.</li>
<li><a target="_blank" href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html">ECR</a>: container registry is your gateway to deploying containerized applications in the cloud either into an EC2 or ECS cluster.</li>
<li><a target="_blank" href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html">DynamoDB</a>: adding this no-SQL storage will give your solutions an added advantage in more ways than one. I start thinking you should learn it first in this list.</li>
</ul>
<blockquote>
<p>As you move on to more advanced concepts, the pool of learning resources shrinks until you are left with the primary sources. The documentation of various projects is your first choice; the second source is your own experimentation and hands-on journey.</p>
</blockquote>
<h2 id="heading-final-thoughts">Final thoughts</h2>
<p>The knowledge you gain from learning is compounded by a factor of magnitude when you apply it in the next project. You can learn many concepts and new technologies, but that learning remains limited until you start working on problems to solve. In my experience, there is always a set of concepts that stretches across many use-cases, and every time you apply the same skills to different problems. The beauty of this is that your skills and experience are enriched each time you work on a new use-case.</p>
<p><em>I hope this list of recommended learning guides you through. May your accuracy be high and may your models never overfit!</em></p>
]]></content:encoded></item><item><title><![CDATA[Embeddings and The Age of Transformers]]></title><description><![CDATA[The prime requirement for most machine learning algorithms is numeric data. All your inputs that represent real-world entities should be translated into numeric values. Applied machine learning is rooted in feature engineering techniques to produce ...]]></description><link>https://farisology.com/embeddings-and-the-age-of-transformers</link><guid isPermaLink="true">https://farisology.com/embeddings-and-the-age-of-transformers</guid><category><![CDATA[2Articles1Week]]></category><category><![CDATA[Python]]></category><category><![CDATA[4articles4weeks]]></category><category><![CDATA[nlp]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Fri, 02 Sep 2022 18:53:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/hsg538WrP0Y/upload/v1662144585020/pjMgi2MWh.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The prime requirement for most machine learning algorithms is numeric data. All your inputs that represent real-world entities should be translated into numeric values. Applied machine learning is rooted in feature engineering techniques to produce the best results. This has made domain knowledge significantly important, because crafting new ways to represent real-world entities is best done with domain expertise. Lately, the emergence of transformers such as BERT has been transformative for natural language processing/understanding (NLP/NLU).</p>
<p>With images, numerical representation can be as crude as using pixel intensity values. Other types of data, like tabular or textual data, have their own ways of making these transformations. Here, I will speak in the context of natural language understanding (NLU) and how the field moved from the lexicon level towards semantic representation.</p>
<h2 id="heading-review">Review</h2>
<p>Embeddings are a way to represent text numerically. We do this text-to-numeric transformation because, again, machine learning requires us to do so. Here is an over-simplistic example of representing text for machine learning:</p>
<pre><code>text1 = <span class="hljs-string">"I love bananas"</span>
text2 = <span class="hljs-string">"I love apples"</span>
text3 = <span class="hljs-string">"bananas and apples"</span>

embedding1 = [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>]
embedding2 = [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>]
embedding3 = [<span class="hljs-number">2</span>, <span class="hljs-number">4</span>, <span class="hljs-number">3</span>]
</code></pre><p>As you can see, we simply denoted each unique word from each text with a single numeric value. The result is that each statement is now represented as a simple low-dimensional vector. That vector can be processed by machine learning algorithms to perform various tasks, be it text classification or others. You can observe that some words are repeated; the purpose of this is to show how each unique word is given the same unique value.</p>
<h3 id="heading-bag-of-words">Bag of words</h3>
<p>The bag-of-words technique builds a bag-like vector where every column corresponds to a unique word and each row represents one sentence. Each unique term/word is represented by 0/1, denoting its presence in the sentence. If we use the above example, we will have something that looks like this:</p>
<pre><code>   I  love  bananas  apples  and
[ [<span class="hljs-number">1</span>,    <span class="hljs-number">1</span>,       <span class="hljs-number">1</span>,      <span class="hljs-number">0</span>,   <span class="hljs-number">0</span>],
  [<span class="hljs-number">1</span>,    <span class="hljs-number">1</span>,       <span class="hljs-number">0</span>,      <span class="hljs-number">1</span>,   <span class="hljs-number">0</span>],
  [<span class="hljs-number">0</span>,    <span class="hljs-number">0</span>,       <span class="hljs-number">1</span>,      <span class="hljs-number">1</span>,   <span class="hljs-number">1</span>] ]
</code></pre><h3 id="heading-tf-idf">Tf-idf</h3>
<p>Term frequency-inverse document frequency enhances what we get from the bag of words. Instead of zeros and ones indicating the presence or absence of a word in a sentence, the values weigh the importance of the word in the documents. Think of it as a weight for each word; it comprises two parts.</p>
<ul>
<li>Term frequency: <blockquote>
<p>TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).</p>
</blockquote>
</li>
</ul>
<p>This measures how frequently a term or word appears in a document. Documents vary in length, and in long documents a term might occur more often, so dividing by the total number of terms in the document is a normalization step. At this stage, all terms are considered equally important.</p>
<ul>
<li>Inverse document frequency:<blockquote>
<p>IDF(t) = log_e(Total number of documents / Number of documents with term t in it).</p>
</blockquote>
</li>
</ul>
<p>Certain terms like "the" and "is" occur with high frequency, but we know they are not important. Computing IDF measures the importance of a term by scaling down frequent terms and scaling up the weight of rare ones.</p>
<blockquote>
<p>Assume we have a document of 100 words wherein the word fish appears 3 times. The term frequency (i.e., tf) for fish is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word fish appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. - <em>Example from <a target="_blank" href="http://tfidf.com/">tf-idf</a></em></p>
</blockquote>
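<p>If you would rather have these weights computed for you, here is a minimal sketch using scikit-learn's TfidfVectorizer on the earlier example sentences. Note that scikit-learn applies smoothing and normalisation, so the numbers will differ slightly from the hand-worked formula above:</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["I love bananas", "I love apples", "bananas and apples"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)

# The default tokenizer lowercases and drops one-character tokens like "I"
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf.toarray().round(2))            # one weighted row per sentence
</code></pre>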
<p>This change in the way we represent sentences has brought substantial performance improvements. These two methods are basic, and they are thought of more as approaches to represent text digitally. The field then moved on to the early versions of text embeddings: word2vec, sentence2vec, and doc2vec. Word2vec generates dense embeddings from words. Under the hood, words are encoded into one-hot vectors and forwarded to a hidden neural layer which produces hidden weights. These hidden weights are used as embeddings.</p>
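<p>As a quick illustration of word2vec, here is a minimal sketch using the gensim library. The toy corpus is hypothetical and far too small for meaningful vectors; real training needs much more text:</p>
<pre><code>from gensim.models import Word2Vec

corpus = [["i", "love", "bananas"],
          ["i", "love", "apples"],
          ["bananas", "and", "apples"]]

# Train a tiny model: 50-dimensional vectors, context window of 2 words
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["bananas"][:5])                   # first few dimensions of one word vector
print(model.wv.similarity("bananas", "apples"))  # cosine similarity between two words
</code></pre>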
<h2 id="heading-pretrained-models">Pretrained models</h2>
<p><a target="_blank" href="https://paperswithcode.com/method/bert">BERT</a> stands for Bidirectional Encoder Representation Transformer and it's pre-trained and developed by Google. The improvements here are from removing the unidirectional constraint that standard transformers have had. You can refer to the original <a target="_blank" href="https://arxiv.org/abs/1810.04805">BERT paper</a> for an in-depth understanding of its architecture.</p>
<p>The second breakthrough came from the Generative Pretrained Transformer (GPT). The transformer architecture, combined with unsupervised pre-training, has changed the scope of natural language understanding. This means training task-specific models from scratch is a relic of the past. But GPT and BERT were only the first wave of transformers; today we have even more powerful and advanced models changing how we practice NLU.</p>
<p>I firmly believe that we have entered a new era and it calls for your attention to understand how to use pre-trained models. Allow me to show you a couple of applications or perhaps use cases in which I have leveraged these advancements in my projects.</p>
<h3 id="heading-semantic-search">Semantic search</h3>
<p>In today's world we - as users - expect search to be better than text matching. Users implicitly expect search results to be refined and semantically sound. This is challenging without transformers, because plain text vectorisation can only tell us whether two texts are syntactically similar. Transformers, however, bridge that gap and bring the ability to measure semantic similarity. Let's look at a simple example which was mentioned in an old <a target="_blank" href="https://farisology.com/beneath-the-hood-fave-personalization-39d54b136fed#heading-semantic-encoding">article</a>.</p>
<p>Say you have two items: one is a drink, the other is a meal. We know that both are under the food &amp; beverages category, and they should be mathematically closer to each other than to something like a gadget. If we represented the name of each item using one-hot encoding, for example, we would have an arbitrary representation that doesn't disclose this semantic meaning. However, if we use a pre-trained model to generate embeddings that represent each item and then measure the cosine distance, we can tell which sentences are similar. I have curated this simple example to illustrate how we can do this using the sentence-transformers library in python.</p>
<pre><code><span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer, util
model = SentenceTransformer(<span class="hljs-string">'all-MiniLM-L6-v2'</span>)

# Single list <span class="hljs-keyword">of</span> sentences
sentences = [<span class="hljs-string">"chicken burger with fries"</span>,
             <span class="hljs-string">"coke with ice"</span>,
             <span class="hljs-string">"iphone 13 pro with 5k pixels"</span>]

#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities <span class="hljs-keyword">for</span> each sentence <span class="hljs-keyword">with</span> each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)

#Find the pairs <span class="hljs-keyword">with</span> the highest cosine similarity scores
pairs = []
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(cosine_scores)<span class="hljs-number">-1</span>):
    <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(i+<span class="hljs-number">1</span>, len(cosine_scores)):
        pairs.append({<span class="hljs-string">'index'</span>: [i, j], <span class="hljs-string">'score'</span>: cosine_scores[i][j]})

#Sort scores <span class="hljs-keyword">in</span> decreasing order
pairs = sorted(pairs, key=lambda x: x[<span class="hljs-string">'score'</span>], reverse=True)

<span class="hljs-keyword">for</span> pair <span class="hljs-keyword">in</span> pairs[<span class="hljs-number">0</span>:<span class="hljs-number">10</span>]:
    i, j = pair[<span class="hljs-string">'index'</span>]
    print(<span class="hljs-string">"{} \t\t {} \t\t Score: {:.4f}"</span>.format(sentences[i], sentences[j], pair[<span class="hljs-string">'score'</span>]))
</code></pre><p>The output of this aligns with your human experience:</p>
<pre><code>chicken burger <span class="hljs-keyword">with</span> fries          coke <span class="hljs-keyword">with</span> ice          Score: <span class="hljs-number">0.2669</span>
coke <span class="hljs-keyword">with</span> ice          iphone <span class="hljs-number">13</span> pro <span class="hljs-keyword">with</span> <span class="hljs-number">5</span>k pixels          Score: <span class="hljs-number">0.0684</span>
chicken burger <span class="hljs-keyword">with</span> fries          iphone <span class="hljs-number">13</span> pro <span class="hljs-keyword">with</span> <span class="hljs-number">5</span>k pixels          Score: <span class="hljs-number">0.0613</span>
</code></pre><p>The first two sentences are close, as determined by our human experience, and the scores confirm this. The other pairs prove to be less similar than the first two. Now, this can be applied to search engines in a few ways. One way is to add ranking and sorting logic that takes the results of the search engine before they are presented to the user and ranks them based on their semantic similarity score measured against the search query or sentence.</p>
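<p>Here is a minimal sketch of that re-ranking idea. The results list is a hypothetical stand-in for whatever your existing search engine returns for a query:</p>
<pre><code>from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "cold soft drink"  # hypothetical user query
results = ["chicken burger with fries",
           "coke with ice",
           "iphone 13 pro with 5k pixels"]

# Embed the query and the candidate results, then sort by semantic similarity
query_emb = model.encode(query, convert_to_tensor=True)
result_embs = model.encode(results, convert_to_tensor=True)
scores = util.cos_sim(query_emb, result_embs)[0]

for text, score in sorted(zip(results, scores.tolist()), key=lambda p: p[1], reverse=True):
    print("{:.4f}  {}".format(score, text))
</code></pre>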
<h3 id="heading-sentiment-and-text-classification">Sentiment and text classification</h3>
<p>If you are doing any project where you are solving a text classification problem, you can leverage these pre-trained models to produce embeddings. These dense low-dimensional vectors are rich in their ability to bring not only syntax-level but semantics-level understanding to your model. Here is a sample code to demo:</p>
<pre><code><span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer
model = SentenceTransformer(<span class="hljs-string">'all-MiniLM-L6-v2'</span>)

#Our sentences we like to encode
sentences = [<span class="hljs-string">'This framework generates embeddings for each input sentence'</span>]

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
<span class="hljs-keyword">for</span> sentence, embedding <span class="hljs-keyword">in</span> zip(sentences, embeddings):
    print(<span class="hljs-string">"Sentence:"</span>, sentence)
    print(<span class="hljs-string">"Embedding:"</span>, embedding)
    print(<span class="hljs-string">""</span>)
</code></pre><p>The embedding will look something like this:</p>
<pre><code>[<span class="hljs-number">-1.37173440e-02</span> <span class="hljs-number">-4.28515337e-02</span> <span class="hljs-number">-1.56286079e-02</span>  <span class="hljs-number">1.40537489e-02</span>
  <span class="hljs-number">3.95537838e-02</span>  <span class="hljs-number">1.21796302e-01</span>  <span class="hljs-number">2.94333789e-02</span> <span class="hljs-number">-3.17523852e-02</span>
  <span class="hljs-number">3.54959555e-02</span> <span class="hljs-number">-7.93140158e-02</span>  <span class="hljs-number">1.75878275e-02</span> <span class="hljs-number">-4.04369608e-02</span>
  <span class="hljs-number">4.97259833e-02</span>  <span class="hljs-number">2.54912414e-02</span> <span class="hljs-number">-7.18700588e-02</span>  <span class="hljs-number">8.14968869e-02</span>
  <span class="hljs-number">1.47068209e-03</span>  <span class="hljs-number">4.79627326e-02</span> <span class="hljs-number">-4.50336412e-02</span> <span class="hljs-number">-9.92174968e-02</span>
 <span class="hljs-number">-2.81769391e-02</span>  <span class="hljs-number">6.45046309e-02</span>  <span class="hljs-number">4.44670543e-02</span> <span class="hljs-number">-4.76217493e-02</span>
 <span class="hljs-number">-3.52952704e-02</span>  <span class="hljs-number">4.38672006e-02</span> <span class="hljs-number">-5.28565980e-02</span>  <span class="hljs-number">4.33019130e-04</span>
  <span class="hljs-number">1.01921476e-01</span>  <span class="hljs-number">1.64072122e-02</span>  <span class="hljs-number">3.26996520e-02</span> <span class="hljs-number">-3.45986746e-02</span>
  <span class="hljs-number">1.21339737e-02</span>  <span class="hljs-number">7.94871375e-02</span>  <span class="hljs-number">4.58343467e-03</span>  <span class="hljs-number">1.57778431e-02</span>
 <span class="hljs-number">-9.68210120e-03</span>  <span class="hljs-number">2.87625995e-02</span> <span class="hljs-number">-5.05805984e-02</span> <span class="hljs-number">-1.55793587e-02</span>
 <span class="hljs-number">-2.87906975e-02</span> <span class="hljs-number">-9.62281693e-03</span>  <span class="hljs-number">3.15556452e-02</span>  <span class="hljs-number">2.27348991e-02</span>
  <span class="hljs-number">8.71449485e-02</span> <span class="hljs-number">-3.85027565e-02</span> <span class="hljs-number">-8.84718448e-02</span> <span class="hljs-number">-8.75497889e-03</span>
 <span class="hljs-number">-2.12343168e-02</span>  <span class="hljs-number">2.08923966e-02</span> <span class="hljs-number">-9.02077779e-02</span> <span class="hljs-number">-5.25732227e-02</span>
 <span class="hljs-number">-1.05638737e-02</span>  <span class="hljs-number">2.88311075e-02</span> <span class="hljs-number">-1.61455162e-02</span>  <span class="hljs-number">6.17838977e-03</span>
 <span class="hljs-number">-1.23234931e-02</span> <span class="hljs-number">-1.07337432e-02</span>  <span class="hljs-number">2.83353962e-02</span> <span class="hljs-number">-5.28567694e-02</span>
 <span class="hljs-number">-3.58617976e-02</span> <span class="hljs-number">-5.97989261e-02</span> <span class="hljs-number">-1.09055163e-02</span>  <span class="hljs-number">2.91566737e-02</span>]
</code></pre><p>This powerful representation is revolutionary because we can scale the use of the model to more use cases without retraining from scratch, and this saves time and effort. Transformers have been setting new records on NLP benchmarks for a few years now, and they are more available to you than ever before. You can copy-paste any of the code examples into a Colab notebook and start playing around with them.</p>
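<p>As a minimal sketch of that idea, you can feed the embeddings into any ordinary classifier as features. The tiny labelled set here is hypothetical, purely for illustration:</p>
<pre><code>from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = ["great product, loved it", "terrible, a waste of money",
         "absolutely fantastic", "never buying this again"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# The embeddings become the feature matrix; any classifier works on top
X = model.encode(texts)
clf = LogisticRegression().fit(X, labels)

print(clf.predict(model.encode(["really happy with this"])))
</code></pre>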
<h2 id="heading-conclusion">Conclusion</h2>
<p>I have been fascinated by transformers and how they changed the way I do NLP and look at problems. In this short story, I hope to help you understand the what, why, and where of using transformers and pre-trained models. I will be glad to know about the projects and ideas you will build with transformers.</p>
]]></content:encoded></item><item><title><![CDATA[FastAPI for everyone]]></title><description><![CDATA[Data science is a multidisciplinary field and this versatility brings about people from different backgrounds. Many data scientists are not equipped with a software engineering background, and those who are, are not necessarily experts. If you are a data scien...]]></description><link>https://farisology.com/fastapi-for-everyone</link><guid isPermaLink="true">https://farisology.com/fastapi-for-everyone</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[Python]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Web Development]]></category><category><![CDATA[2Articles1Week]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Fri, 19 Aug 2022 03:36:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1660749243752/PRuwDKYf1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data science is a multidisciplinary field and this versatility brings about people from different backgrounds. Many data scientists are not equipped with a software engineering background, and those who are, are not necessarily experts. If you are a data scientist, you might have tried or learned about what happens to a model after you are done with training, particularly about taking your machine learning model to production. This is the stage in which the machine learning model is wrapped in an API to be available as a web service. Even if you come here from a web-development background, this is still relevant to you.</p>
<p><a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI</a> is a web framework for building a python API that is fast and reliable. It's truly a good addition to your skills and special power when deploying your model falls under your responsibilities. In this preview, I would share a few advantages of using the framework over Flask. </p>
<p><em>I assume you have a brief background on APIs. For example, you can understand what an endpoint, a request, or a query is. If you don't have any idea, you can still follow along to learn about FastAPI, but do your research afterwards.</em></p>
<h2 id="heading-intuitive-code">Intuitive code</h2>
<p>Your time is precious and writing a lot of code means higher chances of bugs and errors. There are two ways in which FastAPI reduces the amount of code you type. The first is for reading parameters and the second is for data validation.</p>
<h3 id="heading-reading-parameters">Reading parameters</h3>
<p>This is my favourite part of FastAPI, and for one obvious reason: it builds up to more and more awesomeness. One of the ways to define variables in FastAPI is in the path, in the same fashion as python format strings:</p>
<pre><code><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI

app = FastAPI()


<span class="hljs-meta">@app.get("/items/{item_id}")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_item</span>(<span class="hljs-params">item_id</span>):</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"item_id"</span>: item_id}
</code></pre><p>Reading the variables and declaring them in the path like python format strings is the pythonic way. This is also not unique to FastAPI, but as you read further you will see how it builds up to make a great combination of advantages. If you declare any function parameter that is not in the path, it's automatically understood and treated as a <strong>query parameter</strong>. Let's see the simple script here:</p>
<pre><code><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Union
<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI

app = FastAPI()


<span class="hljs-meta">@app.get("/items/{item_id}")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_item</span>(<span class="hljs-params">item_id: str, query_param: Union[str, None] = None</span>):</span>
    <span class="hljs-keyword">if</span> query_param:
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"item_id"</span>: item_id, <span class="hljs-string">"query_param"</span>: query_param}
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"item_id"</span>: item_id}
</code></pre><p>The function read_item has one path parameter, which is <em>item_id</em>, and one query parameter, which is <em>query_param</em>. The way it's written here, with = None, is just to say the query parameter is optional.</p>
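<p>To try this locally, save the snippet as main.py (a hypothetical file name) and serve it with uvicorn; path and query parameters can then be passed straight in the URL:</p>
<pre><code>uvicorn main:app --reload

# In another terminal; item_id is typed as str, so it comes back as "5"
curl "http://127.0.0.1:8000/items/5?query_param=foo"
# {"item_id":"5","query_param":"foo"}
</code></pre>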
<h3 id="heading-automatic-data-validation">Automatic data validation</h3>
<p>Data coming into your API can be in many shapes and formats. Validating the data before processing is important, and FastAPI does this via <a target="_blank" href="https://pydantic-docs.helpmanual.io/">pydantic</a>. Pydantic streamlines data validation via python type annotations. This is implemented in several ways; for example, query parameters can have additional string validations. In the following script, observe how we define a <em>default value</em> with a <em>max_length</em> of 6 characters.</p>
<pre><code><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Union
<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI, Query

app = FastAPI()

<span class="hljs-meta">@app.get("/items/")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_items</span>(<span class="hljs-params">query_param: Union[str, None] = Query(<span class="hljs-params">default=<span class="hljs-string">"FooBar"</span>, max_length=<span class="hljs-number">6</span></span>)</span>):</span>
    results = {<span class="hljs-string">"items"</span>: [{<span class="hljs-string">"item_id"</span>: <span class="hljs-string">"Foo"</span>}, {<span class="hljs-string">"item_id"</span>: <span class="hljs-string">"Bar"</span>}]}
    <span class="hljs-keyword">if</span> query_param:
        results.update({<span class="hljs-string">"query_param"</span>: query_param})
    <span class="hljs-keyword">return</span> results
</code></pre><p>The developer can also add numeric validations for path parameters and others; see the <a target="_blank" href="https://fastapi.tiangolo.com/tutorial/path-params-numeric-validations/">documentation</a>. The other way is also possible, simply by using the pydantic BaseModel class, which defines the parameters and their data types.</p>
<pre><code><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List, Union
<span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel

app = FastAPI()

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Item</span>(<span class="hljs-params">BaseModel</span>):</span>
    name: str
    description: Union[str, <span class="hljs-literal">None</span>] = <span class="hljs-literal">None</span>
    price: float
    tax: Union[float, <span class="hljs-literal">None</span>] = <span class="hljs-literal">None</span>
    tags: List[str] = []

<span class="hljs-meta">@app.post("/items/", response_model=Item)</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_item</span>(<span class="hljs-params">item: Item</span>):</span>
    <span class="hljs-keyword">return</span> item
</code></pre><p>Look at the third line from the bottom, which is the decorator, and observe that we have defined <strong>response_model</strong> as a parameter there. The value passed is the name of the <strong>BaseModel</strong>, so FastAPI is going to use it to do the following (a quick test sketch follows the list):</p>
<ul>
<li>Convert the output data to the declared type.</li>
<li>Perform data validation.</li>
<li>Generate a JSON Schema for the response.</li>
<li>Use it for the automatic documentation systems.</li>
</ul>
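<p>A quick way to see all of this in action, assuming the app from the snippet above lives in a hypothetical main.py, is FastAPI's built-in test client:</p>
<pre><code>from fastapi.testclient import TestClient

from main import app  # hypothetical module holding the snippet above

client = TestClient(app)

# A valid item passes pydantic validation and is echoed back
ok = client.post("/items/", json={"name": "book", "price": 9.99})
print(ok.status_code, ok.json())

# A missing required field (price) triggers an automatic 422 response
bad = client.post("/items/", json={"name": "book"})
print(bad.status_code)
</code></pre>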
<h2 id="heading-auto-documentation">Auto Documentation</h2>
<p>The interactive documentation is generated automatically and provided by <a target="_blank" href="https://github.com/swagger-api/swagger-ui">swagger-ui</a>. For example, if you are running FastAPI locally you can access the docs at <a target="_blank" href="http://127.0.0.1:8000/docs">http://127.0.0.1:8000/docs</a>, which will look like this:
<img src="https://fastapi.tiangolo.com/img/index/index-01-swagger-ui-simple.png" alt="swagger-ui doc" />
(Image by FastAPI documentation)</p>
<p>You also have an optional alternative doc by <a target="_blank" href="https://github.com/Redocly/redoc">redoc</a> at <a target="_blank" href="http://127.0.0.1:8000/redoc">http://127.0.0.1:8000/redoc</a>, which will look like this:
<img src="https://fastapi.tiangolo.com/img/index/index-02-redoc-simple.png" alt="alternative docs" />
(Image by FastAPI documentation)</p>
<h2 id="heading-performance">Performance</h2>
<p>According to <a target="_blank" href="https://www.techempower.com/benchmarks/#section=test&amp;runid=7464e520-0dc2-473d-bd34-dbdfd7e85911&amp;hw=ph&amp;test=db&amp;l=zijzen-7">techempower</a>, FastAPI is the third-fastest python framework, coming behind Uvicorn and Starlette, which are the backbone of FastAPI. However, I believe this is not the only speed we should care about. The speed of development and building with FastAPI is outstanding.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope this paints a clearer picture of how concise and elegant the building experience with FastAPI is, in addition to the speed and other features I shared with you here. I came across this framework back in 2019 and have been using it for several data science projects in production. There are more points that I didn't cover here, because I am sure you will have a pleasant time going through the <a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI documentation</a>. It's one of the best-documented projects out there.</p>
]]></content:encoded></item><item><title><![CDATA[How to Rock Data Science One Project a Time]]></title><description><![CDATA[Background
Once upon a time, I opened my Jupyter notebook to show a C-suite manager my work on the project. He had never asked me to show my work ever again after that. If you didn't understand what happened here, imagine the following: a passionate ...]]></description><link>https://farisology.com/how-to-rock-data-science-one-project-a-time</link><guid isPermaLink="true">https://farisology.com/how-to-rock-data-science-one-project-a-time</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Product Management]]></category><category><![CDATA[Python]]></category><category><![CDATA[2Articles1Week]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sun, 14 Aug 2022 07:32:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/iIWDt0fXa84/upload/v1659537874023/FEAEXDHxj.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>Once upon a time, I opened my Jupyter notebook to show a C-suite manager my work on a project. He never asked me to show my work again after that. If you didn't understand what happened here, imagine the following: a passionate fresh grad was asked by a manager to show his work. The young and broke fresh grad, with all excitement, presented his python code. Nobody had acknowledged my existence in this big organisation, and finally it was my moment to show my magic; or so I thought. Presenting my code was the most spectacular mistake I have made in my career so far. As time went by, I became attached to showing my models working via prototypes instead.</p>
<p><strong>Why was this a mistake?</strong> Well, I'm glad you asked. The manager doesn't understand python and couldn't care less about code in general. He was asking for reports, insights, or any form that could represent my findings. This is a manager, not a groupmate you can just show the output of a Jupyter notebook.</p>
<p>The essential question is <strong>how do you show your work and progress as a data scientist?</strong> Here I want to share what I have practised over years of working in the field and proven to be remarkable.</p>
<h2 id="heading-documentation">Documentation</h2>
<p>I hope this does not come as a surprise to you, but documents are by far the best ambassadors of your thoughts. I do believe that writing is not just about explaining our ideas to others; in the first place, writing is thinking clearly. While you write, the process that takes place is building connections among your thoughts. The final output is more than a document; it's also clarity.</p>
<blockquote>
<p>I do believe that writing is not just about explaining our ideas to others, but in the first place, writing is thinking clearly.</p>
</blockquote>
<p><strong>What to write?</strong>
There are several forms of writing or documentation one can do, but I will limit this to 3 types.</p>
<h3 id="heading-one-page">One page</h3>
<p>This document is very simple and straightforward. It is one page that describes what you want to do. You should answer three questions:</p>
<ol>
<li>What is this project about? You can describe the problem and give context.</li>
<li>Why are we doing it? Objectives that you want to achieve.</li>
<li>How are you going to do it? An abstract description with less technical jargon.</li>
</ol>
<p>This document is your first page and the one you will write most often. My rule of thumb is to jump into creating it every time an idea lights up above my head. Often this is what managers will look at to get the gist of what you are working on.</p>
<p><strong>Where to write this?</strong>
You can write this literally anywhere that allows you to share documents or collaborate. It can be as simple as Google docs. I have used <a target="_blank" href="https://www.dropbox.com/paper/start">Dropbox paper</a> and switched to <a target="_blank" href="https://www.notion.so/">Notion</a>. Your company might have a platform for this or you can simply start one.</p>
<h3 id="heading-performance-analysis">Performance analysis</h3>
<p>This is what you should think of as soon as you start making progress on your exploratory data analysis (EDA). More importantly, if you have built some machine learning model, this kind of analysis document is how you keep your stakeholders informed. The medium could be a document page or even a slide deck if you fancy it.</p>
<h3 id="heading-technical-documentation">Technical documentation</h3>
<p>This document is different in nature. The intent here is to write something that describes the technicalities of the project. In this instance, the targeted audience is fellow data scientists and colleagues. Among the points that you should make clear are:</p>
<ol>
<li>The flow of the system or the workflow, preferably drawn visually to help communicate through an image.</li>
<li>The implementation and considerations.</li>
<li>Links to parts of the code or repository.</li>
</ol>
<h2 id="heading-prototyping">Prototyping</h2>
<p>Data science projects under the hood can be a lot of things, from machine learning models and preprocessing steps to the data pipelines necessary to keep the product running. Nonetheless, your work has to be ultra simple and impactful in the beginning. That means thinking of a minimal viable functionality and building it first. Building data products is easier today than ever. There are several tools out there, but I will mention two:</p>
<h3 id="heading-dash">Dash</h3>
<p><a target="_blank" href="https://dash.plotly.com/installation">Dash</a> is a framework to build web-based data applications. Written on top of plotly.js and React.js which makes it ideal for building the web interface of your machine learning models. Dash abstract and full-stack which is empowering for data scientists to build without the need to understand or manage the complexity behind the scenes. We call products built on dash "dash apps or applications".</p>
<h3 id="heading-streamlit">Streamlit</h3>
<p><a target="_blank" href="https://streamlit.io/">Streamlit</a> takes a data scientist to another level of building full stack applications with crisp clean web interfaces. The power of streamlit comes from a huge set of pre-built widgets with the ability to customize and be up and running in a few minutes.</p>
<h2 id="heading-how-to-do-it">How to do it?</h2>
<p>Now that you know the most important tools, I think in a simple scenario things will go along the lines of the following flowchart.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1659771296923/eFdBwFqr_.jpeg" alt="Flowchart.jpeg" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The real-life situation might be different, but I hope this paints the picture for you. One step omitted from the above flowchart is the technical documentation, which can be done once the product is deployed or written iteratively while development is still ongoing.</p>
<p>Through this text I hope you have learnt about:</p>
<ol>
<li>Documentation: types of documents and to whom we write them.</li>
<li>Prototyping: Python Frameworks that are used to build an interactive app.</li>
<li>MVP: Thinking of a minimal viable product for your project.</li>
<li>Workflow: First steps to take when you have a new idea or concept.</li>
</ol>
<p>Remember the following quote by Eric Ries:</p>
<blockquote>
<p>The only way to win is to learn faster than anyone else.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Learning How to Learn: A Guide for Aspiring Data Scientists]]></title><description><![CDATA[Data science is an exciting field with vast areas of applications and skills that you need to acquire. If you are reading this I am sure you have been reading long lists of resources, courses and books. I am trying to do something different here whic...]]></description><link>https://farisology.com/learning-how-to-learn-a-guide-for-aspiring-data-scientists</link><guid isPermaLink="true">https://farisology.com/learning-how-to-learn-a-guide-for-aspiring-data-scientists</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data]]></category><category><![CDATA[Python]]></category><category><![CDATA[python beginner]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sun, 10 Jul 2022 17:43:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/ErO0E8wZaTA/upload/v1657464406890/RVfEyp92H.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data science is an exciting field with vast areas of applications and skills that you need to acquire. If you are reading this, I am sure you have been reading long lists of resources, courses and books. I am trying to do something different here: the minimum viable knowledge to get into data science. Let's call it MVK for data science. To be fair and realistic, I have to tell you the not-so-good news: data science means different things to different people. At some companies it's extensive Excel work, whereas at others it's Spark jobs and parallel processing. In between the two extremes lie most companies, with requirements that vary in technicality.</p>
<p>This MVK guide is meant for mainstream data science practice. It's hard to say what's mainstream today; however, I mean skills and knowledge that are core and without any advanced or special requirements. This curated list of skills and resources is my own, so it represents my experience and personal opinion. Almost every course here is something I have taken personally or recommended to more than one of my mentees.</p>
<blockquote>
<p>Minimum Viable Knowledge (MVK) for data science to help you start solid</p>
</blockquote>
<h2 id="heading-fundamentals">Fundamentals</h2>
<p>These are core data science skills, and a person must be exposed to at least the main concepts. I do think that having an overview is good enough; however, some accumulated depth of knowledge will take you further in your journey. I don't claim that you need to know everything in this guide to land an internship or an entry-level position. Nonetheless, the more you know the higher your chances are, because you place yourself at the top of the list in terms of potential.</p>
<h3 id="heading-sql">SQL</h3>
<p>There is no training data set in real life; that pristine CSV doesn't exist. There could be a data warehouse or multiple data storage systems with hundreds of tables where the organization stores its data. It's your responsibility to dive in and build a dataset for the use case at hand. Since almost every data store has a SQL or SQL-like medium to interact with data, it's essential to know how to SQL your way into these systems.</p>
<p>I would go further to say that you will spend three-quarters of your time interacting with data. Pre- and post-modelling work requires you to perform a variety of tasks, from exploration to analysing data. Being fluent in SQL is essential, and therefore you should get yourself into good SQL practices as you grow (a small sketch of what that looks like in practice follows).</p>
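<p>Here is a minimal sketch of building a dataset with SQL from python. The SQLite database, table names, and columns are all hypothetical stand-ins for your company's warehouse; real systems mostly differ only in the connection:</p>
<pre><code>import sqlite3
import pandas as pd

# Hypothetical local database standing in for a real data warehouse
conn = sqlite3.connect("warehouse.db")

query = """
SELECT c.customer_id, c.signup_date, COUNT(o.order_id) AS n_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.signup_date
"""

# The dataset you build for the use case at hand
df = pd.read_sql(query, conn)
print(df.head())
</code></pre>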
<ol>
<li><a target="_blank" href="https://www.kaggle.com/learn/intro-to-sql">Introduction to SQL from Kaggle is highly recommended</a></li>
<li><a target="_blank" href="https://www.kaggle.com/learn/advanced-sql">Advanced SQL is building on top of the intro course which goes to build your strong SQL foundations</a></li>
</ol>
<h3 id="heading-python">Python</h3>
<p>There are more courses and books for learning python than I could count. So, here is what I think is the best book/course for you: it's whatever you find good for you!! Myself, I loved the book "Python for Informatics" by Charles Severance and was delighted when I found the course on Coursera, "Python for Everybody". The book/course was made for beginners, with curated examples and problems that were interesting enough to keep me engaged. It's a general python course, hence it builds your skills in python as a programming language. I mean, this is not the course that teaches pandas and NumPy.</p>
<p>The book is now mostly known as "Python for Everybody" and is available as a course on most platforms. It's also available for free online, along with its exercises, code, and PDF version.</p>
<ul>
<li>Book PDF:<div class="embed-wrapper"><a class="embed-card" href="http://www.do1.dr-chuck.com/py4inf/EN-us/book.pdf">http://www.do1.dr-chuck.com/py4inf/EN-us/book.pdf</a></div>
</li>
<li>Book Online:<div class="embed-wrapper"><a class="embed-card" href="https://www.py4e.com/html3/">https://www.py4e.com/html3/</a></div>
</li>
<li>Youtube playlist:<div class="embed-wrapper"><a class="embed-card" href="https://www.youtube.com/playlist?list=PLlRFEj9H3Oj7Bp8-DfGpfAfDBiblRfl5p">https://www.youtube.com/playlist?list=PLlRFEj9H3Oj7Bp8-DfGpfAfDBiblRfl5p</a></div>
</li>
<li>Github source code:<div class="embed-wrapper"><a class="embed-card" href="https://github.com/csev/py4e">https://github.com/csev/py4e</a></div>
</li>
</ul>
<p>Now that the python-as-a-programming-language course is out of the way, you may ask: what about a data science-oriented course? This you can learn from different resources as well; however, I would recommend the first two courses from the Coursera specialization "Applied Data Science with Python" by the University of Michigan. There is a somewhat steep learning curve as the learner moves from the first course to the second.</p>
<ul>
<li>Applied Data Science with Python:<div class="embed-wrapper"><a class="embed-card" href="https://www.coursera.org/specializations/data-science-python">https://www.coursera.org/specializations/data-science-python</a></div>
</li>
</ul>
<h3 id="heading-machine-learning">Machine Learning</h3>
<p>The best course in machine learning is the one on Coursera by <a target="_blank" href="https://www.andrewng.org/">Andrew Ng</a>. Since its launch in 2012, millions have taken that course, including me. The second-best course is the new version of the same course by Andrew Ng, launched this year, 2022. It's a remake of the first course that switches to teaching python. I cannot do this course justice no matter what I say. There is more to the course than just the topics; the impeccable delivery of concepts is just unmatched.</p>
<ul>
<li>Machine learning specialization by Andrew Ng:<div class="embed-wrapper"><a class="embed-card" href="https://www.coursera.org/specializations/machine-learning-introduction?">https://www.coursera.org/specializations/machine-learning-introduction?</a></div>
</li>
</ul>
<h3 id="heading-statistics">Statistics</h3>
<blockquote>
<p>Machine learning is the new statistics</p>
</blockquote>
<p>Regardless of how true the statement is, what I want you to recognise here is the correlation. This correlation hints at the connection between the two fields. Therefore, having a solid foundation in statistics will strengthen your ability to understand a lot of the inner workings of machine learning algorithms. Here are a few courses that I found to be interesting:</p>
<ul>
<li><p>Introduction to Statistics by Stanford (Free course):</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.coursera.org/learn/stanford-statistics">https://www.coursera.org/learn/stanford-statistics</a></div>
</li>
<li><p>Statistics concepts explained:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/playlist?list=PLVf6vX-h44c2KX5ydZ5NCyy5ly_elT9QT">https://www.youtube.com/playlist?list=PLVf6vX-h44c2KX5ydZ5NCyy5ly_elT9QT</a></div>
</li>
</ul>
<p>I believe that having a grasp of the concepts is essential to make sense of various topics and new concepts in data science. It also helps build good intuition. Though knowing the deeper details and how to calculate things is important, you can understand more as you gradually get exposed to more cases and problems.</p>
<h2 id="heading-auxilary-skills">Auxilary skills</h2>
<p>Some skills make you a better candidate or set you up for success in data science. They are not talked about a lot, but they are expected of anyone working in the data field, or even in tech generally. Here are the most important ones you should pay attention to.</p>
<h3 id="heading-communication">Communication</h3>
<p>Data scientists do various tasks in various organizations, but there is one task they have in common: they are in constant need to present results and insights. This single skill draws the distinction between data scientists. In data science, finesse is not only about building the 98%-accurate model, but about rallying everyone to ship a data product into production. You need to be a great communicator to be able to drive projects and make people in your organization take your insights further into implementation.</p>
<p>Communication to me has the following aspects:</p>
<ol>
<li>Understanding the matter at hand. You are the subject matter expert and you should understand the matter more than anyone in the room.</li>
<li>Content structure. Your ideas need to have a flow that connects and makes a cohesive story. The results have to be organized in a way that is smooth and linked.</li>
<li>Delivery. Here lies the make-or-break of your quest: the ability to deliver results or insights with as little friction as possible. Success here could be in the form of people's positive reactions or questions about how to move to the next step.</li>
</ol>
<p>There are no courses to recommend, it's more of a process and continuous learning. However, here are a few tips that helped me along the way.</p>
<ol>
<li>Learn in public by tweeting, and <a target="_blank" href="hashnode.com">start a blog</a> on Hashnode to document the topics you learn. This will connect you to more people and build your network.</li>
<li>Get involved in the tech community in your town or online via discord and others. Step up at times and explain what you have learned over the weekend in an hour virtual call. The more you speak in meetups the better you communicate.</li>
<li>Try to speak to people from non-tech backgrounds and perhaps try to explain AI to them in simple terms. This helps you learn to abstract complexities, and people will like you for it.</li>
</ol>
<p>Bonus video from MIT which I found to be interesting and helpful:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/Unzc731iCUY">https://youtu.be/Unzc731iCUY</a></div>
<h3 id="heading-git-and-version-control">Git and version control</h3>
<p>Version control systems are nothing new, and I thought not to include them here. However, after seeing the number of beginners struggling to make sense of them during internships, I thought to bring it up. You must have the basic idea and know your way around GitHub. You need to know three basic operations: creating a repository, cloning, and pushing changes.</p>
<p>Git and Github crash course:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/RGOj5yH7evk">https://youtu.be/RGOj5yH7evk</a></div>
<h3 id="heading-command-linebash">Command line/bash</h3>
<p>Working with the command line is a good skill to add. Since most of the production servers are Linux based, your familiarity with the command line will help you do more.</p>
<p>Hands-on Introduction to Linux Commands and Shell Scripting:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.coursera.org/learn/hands-on-introduction-to-linux-commands-and-shell-scripting">https://www.coursera.org/learn/hands-on-introduction-to-linux-commands-and-shell-scripting</a></div>
<h2 id="heading-whats-next">What's next?</h2>
<p>Continue learning, and understand the style of learning that works best for you. In my journey I found case-based learning to be the best. For example: I might have a problem where my model reaches 100% accuracy in training but drops to 60% in testing. I can go from there and work through everything needed to solve this problem. That's where most of the concepts I learned theoretically start to make sense and become solid.</p>
<h2 id="heading-at-last">At last</h2>
<blockquote>
<p>If you are wondering where the math, calculus, and algebra courses are: well, don't worry about it.</p>
</blockquote>
<p>In my personal opinion:</p>
<ul>
<li>You need math when you feel your math background doesn't help you understand a concept you are currently learning. At that point, go to YouTube and understand that concept.</li>
<li>You don't need to build algorithms from scratch; it's more important to understand the intuition of algorithms. Coding them from scratch could be a hobby, but don't expect people to be building algorithms from scratch in their day-to-day jobs.</li>
</ul>
<p><em>I hope this short list is minimal enough and yet meaningful to help you get started or perhaps brush up on some skills. If you have taken any of the courses listed here feel free to comment and share your thoughts.</em></p>
]]></content:encoded></item><item><title><![CDATA[3 Lessons I Learned Building Customer-Facing Data Science Products]]></title><description><![CDATA[Photo by Isaac Smith on Unsplash
When I took my first ML course (By Andrew Ng) in 2015, data science wasn’t something that rings bells, at least not for me. Back then, I was just curious about data, and the possibilities it could unleash, I had parti...]]></description><link>https://farisology.com/3-lessons-i-learned-building-customer-facing-data-science-products-eb044076c270</link><guid isPermaLink="true">https://farisology.com/3-lessons-i-learned-building-customer-facing-data-science-products-eb044076c270</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Product Management]]></category><category><![CDATA[development]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Sun, 23 Jan 2022 07:57:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916866006/NfgevzhZC.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a target="_blank" href="https://unsplash.com/@isaacmsmith?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Isaac Smith</a> on <a target="_blank" href="https://unsplash.com/s/photos/data?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>When I took my first ML course (by Andrew Ng) in 2015, data science wasn't something that rang bells, at least not for me. Back then, I was just curious about data and the possibilities it could unleash. I had participated in an analytics hackathon where I somehow managed to place 3rd. That was not an expected outcome, considering the little skill and knowledge I had. Learning hive queries on the weekend was an odyssey, and Hortonworks did not even work on my laptop. I felt the least prepared, and the little SQL along with some reading on Big Data was all I had in my toolbox. Winning that day led me to believe I should pursue a career in data.</p>
<p>Back then, there were only two ways in my mind in which data could deliver value: either creating analytics or serving predictions. Six years later, I have come to learn even more than I knew possible. The data science field exploded my simplified idea of just analytics and predictions, and I've had opportunities to pioneer some of these in the companies I've worked at.</p>
<p>My first customer-facing data product was a recommendation engine, which I have written about before <a target="_blank" href="https://medium.com/fave-engineering/beneath-the-hood-fave-personalization-39d54b136fed">here</a>. I worked on several more projects later on, around recommendations, a relevancy-ranking engine for a search tool, and some others. These were all full-fledged customer-facing data science applications, and as I developed them, I came to appreciate the distinct lessons that come from building data-fuelled tools that influence real-life users' experiences. Here I will share the most important lessons from my journey so far.</p>
<blockquote>
<p>A <a target="_blank" href="https://www.tableau.com/learn/whitepapers/turn-data-products-data-scientist-data-business-owner#:~:text=A%20data%20product%20is%20an,improve%20their%20decisions%20and%20processes.&amp;text=Reaching%20business%20objectives%20through%20informed,main%20driver%20for%20company%20adoption.">data product</a> is described as “an application or tool that uses data to help businesses improve their decisions and processes”. — Dr. Carlo Velten.</p>
</blockquote>
<h4 id="heading-take-a-hybrid-mindset">Take a hybrid mindset</h4>
<p>Many new data learners arrive with a fascination for complex machine learning (ML) models. They want to try deep neural networks on most problems, and because of that, hyperparameter tuning often comes before any thought of feature engineering. I would not claim that the <a target="_blank" href="https://neptune.ai/blog/data-centric-vs-model-centric-machine-learning">data-centric approach</a> is the best, but I would still advocate thinking about your data first. In many industry problems, algorithms can only take you halfway. Going the rest of the way lies in the details of your data and the quality of the features you have crafted.</p>
<blockquote>
<p>Prof Andrew Ng holds similar views; see <a target="_blank" href="https://www.youtube.com/watch?v=06-AZXmwHjo&amp;t=69s">his excellent talk on data-centric AI</a>.</p>
</blockquote>
<iframe src="https://www.youtube.com/embed/06-AZXmwHjo?feature=oembed" width="700" height="393"></iframe>

<p>Let's be real: context matters most, and a data scientist should optimize their work around the problem they are trying to solve. Constraints such as needing rapid results can determine which options you are able to explore for a particular problem.</p>
<p>When we want fast results, engineering new features or devising preprocessing techniques might be the less efficient way forward. Despite my affinity for the data-centric approach, one should understand that some data issues cannot be fixed sustainably downstream. So be context-aware: there will be times when shipping a working product is what matters first.</p>
<p>In the early phase of a project, such as when you are building a proof-of-concept (POC), take the path of least resistance. Getting buy-in is more important than an extra percentage point of model accuracy. If you are building for production, you can estimate which approach is likelier to improve performance and optimize with that in mind. In future iterations you can validate both data- and model-centric approaches; that is when you can adopt a hybrid approach and explore the best ideas.</p>
<h4 id="heading-business-metrics-first">Business metrics first</h4>
<p>Let me put it this way: business is king. I have spent too much time pointing out my RMSE or F1 scores to people who don't care. Data scientists are taught on Kaggle and in academia to strive for the extra X% of accuracy, but in industry, data scientists are expected to deliver value and impact. Statistical metrics do not show value or impact to your business stakeholders; they don't understand how your RMSE translates into metrics they care about.</p>
<blockquote>
<p>As the field matures and companies focus on the usefulness of data science, one should pay more attention to building business acumen and learning how to prove the impact of one's work.</p>
</blockquote>
<p>This is one of the things that took me longest to learn. In fact, it took me longer than learning backpropagation and gradient descent combined. You have to find the language and the framing to articulate your model's impact in terms of revenue, clicks, and conversion.</p>
<h4 id="heading-ui-matters-so-much">UI matters so much</h4>
<p>Fine-tuning parameters and feature engineering can take you far, but it doesn't matter if the user's perception of the product, or its appeal, is lacking. This portion of a data science project is the toughest, as it requires A/B testing and synergy with the product and developer teams to facilitate the experiments. Customer-facing products need the customer's attention as a prerequisite, and in my experience you can drive significant impact from improvements to the UI alone.</p>
<p>Additionally, the positioning and the wording of the product title are all part of creating that appeal. It goes without saying, but the excitement of having a project out in production can blind a data scientist to these details. As an example, my recommendation carousel titled “items you may like” produced mediocre returns. Interactions spiked after simply tweaking the shading of the carousel and retitling it “For you”, which conveys the personal touch of the section.</p>
<h4 id="heading-finally">Finally</h4>
<p>There are many more lessons, but these 3 were the most expensive in terms of time and perhaps the most underrated. During this journey there were times of frustration when I thought my manager or stakeholders were unreasonable. It was tough to work for weeks only to have deployment held back. Only when I allowed myself to see past my own work did I learn what my humble self should have realized earlier. It took me time to learn some of the rules, and I hope you benefit from them to play a better game in your next project.</p>
]]></content:encoded></item><item><title><![CDATA[Beneath the hood: Fave Personalization]]></title><description><![CDATA[Photo by Stephen Leonardi on Unsplash
At Fave, we in the data science team actively research and innovate new ways to contribute to helping more SMEs become digital. The transformed data team went through a courageous process to expand its impact fro...]]></description><link>https://farisology.com/beneath-the-hood-fave-personalization-39d54b136fed</link><guid isPermaLink="true">https://farisology.com/beneath-the-hood-fave-personalization-39d54b136fed</guid><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Wed, 11 Nov 2020 08:48:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916890695/a-HrAVbQE.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Photo by <a target="_blank" href="https://unsplash.com/@stephenleo1982?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Stephen Leonardi</a> on <a target="_blank" href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p>
<p>At Fave, we in the data science team actively research and innovate new ways to help more SMEs go digital. The transformed data team went through a courageous process to expand its impact from business intelligence and reporting to building data products. Personalisation stands today as one of Fave's essential data products, integrated across our consumers, merchants and the Fave app. This motivated us to share how we built a robust recommendation engine in-house, given the unique characteristics of Fave's offerings and products.</p>
<p>Fave's most known products are FavePay, a mobile payment method designed to allow users to transact with offline merchants easily, and <a target="_blank" href="https://myfave.com/kuala-lumpur/offers">Fave Deals</a>. The hyper-growth of Fave as a platform and the ubiquity of FavePay in Malaysia and Singapore today bring with them new challenges. In the early days, products (deals/discounts) were listed via hand-crafted localisation logic that brought the highest-selling products to the top of the list for each user in a specific neighbourhood (around the user's location). It worked, but it couldn't scale as Fave's inventory covered more and more categories.</p>
<p>As we grew, we observed a few shortcomings and opportunities:</p>
<ol>
<li>We have a long tail problem (a minority of items outperform the vast majority of items).</li>
<li>New deals will get a slim opportunity to be seen by users compared to older top selling deals.</li>
<li>We have so many offerings our users don’t know exist.</li>
</ol>
<p><strong>Long-tail</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916882461/7PEM0tZsr.png" alt /></p>
<p>As the platform grew in its offerings, fewer and fewer people could know everything they might find in our app. Users are fed the highest-selling items at the top of the list, so they become fixated on the specific deals made visible to them, which in turn feeds sales to that small subset of deals.</p>
<p>It's a real example of an iceberg, where your customers can only see the tip of what is on offer. Even as we actively push other products via marketing channels, the body of the iceberg will go unnoticed as the platform's inventory grows.</p>
<h4 id="heading-solution">Solution</h4>
<p>In our analysis of the problem, we saw that our early logic was feeding the problem by increasing the visibility of a small set of top-selling products. The optimal solution would be to curate products based on something better than location, and definitely not at random. In the age of data science, that is what a recommendation engine achieves. Personalisation could be one prescription to improve many aspects of our growth: increasing the visibility of more deals raises the potential for conversion while offering each user a unique set of deals with the highest likelihood of being favoured.</p>
<h4 id="heading-challenges">Challenges</h4>
<p>Like most great ideas, there is always a but. Recommending Fave Deals, unlike Netflix's movie recommendations or Amazon's book recommendations, comes with constraints that are unique to Fave.</p>
<p>At Fave, personalisation should adhere to the following constraints:</p>
<p><strong>1.</strong> <strong>Live deals only</strong>; our deals are short-lived, and given that Fave is an everyday lifestyle app, we cannot afford to recommend a product that is no longer live.</p>
<p><strong>2.</strong> <strong>Location proximity</strong>; as a lifestyle app, insights on user behaviour inform us how our users launch the app and transact. Users won't travel far for a discount; we help users save by discovering deals around them.</p>
<p><strong>3.</strong> <strong>Affordability</strong>; relevancy becomes pocket-size too. The price range of deals suggested to a user must not exceed what they can afford.</p>
<p><strong>4.</strong> <strong>Dietary restrictions</strong>; as much as possible, we optimise for dietary restrictions. For example, we won't recommend meals that include pork or alcohol to Muslim users.</p>
<p>There are a few tools on the market that offer a recommendation engine as a service, but our use case and circumstances are not necessarily served by such a service: it generally cannot solve issues that arise in our specific context. For example, <a target="_blank" href="http://surpriselib.com/"><em>SurPRISE</em></a> algorithms were handy to use but could not optimise for Fave's constraints, and therefore performed poorly on relevancy metrics.</p>
<blockquote>
<p>We cannot simply pick a recommendation service or a model from a framework. Though that is easier to do, it proved to score low on relevancy metrics.</p>
</blockquote>
<h4 id="heading-approach">Approach</h4>
<p>Our solution was to implement a hybrid form of collaborative filtering (CF). CF generates predictions about a user's interests by looking at preference information collected from a group of users, hence the "collaborative". Bear in mind that these predictions (recommendations) are for a particular user, despite using information discovered from the group.</p>
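<p>To make the idea concrete, here is a toy, from-scratch sketch of user-based collaborative filtering (illustrative only, not our production engine; the interaction values are made up): we score deals for a user by weighting other users' interactions by how similar those users are to them.</p>
<pre><code>import numpy as np

# Toy user-item interaction matrix: rows = users, columns = deals.
# 1 = purchased/clicked, 0 = no interaction (illustrative values only).
interactions = np.array([
    [1, 0, 1, 0],   # user A
    [1, 1, 1, 0],   # user B
    [0, 0, 0, 1],   # user C
])

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Score deals for user A by borrowing preferences from similar users.
target = interactions[0]
sims = np.array([cosine_sim(target, other) for other in interactions])
scores = sims @ interactions          # similarity-weighted sum of interactions
scores[target > 0] = -np.inf          # don't re-recommend what A already bought
print("recommend deal index:", int(np.argmax(scores)))   # deal 1
</code></pre>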
<p>At the end of each day, the engine is fed with a set of data containing the following:</p>
<ul>
<li>Purchase date/time</li>
<li>User behaviour in the app</li>
<li>Clicked image</li>
<li>Location</li>
<li>Price range</li>
<li>Merchant name</li>
<li>Category of the item</li>
<li>Demographics</li>
<li>Device information</li>
</ul>
<p>The model consumes two types of data: explicit data, which is simply transactional in nature (which deals has the user purchased in the past?), and implicit data, such as the items/images the user has clicked on. By combining explicit and implicit data we cover a wider spectrum of the user's behaviour, tastes and preferences. There will always be a wealth of implicit data in every organisation, as users browse through several items and perhaps add a few to their wish lists before transacting; as a result, the ratio of explicit to implicit data is always 1:m. It is therefore strategically important to find ways to incorporate such implicit data into our systems and capitalise on it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916883772/iHRX4oa-o.png" alt /></p>
<p>Fave recommendation engine concept</p>
<p>The model builds a user profile in a multi-dimensional space, where each user is represented by a matrix of what they have purchased or shown interest in. Successfully applying machine learning depends on the data and the way the data has been treated: “<em>Applied machine learning is basically feature engineering</em>” (<strong>Andrew Ng</strong>). So a user matrix can only be as good as the data we use to build it.</p>
<h4 id="heading-semantic-encoding">Semantic encoding</h4>
<p>Features are what make machine learning algorithms work, and domain knowledge is the backbone of an effective feature engineering process. Let's take a simple example from our data: the category feature of each item is descriptive nominal data, like the food category or the beauty category. In a typical machine learning problem we would use an encoding method to translate the nominal categorical values into numeric values. One-hot encoding is commonly used for this purpose, but it is limited in the depth of information it can represent in our context.</p>
<p>Let’s think of a deal for a 2 person dinner in a Thai restaurant. It belongs to the food category. How would you compare this deal to another deal from the drinks category?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916885286/SiUknxxgn.png" alt /></p>
<p>one-hot encoding problem</p>
<blockquote>
<p>If Joe bought the 2-person Thai dinner (food category) and Kamilla bought a deal for a 2-day staycation (travel category), and we have the signature drink deal to recommend, which user is closer to the signature drink?</p>
</blockquote>
<p>Looking at the image, you can see that the first two deals are similar, but that similarity isn't represented in the encoding. They are food and beverages respectively, yet if we encode using the one-hot method we end up at a totally different conclusion mathematically. To see this, calculate the Euclidean distance between food and drink (we <a target="_blank" href="https://www.calculatorsoup.com/calculators/geometry-solids/distance-two-points.php">calculate</a> the first 3 values for simplicity): d = 1.414214. If you use the same calculator to compute the distance between food and travel, d is also 1.414214. By this calculation you can see that one-hot encoding adds a naive dimension without carrying any semantic information (human knowledge).</p>
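<p>You can verify this in a few lines of numpy; with one-hot vectors, every pair of distinct categories sits at exactly the same distance:</p>
<pre><code>import numpy as np

# One-hot category vectors (first 3 dimensions, as in the example above).
food   = np.array([1, 0, 0])
drink  = np.array([0, 1, 0])
travel = np.array([0, 0, 1])

print(np.linalg.norm(food - drink))   # 1.4142...
print(np.linalg.norm(food - travel))  # 1.4142... identical, even though food
                                      # and drink are far more related in reality
</code></pre>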
<p>This problem can be solved easily by using some function that calculates the semantic similarity between the two category texts. In this context, imagine something like:</p>
<blockquote>
<p><em>f(drink, food) = 0.79</em></p>
</blockquote>
<p>Now the drink deal vector will look like this:</p>
<blockquote>
<p><em>[0.79, 1, 0 ,0]</em></p>
</blockquote>
<p>In our human knowledge, food and drink are similar, and the one-hot encoding method alone will not capture that. Semantic similarity, obtained by mapping human language into a higher-dimensional space, gives our basic encoding the detail it needs to be meaningful.</p>
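<p>A quick sketch of how plugging in the illustrative <em>f(drink, food) = 0.79</em> value changes the geometry (the similarity function itself could be any text-similarity model; only the 0.79 value comes from the example above):</p>
<pre><code>import numpy as np

# One-hot vectors over [food, drink, travel, beauty].
food  = np.array([1.0, 0.0, 0.0, 0.0])
drink = np.array([0.0, 1.0, 0.0, 0.0])

# Enrich the drink vector with the semantic similarity f(drink, food) = 0.79.
drink_semantic = np.array([0.79, 1.0, 0.0, 0.0])

print(np.linalg.norm(food - drink))           # 1.414: all categories equidistant
print(np.linalg.norm(food - drink_semantic))  # ~1.02: drink now sits closer to food
</code></pre>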
<p>What is the harm of using pure one-hot encoding? In an engine that doesn't understand some aspects of the real world, we risk greater irrelevancy in our recommendations. Left with vague representations, the machine will fail to make sensible associations and will contradict our human world, and that defies the purpose of building a recommendation engine: generating a relevant and interesting list of products for the user.</p>
<blockquote>
<p>The machine, if left with vague representations, will fail to make sensible associations and will contradict our human world</p>
</blockquote>
<p>Localisation of recommendations is another benefit, gained by enhancing the encoding with location information. If we have 3 unique locations, <em>Cyberjaya, Putrajaya,</em> and <em>Dengkil</em>, and our deal's location is tagged as <em>Dengkil</em>, we want our encoding to represent distances, not arbitrary categorical values.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916886668/p6gpe3Z0E.png" alt /></p>
<p>Distance matrix</p>
<p>With such a distance-based encoding, a deal tagged <em>Cyberjaya</em> gets the vector <em>[0, 3, 11]</em> (you may even scale the values between 0–1). This allows for a more realistic representation of the real world.</p>
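<p>A minimal sketch of that distance-based location encoding (only the <em>Cyberjaya</em> row [0, 3, 11] comes from the matrix above; the other rows are illustrative):</p>
<pre><code>import numpy as np

# Pairwise distances (km) to [Cyberjaya, Putrajaya, Dengkil].
# Only the Cyberjaya row is taken from the text; the rest are made up.
dist = {
    "Cyberjaya": np.array([0.0, 3.0, 11.0]),
    "Putrajaya": np.array([3.0, 0.0, 9.0]),
    "Dengkil":   np.array([11.0, 9.0, 0.0]),
}

def encode_location(name, scale=False):
    v = dist[name]
    return v / v.max() if scale else v  # optionally scale into 0-1

print(encode_location("Cyberjaya"))              # [ 0.  3. 11.]
print(encode_location("Cyberjaya", scale=True))  # ~[0. 0.27 1.]
</code></pre>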
<p>Fave's RecSys owes its success to the amount of attention paid to such details, and these details are best known by the team most familiar with the data. Our engine is localised: it understands that you should be recommended something in Georgetown, Penang if you have been shopping around there, and this is what brings our users back for more interesting deals.</p>
<h4 id="heading-workflow">Workflow</h4>
<p>The whole workflow from data to recommendations is fully automated, and we leverage Airflow as our workflow orchestration system. It takes 12 steps to reach the final recommendations that can then be consumed by the product. At Fave we use multiple channels to deliver a personalised experience: direct marketing emails (EDMs), push notifications, the home page of the Fave consumer app, and every other channel we can.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916888043/duZ79h7dS.png" alt /></p>
<p>Fave RecSys workflow</p>
<p>In the early days of developing the engine, we tried generating recommendations on the fly whenever a user launched the app. This approach was limiting in several ways; for instance, it worked only within the app and could not serve other channels. Changing our approach to pre-generating recommendations for all users at least once a day, and storing them in an inventory, brought several benefits. Another change was to utilise real-time data delivered by our <a target="_blank" href="https://medium.com/fave-engineering/how-python-novices-revolutionised-faves-data-warehouse-with-on-the-fly-transformations-1520c24c160c">on-the-fly transformations solution</a>, which made it possible for tonight's recommendations to train on this morning's user-app interactions.</p>
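<p>For readers who haven't used Airflow, the shape of such a nightly pipeline looks roughly like this (a condensed, hypothetical sketch assuming Airflow 2.x; the real DAG has 12 steps, and the task names and logic here are illustrative, not Fave's actual code):</p>
<pre><code>from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task bodies; each would hold the real step's logic.
def extract_interactions(): ...  # pull yesterday's explicit + implicit events
def build_features():       ...  # semantic / location / price encodings
def train_and_score():      ...  # fit CF model, pre-generate recommendations
def publish_inventory():    ...  # write recommendations for app, EDMs, pushes

with DAG(
    dag_id="recsys_nightly",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract_interactions)
    t2 = PythonOperator(task_id="features", python_callable=build_features)
    t3 = PythonOperator(task_id="train_score", python_callable=train_and_score)
    t4 = PythonOperator(task_id="publish", python_callable=publish_inventory)
    t1 >> t2 >> t3 >> t4
</code></pre>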
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916889377/8F30kf3bb.png" alt /></p>
<p>Personalisation across channels has increased our conversion by 7x</p>
<p>This inventory of recommendations made it possible to curate personalised content for EDMs and to send time-optimised push notifications, for example sending lunch-deal recommendations just before lunchtime. Integrating personalisation with marketing also allowed the engine to be tweaked to focus on specific categories, price ranges, or even specific types of deals.</p>
<h4 id="heading-summary">Summary</h4>
<p>In our first beta test we couldn't immediately recognise the difference personalisation made, and we thought we needed more time to monitor and validate the impact. For a data science project to come out of its POC phase and pass several experiments in production is a huge leap for us. In fact, this is the first data project our team successfully productionised, and Fave's first ML model in the app. Getting this far is not the rule but the exception, since <a target="_blank" href="https://medium.com/analytics-and-data/challenges-moving-data-science-proof-of-concept-to-production-458d89b6a9a1">most data science projects don't go beyond the POC phase</a>. The challenge isn't just the predictive power of the model; organisational and contextual factors also affect how likely a data science project is to be productionised.</p>
<p>This project has matured over the course of several months and we are excited to democratise personalisation across Fave’s offerings. Today our users across 3 countries receive personalised EDMs and browse through a personalised home page. The future is personalised and so is your Fave experience.</p>
<p><em>Acknowledgement: the first ever attempt for recommendation in Fave was by</em> <a target="_blank" href="https://www.linkedin.com/in/evonnesoon93/"><em>Evonne</em></a> <em>and that paved the way for me to build upwards.</em> <a target="_blank" href="https://www.linkedin.com/in/husein-zolkepli/"><em>Husein Zolkepli</em></a> <em>supported optimisations and scaling of the project. In Fave’s Data Science team, we are empowered to try new things and learn all the time and that is to the credit of</em> <a target="_blank" href="https://www.linkedin.com/in/jatinsolanki/"><em>Jatin Solanki</em></a><em>. Over the course of this project there were countless others who influenced and supported this endeavour and I am grateful for every single one of you.</em></p>
]]></content:encoded></item><item><title><![CDATA[The easier way to handling large files in pandas]]></title><description><![CDATA[Cute panda from pixabay
I have been using a high-performance laptop for some time and EC2 machine for when I needed high computing power. I never encountered issues of performance for as long as I remember with both tools giving me quite a comfortabl...]]></description><link>https://farisology.com/the-easier-way-to-handling-large-files-in-pandas-ad66c68d9936</link><guid isPermaLink="true">https://farisology.com/the-easier-way-to-handling-large-files-in-pandas-ad66c68d9936</guid><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Mon, 31 Dec 2018 10:32:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916877522/MLypGaPo4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cute panda from pixabay</p>
<p>I had been using a high-performance laptop for some time, and an EC2 machine whenever I needed more computing power. For as long as I can remember I never ran into performance issues, with both tools giving me quite comfortable processing power. Things started to look awfully different when I downgraded my MacBook and could no longer enjoy the privilege of having high computing power at my disposal.</p>
<p>This forced me to dig in and learn more about the pandas library. So if you are looking for tips on handling multi-gigabyte files in pandas, this will help you. To my surprise, batch processing wasn't the easiest way to handle a large file; most of the time you are really looking for a way to reduce the file's size without resorting to batch processing.</p>
<h3 id="heading-data-types-in-pandas">Data types in Pandas</h3>
<p>As you know, in Python we have several variants of the same numeric type. For integers we have <strong>int8, int16, int32, int64</strong>, and for floats we have <strong>float16, float32, float64</strong>. The larger variants carry extra memory capacity that we may or may not need. By default, pandas reads your data in the 64-bit variant, so an age column whose values range from 1–99 will still be assigned the int64 data type.</p>
<p>Int64 is 8 bytes and can store values from <strong>-9223372036854775808</strong> up to <strong>+9223372036854775807</strong>. The question is: do you really need int64 for your age column? This is the question you have to ask for each column in your dataset. Once you know the answer, it becomes easy to shrink the file into something your machine can hopefully handle.</p>
<blockquote>
<p>In a nutshell, you need to give each column the proper data type instead of the pandas default type.</p>
</blockquote>
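<p>In practice that means declaring dtypes yourself, for example when reading the file (a small sketch; the file name and columns are hypothetical):</p>
<pre><code>import numpy as np
import pandas as pd

# Declare smaller dtypes upfront instead of letting pandas default to 64-bit.
df = pd.read_csv('people.csv', dtype={'age': np.int8, 'visits': np.int16})

# Or downcast an existing column after loading:
df['age'] = df['age'].astype(np.int8)
print(df.memory_usage(deep=True))
</code></pre>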
<h3 id="heading-shrinking-your-file-by-875">Shrinking your file by 87.5%</h3>
<p>I have a CSV file <strong>2.2 gigabytes</strong> in size, of shape <strong>117180 × 2502</strong>, holding a sparse matrix that takes a long time to load each time I run my Jupyter notebook. I will demo how I shrink this file into something my machine can handle.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916871389/coU9Lu2Ae.png" alt /></p>
<p>Loading my file (dummy data)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916872576/Pr4RgOicS.png" alt /></p>
<p>my dataset as it looks</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916873811/EJuHPhTyp.png" alt /></p>
<p>a few more columns</p>
<p>As you can see, the majority of the columns hold small or zero values, and just two columns hold integers that are not large per se. The <strong>int16</strong> data type can store values from <strong>-32768 to +32767</strong>, so we could technically use it for the <strong>A</strong> and <strong>B</strong> columns in this dataset.</p>
<p>Let’s look at the data frame info:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916874994/vnW7M66vg.png" alt /></p>
<p>Look at the dtypes</p>
<p>We can check the file size and convert it to gigabytes using the following:</p>
<pre><code>from sys import getsizeof

start_size = getsizeof(data)/(1024.0**3)
print('Dataframe size: %2.2f GB'%start_size)
</code></pre>
<h3 id="heading-converting-column-data-types-in-pandas">Converting column data types in pandas</h3>
<p>In my case there are many columns, 2502 to be exact, and there is no way I can go through each one individually, so I used a for loop to convert all the columns of the data frame:</p>
<pre><code>for col in data.columns:
    data[col] = pd.to_numeric(data[col], downcast='signed')
</code></pre>
<p>Now you may ask: what is the downcast parameter? If a column holds integers, downcast shrinks it to the smallest integer variant that can hold its values. Float columns are downcast to float32 at the smallest with <em>downcast='float'</em>; with <em>downcast='signed'</em>, float columns whose values are all whole numbers (like the sparse matrix above) are converted losslessly to the smallest signed integer type, which is where the big savings come from.</p>
<blockquote>
<p>downcast takes only the following values: <em>{‘integer’, ‘signed’, ‘unsigned’, ‘float’}</em></p>
<p><em>-</em> ‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)</p>
<p>- ‘unsigned’: smallest unsigned int dtype (min.: np.uint8)</p>
<p>- ‘float’: smallest float dtype (min.: np.float32)</p>
</blockquote>
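<p>A quick demo of what each downcast value does on small series (output dtypes follow the rules quoted above):</p>
<pre><code>import pandas as pd

s_int = pd.Series([1, 25, 99])        # inferred as int64
s_flt = pd.Series([0.0, 1.0, 3.0])    # whole numbers stored as float64

print(pd.to_numeric(s_int, downcast='signed').dtype)    # int8
print(pd.to_numeric(s_int, downcast='unsigned').dtype)  # uint8
print(pd.to_numeric(s_flt, downcast='float').dtype)     # float32
print(pd.to_numeric(s_flt, downcast='signed').dtype)    # int8: the conversion
                                                        # is lossless, so floats
                                                        # become small ints
</code></pre>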
<p>If you were to use the to_numeric method per column instead, you would write something of the form:</p>
<pre><code>bigfile['A'] = pd.to_numeric(bigfile['A'], downcast='signed')
bigfile['B'] = pd.to_numeric(bigfile['B'], downcast='signed')
</code></pre>
<p>For my 2.2-gigabyte file the for loop took quite some time, which is why you should look for faster ways to perform the conversions. The result was that my file's size went down to:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916876207/lDUEh9n-kY.png" alt /></p>
<p>Do you want to know the percentage?</p>
<pre><code>final_size = getsizeof(data)/(1024.0**3)  # recompute after the downcast
print('total size reduction: %2.1f'%((1-final_size/start_size)*100))
</code></pre>
<h3 id="heading-size-reduce-by-875"><strong>Size reduce by 87.5%</strong></h3>
<p>For me this solved the issue: I cannot let pandas assign the 64-bit types to all my columns, because most of the time the values are far too small for that, so efficient use of memory is the solution for now.</p>
<p>I hope this helps you navigate around large files (large for modest machines, I mean) so you can crunch more data.</p>
<h3 id="heading-have-a-happy-new-year">Have a happy new year….</h3>
<p>2019</p>
]]></content:encoded></item><item><title><![CDATA[The easiest way to build an API for your Machine Learning model]]></title><description><![CDATA[image from the public domain
Among the most frequent questions I get asked about is what is next after We have prepared the data and trained a model and went further to test the performance? Usually in academia we stop at this point and it’s time to ...]]></description><link>https://farisology.com/the-easiest-way-to-build-an-api-for-your-machine-learning-model-a49890b72cb0</link><guid isPermaLink="true">https://farisology.com/the-easiest-way-to-build-an-api-for-your-machine-learning-model-a49890b72cb0</guid><dc:creator><![CDATA[Fares Hasan]]></dc:creator><pubDate>Fri, 09 Nov 2018 15:38:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916905405/aOlDUJm77.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>image from the public domain</p>
<p>Among the questions I get asked most frequently: what comes next after we have prepared the data, trained a model, and tested its performance? In academia we usually stop at this point, and it's time to put together a research paper that will hopefully get accepted with less-than-brutal comments from reviewers. Lately, more academic work has been made available openly, which has benefited the community and brought more enthusiasts to tinker with these models. On the other hand, there are software engineers who get stuck, express <a target="_blank" href="https://www.radicalcandor.com/blog/tag/silicon-valley/">radical candor</a>, and call machine learning useless. So if you are stuck among people practicing the academic concern of building a model that performs 1% higher in accuracy, this post is your way to get your models up for deployment as an API.</p>
<h3 id="heading-intention">Intention</h3>
<p>My intention in sharing this is to demo how easy it actually is to deploy your model: in 30 lines of code your model will be ready to serve, a first step towards scaling, which I believe is just as important. I don't claim that this model or this API is production-ready; it is just one step, and there is certainly more to production than building a simple API. This demo is a simplification of how your model can take its first step. At the end of this post I will tell you what needs to be done before you actually go live with such an API.</p>
<h3 id="heading-preface">Preface</h3>
<p>A typical machine learning workflow will look like the diagram below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916895733/mOja-z58F.jpeg" alt /></p>
<p>For the sake of this quick demo we will move fast through the steps, with fewer details and processes than a real-world project actually takes. Get your Jupyter notebook up and follow the next steps.</p>
<h3 id="heading-1-getting-the-data">1- Getting the Data.</h3>
<p>The dataset we are going to use is originally from <a target="_blank" href="https://archive.ics.uci.edu/ml/datasets/car+evaluation">here</a>, but since I can no longer access that page, you can get it from <a target="_blank" href="https://www.kaggle.com/elikplim/car-evaluation-data-set">kaggle</a>. The kaggle version is in comma-separated values (CSV) format, whereas the version I have is a txt file. The dataset has 7 columns:</p>
<ol>
<li>buying price.</li>
<li>maintenance cost.</li>
<li>number of doors.</li>
<li>number of persons.</li>
<li>lug_boot.</li>
<li>safety.</li>
<li>target or decision column.</li>
</ol>
<p>Basically, these features describe the car's condition, and the last column categorises the car as <em>unacceptable, acceptable, good, or very good</em>. In this problem we are trying to build a model that predicts the decision for new data as one of the four categories.</p>
<h3 id="heading-reading-the-data">Reading the data</h3>
<p>First, we read the data into a pandas data frame:</p>
<pre><code>import pandas as pd

data = pd.read_csv('car.data.txt', names=['Buying', 'Maint', 'doors', 'persons', 'lug-boot', 'safety', 'target'])
data.head()
</code></pre>
<p>I named the columns because my txt file has no headers; if you have the kaggle version of the dataset, the headers are included and there is no need for the names parameter.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916897161/hNCJxLQUhi.png" alt /></p>
<p>this is how the data looks in the txt file</p>
<p>Reading the data into pandas dataframe will present them in tabular format:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916898461/6TgDywVVK.png" alt /></p>
<p>looks better in tables</p>
<h3 id="heading-pre-processing">Pre-processing</h3>
<p>As you can observe from the table, most of the values are categorical, and machine learning models expect numeric input for training. So you have to transform each column into numerically encoded values. There are multiple ways to do that with scikit-learn methods, but here, purely for demo purposes, I will code my own simple way, as follows:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916899564/N6XjkARda.png" alt /></p>
<p>I have created one for each column</p>
<p>As you can see, the function is simple and caters for each case. This approach definitely has issues, but it makes the method clear. You could instead use one-hot encoding or the LabelEncoder from scikit-learn, which achieve the same goal. We use the apply method to run <em>Trans_Buying</em>, and you will have to call apply on each column to transform all of them.</p>
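<p>Since the function above is only shown as an image, here is a hypothetical reconstruction of what such a mapper might look like (the buying column in this dataset takes the values vhigh, high, med, and low; the exact numeric assignments in the original may differ):</p>
<pre><code># Hypothetical reconstruction of the mapping function from the image above;
# the actual numeric assignments in the original may differ.
def Trans_Buying(value):
    mapping = {'vhigh': 3, 'high': 2, 'med': 1, 'low': 0}
    return mapping[value]
</code></pre>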
<pre><code>data['Buying'] = data['Buying'].apply(Trans_Buying)
</code></pre>
<p>Once all the columns are transformed into numeric data, we just need to split the data into training and testing sets.</p>
<pre><code># let's split data into training and testing
from sklearn.model_selection import train_test_split

feature_columns = ['Buying', 'Maint', 'doors', 'persons', 'lug-boot', 'safety']

labels = data["target"].values
features = data[list(feature_columns)].values

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)
</code></pre>
<p>Now that your data is ready to be fed into an algorithm for training, you can try multiple algorithms; the one with the better performance can then be used for deployment.</p>
<h3 id="heading-random-forest-classifier">Random Forest Classifier</h3>
<pre><code>from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(random_state=101)
RFC.fit(X_train, y_train)

print("Score on the training set is: {:.2}".format(RFC.score(X_train, y_train)))
print("Score on the test set is: {:.2}".format(RFC.score(X_test, y_test)))
</code></pre>
<h3 id="heading-decision-tree-classifier">Decision Tree Classifier</h3>
<pre><code>from sklearn import tree

DTC = tree.DecisionTreeClassifier(criterion='entropy')
DTC.fit(X_train, y_train)

print("Score on the training set is: {:.2}".format(DTC.score(X_train, y_train)))
print("Score on the test set is: {:.2}".format(DTC.score(X_test, y_test)))
</code></pre>
<h3 id="heading-exporting-the-model">Exporting the model</h3>
<p>Now, after you have trained (supposedly) multiple algorithms, done your performance analysis, and decided which model to take to production, we first need to serialize the model, i.e. export it into a pickle file.</p>
<pre><code>from sklearn.externals import joblib

# Create persistent model
model_filename = 'carDTC.pkl'
print("Model exported to {}...".format(model_filename))
joblib.dump(DTC, model_filename)
</code></pre>
<p>At this stage you are done with the machine learning workflow; next we need to create the Flask API. Flask is a Python web development framework. We will use it to create an API for our model and deploy it locally.</p>
<h3 id="heading-flask-api">Flask API</h3>
<p>Before you start this phase, please make sure you have installed the Flask library. Then create a file in your favourite text editor and name it <em>filename.py</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916900865/csn4JzeSN.jpeg" alt /></p>
<p>I use <a target="_blank" href="https://www.sublimetext.com/">sublime</a> and sometimes <a target="_blank" href="https://atom.io/">Atom</a>; feel free to use whatever you are comfortable with. Next, we create our Flask app, import some important libraries, and load the serialised model we created earlier.</p>
<pre><code># importing the libraries
from flask import Flask, request, jsonify
from sklearn.externals import joblib

# Creating our Flask app
app = Flask(__name__)

# Load the model
MODEL = joblib.load('carDTC.pkl')
MODEL_LABELS = ['unacc', 'acc', 'vgood', 'good']
</code></pre>
<p>I created a list of my classes because I will need them later, when the model presents its prediction. In the next step we create the endpoint with the predict function, which is where we receive the inputs and make the predictions.</p>
<pre><code>@app.route('/predict')
def predict():
    # Retrieve query parameters related to this request.
    Buying = request.args.get('Buying')
    Maint = request.args.get('Maint')
    doors = request.args.get('doors')
    persons = request.args.get('persons')
    lug_boot = request.args.get('lug_boot')
    safety = request.args.get('safety')

    # Our model expects a list of records
    features = [[Buying, Maint, doors, persons, lug_boot, safety]]

    # predict the class of the record
    label_index = MODEL.predict(features)

    # get the probabilities list for the prediction
    label_conf = MODEL.predict_proba(features)

    # list down each class with its probability value
    probs = ' Unacceptable = {}, Acceptable = {}, Very Good = {}, Good = {}'.format(label_conf[0][0], label_conf[0][1], label_conf[0][2], label_conf[0][3])

    # Retrieve the name of the predicted class
    label = MODEL_LABELS[label_index[0]]

    # Create a JSON response and send it
    return jsonify(status='complete', label=label, confidence=''.join(str(label_conf)), probabilities=probs)
</code></pre>
<p>Flask handles the input data from the request; we just need to put those pieces of data into a list with the same shape the model expects. We feed that list to the model and get the classification result in <strong><em>label_index</em></strong>, which we use to look up the class name in <strong><em>MODEL_LABELS</em></strong>. As an added step, I used <strong><em>predict_proba</em></strong>, a method that retrieves the model's confidence in each class. In the return statement we put everything together with <strong><em>jsonify</em></strong> to send the response to the API caller.</p>
<p>Congratulations, you have a working API built in Flask; just add the main function.</p>
<pre><code>if __name__ == '__main__':
    app.run(debug=True)
</code></pre>
<p>Now save the script and run it from your terminal:</p>
<pre><code>python filename.py
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916902428/aNGpwP5S2.png" alt /></p>
<p>your app is running in the address http://127.0.0.1:5000</p>
<p>Now to test your API you can make a call from your browser as follows:</p>
<p><a target="_blank" href="http://127.0.0.1:5000/predict?Buying=1&amp;Maint=1&amp;doors=1&amp;persons=1&amp;lug_boot=1&amp;safety=1">http://127.0.0.1:5000/predict?Buying=1&amp;Maint=1&amp;doors=1&amp;persons=1&amp;lug_boot=1&amp;safety=1</a></p>
<p>You can see how we list the features with their values. Order is very important here, as the model expects the features in the same order as in the training dataset.</p>
<p>You can also make a call from your jupyter notebook:</p>
<pre><code>import requests

response = requests.get('http://127.0.0.1:5000/predict?Buying=1&amp;Maint=1&amp;doors=1&amp;persons=1&amp;lug_boot=1&amp;safety=1')
print(response.text)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1656916903813/WdZ-fJbrA.png" alt /></p>
<p>Response from the API</p>
<p>Now you have a simple API running smoothly. Scaling it to serve many users is another task, but before that there are several aspects you should look at, like error handling, and later versioning and CI/CD, which leverage software engineering practice. You may also want to serve this API from serverless Lambda, or whatever suits your needs and application.</p>
<p>The full source code with the Jupyter notebook can be found here: <a target="_blank" href="https://github.com/farisology/Deploy-ML-as-Flask-API">Car Condition Evaluation</a>. I remember going through similar code a while ago from someone on towardsdatascience, but I don't recall the exact page, so thanks to that person as well.</p>
]]></content:encoded></item></channel></rss>