Embeddings and The Age of Transformers

What are embeddings and where to use them?

The prime requirement for most machine learning algorithms is numeric data: every input that represents a real-world entity has to be translated into numeric values. Applied machine learning is rooted in feature engineering techniques that aim to produce the best results. This has made domain knowledge significantly important, because crafting new ways to represent real-world entities is best done with domain expertise. Lately, the emergence of transformers such as BERT has been transformative for natural language processing (NLP) and natural language understanding (NLU).

With images, numerical representation can be as crude as using pixel intensity values. Other types of data, such as tabular or textual data, have their own ways of making this transformation. Here, I will speak in the context of natural language understanding (NLU) and how the field moved from lexicon-level representation towards semantic representation.


Embeddings are a way to represent text numerically. We do this text-to-numeric transformation because, again, machine learning requires it. Here is an oversimplified example of representing text for machine learning:

text1 = "I love bananas"
text2 = "I love apples"
text3 = "bananas and apples"

embedding1 = [0, 1, 2]
embedding2 = [0, 1, 3]
embedding3 = [2, 4, 3]

As you can see, we simply denoted each unique word with a single numeric value. As a result, each statement is now represented as a simple low-dimensional vector. That vector can be processed by machine learning algorithms to perform various tasks, be it text classification or others. Some words are repeated across the sentences on purpose, to show that each unique word always gets the same unique value.
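The mapping above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names build_vocabulary and encode are my own for illustration, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
def build_vocabulary(texts):
    """Give each unique word the next free integer id, in order of first appearance."""
    vocabulary = {}
    for text in texts:
        for word in text.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(text, vocabulary):
    """Replace every word in the text with its integer id."""
    return [vocabulary[word] for word in text.split()]


texts = ["I love bananas", "I love apples", "bananas and apples"]
vocabulary = build_vocabulary(texts)
encoded = [encode(text, vocabulary) for text in texts]

for text, vector in zip(texts, encoded):
    print(text, "->", vector)
```

Running it prints the same three vectors as above, because ids are handed out in the order the words are first seen.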

Bag of words

The bag of words technique builds a bag-like matrix where every column corresponds to a unique word and each row represents one sentence. Each unique term/word is represented by 0/1, denoting its absence or presence in the sentence. If we use the above example, we get something that looks like this:

    I  love  bananas  apples  and
[ [ 1,   1,     1,       0,     0 ],
  [ 1,   1,     0,       1,     0 ],
  [ 0,   0,     1,       1,     1 ] ]
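As a rough sketch, the matrix above can be built in plain Python. The column order is fixed by hand to match the table, and the function name bag_of_words is just an illustrative choice, not part of any library.

```python
sentences = ["I love bananas", "I love apples", "bananas and apples"]

# Column order chosen by hand so the output lines up with the table above.
columns = ["I", "love", "bananas", "apples", "and"]


def bag_of_words(sentence, columns):
    """Return one row of 0/1 flags: 1 if the column's word occurs in the sentence."""
    words = set(sentence.split())
    return [1 if column in words else 0 for column in columns]


matrix = [bag_of_words(sentence, columns) for sentence in sentences]

for row in matrix:
    print(row)
```

Note that the row only records which words are present; word order is lost, which is exactly the "bag" in bag of words.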


Term frequency-inverse document frequency (TF-IDF) enhances what we get from the bag of words. Instead of zeros and ones that indicate the presence or absence of a word in a sentence, the values weigh the importance of the word in the documents. Think of it as the weight of a word; it comprises two parts.

  • Term frequency:

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

This measures how frequently a term or word appears in a document. Documents are of varying lengths and in long documents, a term might occur more often. So dividing this by the total terms in the document is a normalization step. At this stage, all terms are considered equally important.

  • Inverse document frequency:

    IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

Certain terms like "the" and "is" occur very frequently, but we know they are not important. Computing IDF lets us measure the importance of a term by scaling down the weight of terms that appear in many documents and scaling up the weight of rare ones.

Assume we have a document of 100 words wherein the word fish appears 3 times. The term frequency (i.e., tf) for fish is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word fish appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4. Note that this example uses a base-10 logarithm; with the natural logarithm the value would be about 9.21, and the choice of base only rescales the weights. Thus, the tf-idf weight is the product of these quantities: 0.03 × 4 = 0.12. (Example from tf-idf.)
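The fish example can be checked with a short script. This is a minimal sketch using only the standard library; the function names are my own, and the log parameter is an assumption I added so that the base-10 and natural-logarithm variants can be compared.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: how often the term occurs, normalised by document length."""
    return term_count / total_terms


def inverse_document_frequency(total_documents, documents_with_term, log=math.log10):
    """IDF: log of (all documents / documents containing the term)."""
    return log(total_documents / documents_with_term)


tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
tf_idf = tf * idf

# Same quantity with the natural logarithm, for comparison.
idf_natural = inverse_document_frequency(10_000_000, 1_000, log=math.log)

print("tf =", round(tf, 4))
print("idf (base 10) =", round(idf, 4))
print("tf-idf =", round(tf_idf, 4))
print("idf (natural log) =", round(idf_natural, 4))
```

Whichever base you pick, use it consistently across all terms so the weights stay comparable.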

This change in the way we represent sentences brought substantial performance improvements. These two methods are basic, and they are best thought of as approaches to represent text digitally. The field then moved on to the early versions of text embeddings with word2vec, sentence2vec, and doc2vec. Word2vec generates dense embeddings from words. Under the hood, words are encoded into one-hot vectors and forwarded to a hidden neural layer; the weights learned by that hidden layer are then used as the embeddings.
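To make the "hidden weights as embeddings" idea concrete, here is a small sketch in plain Python. The weight values are made up for illustration (a trained word2vec model would learn them), and the three-dimensional size is an arbitrary assumption. It only shows why multiplying a one-hot vector by the hidden-layer weight matrix amounts to picking out one row of that matrix; it does not train anything.

```python
vocabulary = ["I", "love", "bananas", "apples", "and"]

# Hypothetical hidden-layer weights: one row per word, one column per
# embedding dimension. Real values would come out of training.
hidden_weights = [
    [0.10, 0.30, -0.20],   # I
    [0.40, -0.10, 0.25],   # love
    [-0.30, 0.20, 0.50],   # bananas
    [-0.25, 0.15, 0.45],   # apples
    [0.05, 0.00, -0.10],   # and
]


def one_hot(word):
    """Vector with a single 1 at the word's position in the vocabulary."""
    return [1 if word == entry else 0 for entry in vocabulary]


def hidden_layer(one_hot_vector):
    """Multiply the one-hot row vector by the weight matrix."""
    dimensions = len(hidden_weights[0])
    return [
        sum(one_hot_vector[i] * hidden_weights[i][d] for i in range(len(vocabulary)))
        for d in range(dimensions)
    ]


embedding = hidden_layer(one_hot("bananas"))
print(one_hot("bananas"))
print(embedding)
```

Because every entry of the one-hot vector is zero except one, the product is simply the row of weights that belongs to "bananas", and that row is the word's dense embedding.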

Pretrained models

BERT stands for Bidirectional Encoder Representations from Transformers; it was developed and pre-trained by Google. Its improvements come from removing the unidirectional constraint of earlier language models, which read text in one direction only. You can refer to the original BERT paper for an in-depth understanding of its architecture.

The second breakthrough came from the Generative Pretrained Transformer (GPT). The transformer architecture, combined with unsupervised pre-training, has changed the scope of natural language understanding. This means that training task-specific models from scratch is becoming a relic of the past. But GPT and BERT were only the first wave of transformers; today we have ever more powerful and advanced models changing how we practice NLU.

I firmly believe that we have entered a new era and it calls for your attention to understand how to use pre-trained models. Allow me to show you a couple of applications or perhaps use cases in which I have leveraged these advancements in my projects.

In today's world we, as users, expect search to be better than text matching. Users implicitly expect search results to be refined and semantically sound. This is challenging without transformers, because plain text vectorisation can only tell us whether two texts share the same words. Transformers, however, bridge that gap and bring the ability to measure semantic similarity. Let's look at a simple example, which was mentioned in an old article.

Say you have two items: one is a drink and the other is a meal. We know that both fall under the food & beverages category, so they should be mathematically closer to each other than to something like a gadget. If we represented the name of each item using one-hot encoding, for example, we would get an arbitrary representation that doesn't disclose this semantic meaning. However, if we use a pre-trained model to generate embeddings for each item and then measure the cosine similarity, we can tell which sentences are similar. I have curated this simple example to illustrate how to do this using the sentence-transformers library in Python.

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences
sentences = ["chicken burger with fries",
             "coke with ice",
             "iphone 13 pro with 5k pixels"]

# Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute cosine similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)

# Collect every unique pair (i < j) together with its cosine similarity score
pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j].item()})

# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

# Print the most similar pairs first (at most ten)
for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

The output matches what our human experience suggests:

chicken burger with fries          coke with ice          Score: 0.2669
coke with ice          iphone 13 pro with 5k pixels          Score: 0.0684
chicken burger with fries          iphone 13 pro with 5k pixels          Score: 0.0613

The first two sentences are close, as our human experience suggests, and the scores confirm this. The other pairs score much lower, showing that they are less similar than the first two. Now, this can also be applied to search engines in a few ways. One way is to add ranking logic that takes the results of the search engine before they are presented to the user and sorts them by their semantic similarity score against the search query.
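Here is a minimal sketch of that re-ranking step. To keep it runnable without downloading a model, the embeddings are tiny made-up vectors standing in for what model.encode() would return; the rerank function and the toy numbers are my own illustration, not part of the sentence-transformers library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(query_embedding, results):
    """Sort (text, embedding) search results by similarity to the query, best first."""
    scored = [
        (text, cosine_similarity(query_embedding, embedding))
        for text, embedding in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors; real embeddings would come from the model.
query_embedding = [0.9, 0.1, 0.0]  # stands in for a food-related query
results = [
    ("iphone 13 pro with 5k pixels", [0.0, 0.1, 0.9]),
    ("coke with ice", [0.6, 0.5, 0.1]),
    ("chicken burger with fries", [0.8, 0.2, 0.1]),
]

ranked = rerank(query_embedding, results)
for text, score in ranked:
    print("{:.4f}  {}".format(score, text))
```

In a real setup you would encode the query and the candidate results with the same pre-trained model, then pass those vectors to a function like rerank before showing the list to the user.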

Sentiment and text classification

If you are working on a text classification project, you can leverage these pre-trained models to produce embeddings. These dense low-dimensional vectors are rich in their ability to bring not only syntax-level but also semantics-level understanding to your model. Here is some sample code to demo this:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# The sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print each sentence next to its embedding
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)

The embedding will look something like this (only the first values are shown; this model produces a 384-dimensional vector):

[-1.37173440e-02 -4.28515337e-02 -1.56286079e-02  1.40537489e-02
  3.95537838e-02  1.21796302e-01  2.94333789e-02 -3.17523852e-02
  3.54959555e-02 -7.93140158e-02  1.75878275e-02 -4.04369608e-02
  4.97259833e-02  2.54912414e-02 -7.18700588e-02  8.14968869e-02
  1.47068209e-03  4.79627326e-02 -4.50336412e-02 -9.92174968e-02
 -2.81769391e-02  6.45046309e-02  4.44670543e-02 -4.76217493e-02
 -3.52952704e-02  4.38672006e-02 -5.28565980e-02  4.33019130e-04
  1.01921476e-01  1.64072122e-02  3.26996520e-02 -3.45986746e-02
  1.21339737e-02  7.94871375e-02  4.58343467e-03  1.57778431e-02
 -9.68210120e-03  2.87625995e-02 -5.05805984e-02 -1.55793587e-02
 -2.87906975e-02 -9.62281693e-03  3.15556452e-02  2.27348991e-02
  8.71449485e-02 -3.85027565e-02 -8.84718448e-02 -8.75497889e-03
 -2.12343168e-02  2.08923966e-02 -9.02077779e-02 -5.25732227e-02
 -1.05638737e-02  2.88311075e-02 -1.61455162e-02  6.17838977e-03
 -1.23234931e-02 -1.07337432e-02  2.83353962e-02 -5.28567694e-02
 -3.58617976e-02 -5.97989261e-02 -1.09055163e-02  2.91566737e-02]

This powerful representation is revolutionary because we can extend the use of the model to more use cases without retraining from scratch, which saves time and effort. Transformers have been setting new records on NLP benchmarks for a few years, and they are more available to you than ever before. You can copy-paste any of the code examples into a Colab notebook and start playing around with them.
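To show how such embeddings can feed a classifier without retraining the transformer, here is a small sketch of a nearest-centroid classifier. This is one simple option I picked for illustration, not the only or the best one. The two-dimensional vectors and the labels are made-up stand-ins for real sentence embeddings, and the fit and predict helpers are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit(embeddings, labels):
    """Compute one centroid per label from the labelled embeddings."""
    grouped = {}
    for embedding, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(embedding)
    return {label: centroid(vectors) for label, vectors in grouped.items()}


def predict(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine_similarity(embedding, centroids[label]))


# Toy stand-ins for sentence embeddings and their sentiment labels.
train_embeddings = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["positive", "positive", "negative", "negative"]

centroids = fit(train_embeddings, train_labels)
prediction = predict([0.7, 0.2], centroids)
print(prediction)
```

With real data, train_embeddings would be the output of model.encode() on your labelled sentences, and the same pre-trained model can serve several classification tasks just by swapping the labels.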


I have been fascinated by transformers and how they changed the way I do NLP and look at problems. With this short story, I hope I have helped you understand the what, why, and where of using transformers and pre-trained models. I will be glad to hear about the projects and ideas you build with transformers.
