Basics of Word Embedding
#

What are Context, Target and Window?
#

  • The “context” words are the words surrounding the target.
  • The “target” word is the middle word.
  • The “window distance” is the number of context words taken on each side of the target word. A window distance of 1 means one word on either side of the target: one left context word and one right context word. A window distance of 2 means two words on the left and two words on the right.

Let’s take a sentence:

The quick brown fox jump over a lazy dog.

R = Right, L = Left

| target | context (1 window) | context (2 window) |
|---|---|---|
| the | quick (R) | quick (R), brown (R) |
| quick | the (L), brown (R) | the (L), brown (R), fox (R) |
| brown | quick (L), fox (R) | the (L), quick (L), fox (R), jump (R) |
| fox | brown (L), jump (R) | quick (L), brown (L), jump (R), over (R) |

When creating the training dataset you don’t put multiple context words in one row; you create multiple rows, one (target, context) pair per row, as below.

| target | context (2 window) |
|---|---|
| the | quick |
| the | brown |
| quick | the |
| quick | brown |
| quick | fox |
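
As a quick illustration, here is a minimal Python sketch (not from the original post) that generates such (target, context) pairs for a given window size; the function and variable names are just for illustration.

def generate_pairs(sentence, window=2):
    """Generate (target, context) training pairs for a given window size."""
    tokens = sentence.lower().split()
    pairs = []
    for i, target in enumerate(tokens):
        # Look `window` words to the left and right of the target
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((target, tokens[j]))
    return pairs

pairs = generate_pairs("The quick brown fox jump over a lazy dog", window=2)
print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]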

What is Skipgram?
#

Skipgram: Using the target word, we want to predict the context/surrounding words. From the above example, this means predicting “quick”, “brown”, “the”, etc. from the target words “the” and “quick”.

What is CBOW (Continuous Bag of Words)?
#

CBOW: Using the context words, we want to predict the target word. From the above example, this means predicting “the” from the context “quick”, or predicting “quick” from the context words “the”, “brown”, “fox”.

How does CBOW work?
#

Both CBOW and Skipgram networks work in the same way, as described below. The only difference is that with CBOW we predict the target word from the context words, while with Skipgram we predict the context words from the target word.

Finalize the corpus (Step 1)
#

In reality the corpus is extremely large; think of the entire Wikipedia, Stack Overflow, or Quora text. For illustrating skipgram we take a small example.

Corpus: The quick brown fox jump over the dog

Create Skipgram (Step 2+3+4 )
#

As discussed earlier, create skipgram pairs with a window of 1, 2, or 3 from the corpus.

| Word | (Step 3) One-hot encoding for each word in the corpus | (Step 4) Random initial embedding, 4-dimensional |
|---|---|---|
| the | [1,0,0,0,0,0,0,0] | [0.11, 0.12, 0.14, 0.15] |
| quick | [0,1,0,0,0,0,0,0] | [0.21, 0.23, 0.24, 0.26] |
| brown | [0,0,1,0,0,0,0,0] | [0.31, 0.34, 0.36, 0.38] |
| fox | [0,0,0,1,0,0,0,0] | [0.51, 0.12, 0.14, 0.15] |
| jump | [0,0,0,0,1,0,0,0] | [0.21, 0.63, 0.24, 0.26] |
| over | [0,0,0,0,0,1,0,0] | [0.31, 0.34, 0.86, 0.38] |
| the | [0,0,0,0,0,0,1,0] | [0.71, 0.12, 0.14, 0.15] |
| dog | [0,0,0,0,0,0,0,1] | [0.21, 0.93, 0.24, 0.26] |
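
A small numpy sketch of Steps 3 and 4 (variable names are illustrative, not from the original post). Following the table, each token position gets its own one-hot row, even for the repeated word “the”.

import numpy as np

corpus = "The quick brown fox jump over the dog"
tokens = corpus.lower().split()

# Step 3: one-hot encoding, one row per token position (8 tokens here)
one_hot = np.eye(len(tokens))

# Step 4: random initial embedding, 4 dimensions per token
rng = np.random.default_rng(0)
embeddings = rng.uniform(0.1, 0.9, size=(len(tokens), 4))

print(one_hot[0])      # one-hot vector for the first "the"
print(embeddings[0])   # its random 4-dimensional embedding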

Create Neural Network (Step 5)
#

Create a neural network for learning the embeddings.

  • One input layer which accepts tokens/words. Convert the tokens (context and target words) into one-hot encodings.
  • One embedding layer; for the sake of the example we take a 4-dimensional embedding for each word. These embeddings are randomly initialized numbers (there are other initialization schemes as well).
  • One dense layer of 5 neurons (for the example).
  • A softmax function.
  • An output layer to predict the probability of the predicted word. If the vocabulary size of the corpus is 10,000 words, then softmax will output 10,000 probabilities.
  • Loss function: cross-entropy loss, $$L = - \sum_{i=1}^{N} y_{i} \cdot \log(p_{i})$$, where N is the vocabulary size, $y_i$ is 1 for the true word and 0 otherwise, and $p_i$ is the predicted probability.
  • The 4 numbers from the embedding go to each of the 5 neurons, so each neuron has 4 weights to the embedding layer: 5 × 4 = 20 weights are learned, plus 5 biases.
  • Learning rate LR = 0.0002 (a minimal numpy sketch of this network follows this list).
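
To make this concrete, here is a minimal numpy sketch of such a network and one forward pass (embedding size 4, hidden layer of 5 neurons, softmax over a vocabulary of 8). All names are illustrative, and the random numbers will not match the worked example below.

import numpy as np

vocab_size, embed_dim, hidden_size = 8, 4, 5
rng = np.random.default_rng(0)

# Randomly initialized parameters (Step 5)
embedding = rng.uniform(0.1, 0.9, (vocab_size, embed_dim))  # embedding layer
W1 = rng.normal(0, 0.1, (embed_dim, hidden_size))           # 4 x 5 = 20 weights
b1 = np.zeros(hidden_size)                                  # 5 biases
W2 = rng.normal(0, 0.1, (hidden_size, vocab_size))          # output layer over the vocabulary
b2 = np.zeros(vocab_size)

def forward(word_id):
    x = embedding[word_id]                          # look up the 4-dim embedding
    h = x @ W1 + b1                                 # dense layer, 5 neurons
    logits = h @ W2 + b2                            # one score per vocabulary word
    return np.exp(logits) / np.exp(logits).sum()    # softmax probabilities

def cross_entropy(probs, true_id):
    return -np.log(probs[true_id])                  # L = -sum(y_i * log(p_i))

# Skipgram example: given the target "the" (id 0), predict the context "quick" (id 1)
probs = forward(0)
loss = cross_entropy(probs, 1)
print(probs, loss)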

Training - Forward propagation (Step 6+7+8+9+10)
#

  • Randomly initialize all the weights and biases of the network.
  • Pass the target and context words to the network.
| Input layer (Step 6) | Embedding layer | Hidden layer: 5 neurons, dense, random init w&b | Matmul of weights and input embedding (Step 7) | Softmax (Step 8) | Actual one-hot value for "quick" | Cross-entropy loss (Step 9) |
|---|---|---|---|---|---|---|
| The (context), quick (target) | context (The) = [.11, .12, .14, .15] | n1 = [.11, .12, .13, .14] | 0.0657 | 0.1867 | 0 | 0.0897 |
| | target (quick) = [.21, .23, .24, .26] | n2 = [.13, .14, .15, .16] | 0.0761 | 0.1886 | 1 | 0.7244 |
| | | n3 = [.21, .22, .23, .24] | 0.1177 | 0.1966 | 0 | 0.0951 |
| | | n4 = [.32, .33, .34, .35] | 0.1749 | 0.2082 | 0 | 0.1014 |
| | | n5 = [.42, .43, .45, .46] | 0.2298 | 0.2199 | 0 | 0.1079 |

Step 10: Total loss = 1.1185

Training - Backward propagation (Step 11+12)
#

Updating the weights of the network's neurons:

| Step 11: Gradient calculation for the 20 weights | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| dL/dw1 | 10.17 | 9.32 | 8.60 | 7.99 |
| dL/dw2 | 8.60 | 7.99 | 7.46 | 6.99 |
| dL/dw3 | 5.33 | 5.08 | 4.66 | 4.66 |
| dL/dw4 | 3.50 | 3.39 | 3.20 | 3.20 |
| dL/dw5 | 2.66 | 2.60 | 2.43 | 2.43 |

| Step 12: Updated weights | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| new w1 | 0.11 | 0.12 | 0.13 | 0.14 |
| new w2 | 0.13 | 0.14 | 0.15 | 0.16 |
| new w3 | 0.21 | 0.22 | 0.23 | 0.24 |
| new w4 | 0.32 | 0.33 | 0.34 | 0.35 |
| new w5 | 0.42 | 0.43 | 0.45 | 0.46 |

new weight = old weight - learning rate × dL/dw. With LR = 0.0002 the changes are tiny, which is why the updated weights look unchanged at two decimal places.
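
As a one-line numpy illustration of this update, using w1 and dL/dw1 from the tables above (with the LR from Step 5):

import numpy as np

lr = 0.0002
old_w = np.array([0.11, 0.12, 0.13, 0.14])     # current weights of neuron n1
grad  = np.array([10.17, 9.32, 8.60, 7.99])    # dL/dw1 from Step 11

new_w = old_w - lr * grad                       # gradient descent step
print(new_w)   # ~[0.1080, 0.1181, 0.1283, 0.1384]; barely changed because LR is tiny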

Update Embedding (Step 13+14 )
#

| Old embedding (vector) | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| context (The) = [.11, .12, .14, .15] | 0.110 | 0.120 | 0.140 | 0.150 |
| target (quick) = [.21, .23, .24, .26] | 0.210 | 0.230 | 0.240 | 0.260 |

| Step 13: Gradient of old embedding | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| dL/d(context) | 10.17 | 9.32 | 7.99 | 7.46 |
| dL/d(target) | 5.33 | 4.86 | 4.66 | 4.30 |

| Step 14: Updated embedding | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| context (The) | 0.108 | 0.118 | 0.138 | 0.149 |
| target (quick) | 0.209 | 0.229 | 0.239 | 0.259 |

Complete the Training
#

  • Perform training (forward and backward propagation) in batches, multiple word pairs at a time.
  • Each time, update the weights & biases and also update the embeddings.
  • The trained embeddings can be reused later without repeating these training steps.
  • Let the entire dataset of word pairs go through this network. Once it has gone through, that is called one epoch.
  • Let the embeddings get updated over multiple epochs, say 50 or 100. More epochs generally give better embeddings, but cost more compute and money.
  • Higher-dimensional vectors give a richer representation but cost more computation and more money (a gensim sketch of this whole training process follows this list).
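
In practice these steps are rarely coded by hand; a library such as gensim can train Skipgram or CBOW embeddings directly. A minimal sketch (parameter values are illustrative, assuming gensim 4.x):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jump", "over", "the", "dog"]]

# sg=1 trains Skipgram, sg=0 trains CBOW
model = Word2Vec(sentences, vector_size=4, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["quick"])                 # the learned 4-dimensional embedding
print(model.wv.most_similar("quick"))    # nearest words by cosine similarity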

Other Methods of Embedding
#

TF-IDF
#

TF-IDF (Term Frequency - Inverse Document Frequency) is an old, traditional, frequency-based text embedding technique. It is not based on a neural network architecture, so it does not need expensive hardware to create or use TF-IDF embeddings. Unlike skipgram or CBOW it is frequency based rather than prediction based, so capturing the semantics of the text is not possible with TF-IDF. There is no notion of pretrained embeddings: every time we have a corpus we need to create embeddings for that corpus, and they are used only for it. We cannot use TF-IDF embeddings created from news text for something else, say history or entertainment. Thus embedding transfer is meaningless, but task transfer can be done: TF-IDF embeddings used for classification can also be used for other tasks like topic modelling, sentiment analysis, etc. Obviously there is a limit; we cannot use them for tasks like translation or summarization.

How does TF-IDF work?
#

  • Term frequency (TF): The number of times a word appears in a document.
  • Inverse document frequency (IDF): The logarithm of the number of documents in the collection divided by the number of documents that contain the word.
  • The TF-IDF score for a word in a document is calculated as follows:
  • TF-IDF = TF * IDF (The higher the TF-IDF score, the more important the word is to the document.)
Document_1: "The quick brown fox jumps over the lazy dog."
Document_2: "The dog is lazy, but the fox is quick."

# Term frequency for the word "quick" in Document 1
TF(quick, Document_1) = 1

# Inverse document frequency for the word "quick"
# ("quick" appears in both of the 2 documents)
IDF(quick) = log(2 / 2) = 0

# TF-IDF score for the word "quick" in Document 1
TF-IDF(quick, Document_1) = TF(quick, Document_1) * IDF(quick) = 1 * 0 = 0

# Term frequency for the word "quick" in Document 2
TF(quick, Document_2) = 1

# TF-IDF score for the word "quick" in Document 2
TF-IDF(quick, Document_2) = TF(quick, Document_2) * IDF(quick) = 1 * 0 = 0

# Term frequency for the word "lazy" in Document 1
TF(lazy, Document_1) = 1

# Inverse document frequency for the word "lazy"
# ("lazy" also appears in both documents)
IDF(lazy) = log(2 / 2) = 0

# TF-IDF score for the word "lazy" in Document 1
TF-IDF(lazy, Document_1) = TF(lazy, Document_1) * IDF(lazy) = 1 * 0 = 0

# Term frequency for the word "lazy" in Document 2
TF(lazy, Document_2) = 1

# TF-IDF score for the word "lazy" in Document 2
TF-IDF(lazy, Document_2) = TF(lazy, Document_2) * IDF(lazy) = 1 * 0 = 0

# By contrast, "brown" appears only in Document 1, so it gets a non-zero score
IDF(brown) = log(2 / 1) = log(2) ≈ 0.30   (base-10 log)
TF-IDF(brown, Document_1) = 1 * 0.30 = 0.30
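
In practice, TF-IDF vectors are usually computed with a library such as scikit-learn. A minimal sketch using the two documents above; note that scikit-learn applies a smoothed IDF and L2 normalization, so its scores differ from the hand calculation (common words do not go all the way to zero).

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog is lazy, but the fox is quick.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)   # shape: (2 documents, vocabulary size)

# TF-IDF score of each word in each document
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())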

GloVe (Global Vectors)
#

GloVe is a method that learns word embeddings from global word-word co-occurrence statistics. It is similar to Skipgram and CBOW, but it is better at capturing long-range semantic relationships between words. GloVe embedding is good for text classification, and machine translation (MT).

How does GloVe embedding work?
#

  • Tokenize the corpus: Split the corpus into individual words and punctuation marks.
  • Count word co-occurrences: For each word in the vocabulary, count how many times it co-occurs with other words in a given window size.
  • Build a word-word co-occurrence matrix: The word-word co-occurrence matrix is a square matrix, where each row and column represents a word in the vocabulary. The value at each cell in the matrix represents the number of times the two corresponding words co-occur in the corpus.
  • Factorize the word-word co-occurrence matrix: Factorize the word-word co-occurrence matrix into two lower-dimensional matrices, one for word embeddings (relationship between words) and one for context embeddings (relationship between words in the context). We can factorize the word-word co-occurrence matrix using a variety of matrix factorization techniques, such as singular value decomposition (SVD) or nonnegative matrix factorization (NMF).
  • Normalize the word embeddings: Normalize the word embeddings so that they have a unit length. We can normalize the word embeddings by dividing each embedding by its L2 norm. This will ensure that all of the embeddings have a unit length.
import numpy as np

# Load and tokenize the corpus
corpus = open("corpus.txt", "r").read()
tokens = corpus.lower().split()

# Build the vocabulary
vocab = sorted(set(tokens))
word2id = {w: i for i, w in enumerate(vocab)}

# Count word co-occurrences within a symmetric window
window = 2
co_occurrences = np.zeros((len(vocab), len(vocab)))
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            co_occurrences[word2id[word], word2id[tokens[j]]] += 1

# Factorize the co-occurrence matrix with SVD (a simple stand-in for
# GloVe's weighted least-squares factorization)
U, S, Vt = np.linalg.svd(np.log1p(co_occurrences))
dim = min(100, len(vocab))
word_embeddings = U[:, :dim] * S[:dim]

# Normalize the word embeddings to unit length
word_embeddings /= np.linalg.norm(word_embeddings, axis=1, keepdims=True) + 1e-9

# Save the word embeddings
np.savetxt("glove_embeddings.txt", word_embeddings)

How does SVD (Singular Value Decomposition) work?
#

import numpy as np

# Create a word-word co-occurrence matrix
word_co_occurrences = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [3, 4, 1]
])

# Perform SVD: A = U * diag(S) * Vh
U, S, Vh = np.linalg.svd(word_co_occurrences)

# Keep only the 2 largest singular values (rank-2 approximation)
S_truncated = S[:2]

# Reconstruct an approximation of the word-word co-occurrence matrix
word_co_occurrences_reconstructed = U[:, :2] @ np.diag(S_truncated) @ Vh[:2, :]

# Print the reconstructed word-word co-occurrence matrix
# (a 3x3 matrix close to the original one)
print(word_co_occurrences_reconstructed)

# The rows of U[:, :2], scaled by the singular values, can be used as
# 2-dimensional word embeddings
word_embeddings = U[:, :2] * S_truncated

BERT (Bidirectional Encoder Representations from Transformers)
#

BERT is a transformer-based language model that can learn word embeddings from unlabeled text; we do not need to create skipgram pairs. BERT embeddings are particularly good at capturing contextual information. BERT embeddings work well for machine translation (MT), question answering (QA), and classification tasks.

How does BERT create embeddings?
#

  • Tokenization: The first step is to tokenize the sentence into words. This means splitting the sentence into individual words, including punctuation marks. The tokenized sentence is then represented as a sequence of integers (we create ids), where each integer represents a word in the vocabulary.
  • Word embedding lookup: BERT uses a pre-trained word embedding table to convert each word in the sequence into a vector of numbers. This vector represents the meaning of the word in a distributed manner.
  • Segment embedding lookup: BERT also uses a segment embedding table to encode which segment (sentence A or sentence B) each token belongs to. This matters when BERT takes a pair of sentences as input, for example in question answering or next-sentence prediction.
  • Positional embedding lookup: BERT also uses a positional embedding table to encode the absolute position of each word in the sentence. This is necessary because BERT needs to know the order of the words in the sentence in order to learn meaningful embeddings.
  • Transformer encoding: The encoded sequence of word embeddings, segment embeddings, and positional embeddings is then passed to the transformer encoder. The transformer encoder is a neural network architecture that learns long-range dependencies between words in a sentence.
  • Output embedding: The output of the transformer encoder is a sequence of vectors, where each vector represents the embedding of the corresponding word in the sentence. These embeddings are then used for downstream natural language processing tasks, such as machine translation, text classification, and question answering.
# A runnable sketch using the Hugging Face transformers library; the model
# handles the word, segment, and positional embedding lookups internally
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize the sentence; the tokenizer returns token ids, segment (token type)
# ids, and an attention mask
sentence = "The quick brown fox jump over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

# Encode the sentence: word + segment + positional embeddings are combined and
# passed through the transformer encoder
with torch.no_grad():
    outputs = bert_model(**inputs)

# Output embeddings: one contextual vector per token (768 dimensions for BERT-base)
output_embeddings = outputs.last_hidden_state

FastText (Fast Text)
#

FastText is a modification of Skipgram that learns embeddings for words and their subwords (character n-grams). This makes it better at representing rare words and out-of-vocabulary words. FastText is good for named-entity recognition (NER) and question answering (QA) tasks.
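
A minimal gensim sketch of this idea (parameter values are illustrative): because FastText builds vectors from character n-grams, it can return an embedding even for a word it never saw during training.

from gensim.models import FastText

sentences = [["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

# "foxes" never appears in the corpus, but its character n-grams overlap with
# those of "fox", so FastText can still produce a vector for it
print(model.wv["foxes"])
print(model.wv.similarity("fox", "foxes"))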

ELMo (Embeddings from Language Models)
#

ELMo (Embeddings from Language Models) is a deep contextual word embedding technique that uses a bidirectional language model (biLM) to learn word representations. A biLM is a type of neural network that learns to predict the next word in a sentence as well as the previous word. Unlike a unidirectional language model, which predicts only the next word, a biLM is bidirectional: from a given position it can predict both the next and the previous words.

Resources
#

If you want to work through all the skipgram/CBOW calculations in Excel, you can use this calculation sheet.

