Introduction: What Are Word Embeddings?
Word embeddings are one of the most commonly used techniques in natural language processing (NLP). They have been widely used for NLP tasks, including sentiment analysis, topic classification, and question answering. Word embeddings are a key reason why language models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, ELMo, BERT, ALBERT, and more recently GPT-3 have advanced so rapidly.
Embedding algorithms are fast, and the representations they produce support language generation and downstream tasks with high accuracy. These representations capture contextual, semantic, and syntactic properties, as well as linear relationships between words.
Embedding is a technique for extracting patterns from text or voice sequences. But how does it do that? Let’s see. Word embeddings are produced by a type of algorithm that maps words to vectors.
We’ll look at some of the earliest neural networks used to build complex algorithms for natural language processing. Word embeddings are one of the most popular representations of document vocabulary. They can capture the context of a word in an input sentence, semantic and syntactic similarities, relations with other words, and so on.
Word embeddings allow words with similar meanings to be represented by vectors that are close together.
They are a distributed word representation that is perhaps one of deep learning’s most important breakthroughs for solving challenging natural language processing (NLP) problems.
What Are Word Embeddings?
A word embedding represents words so that words with the same meaning have similar representations.
Vector representations of words are called “word embeddings”. With that said, let’s look at how we generate them. Most importantly, how can they capture context? What techniques are used? Some pre-trained word embeddings are built from co-occurrence counts, while others come from deep learning models in which a one-hot encoded input vector passes through an intermediate fully connected hidden layer to the output layer.
Why Are Word Embeddings Used?
Since machine learning models cannot process raw text directly, we need to convert text into numerical data. TF-IDF and Bag of Words have been discussed previously as techniques for doing this. We can also represent each word in the vocabulary with a one-hot encoding or with a unique number (integer encoding). Compared with one-hot encoding, integer encoding is more efficient because it gives a dense representation instead of a sparse one, so it works even when the vocabulary is large.
One-hot encoding vs integer encoding
However, integer encoding is arbitrary: it captures no relationship between words. This can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. For this feature-weight combination to be meaningful, there must be a relationship between the similarity of two words and the similarity of their encodings, and with integer encoding there is none.
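To make the comparison concrete, here is a minimal sketch of both encodings. The toy vocabulary and the helper names `integer_encode` and `one_hot_encode` are made up for illustration and do not come from any library.

```python
# Toy vocabulary; the words and their order are arbitrary assumptions.
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def integer_encode(word):
    """Return the arbitrary integer id assigned to a word."""
    return word_to_index[word]

def one_hot_encode(word):
    """Return a sparse vector with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(integer_encode("sat"))   # 3
print(one_hot_encode("sat"))   # [0, 0, 0, 1, 0]
```

The one-hot vector grows with the vocabulary, while the integer stays compact. Neither says anything about which words are similar: "sat" being 3 and "the" being 4 does not make them related.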
In vector space, embeddings group words with similar meanings together. For a word such as frog, the nearest neighbors would be frogs, toads, and Litoria. As a result, a classifier would not be thrown off when it sees the word Litoria during testing, because the two word vectors are similar. In addition, word embeddings learn relationships between words. An analogous word can be found by adding the difference between two word vectors to a third vector.
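The analogy idea can be sketched with a handful of toy vectors. The 3-dimensional values below are hand-picked so that the arithmetic works out; they are not learned embeddings, and the `analogy` helper is a hypothetical name used only for this example.

```python
import math

# Hand-picked 3-dimensional vectors, for illustration only.
# Real embeddings are learned and have tens or hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.9, 0.7],
    "queen": [0.9, 0.1, 0.7],
    "man":   [0.1, 0.9, 0.5],
    "woman": [0.1, 0.1, 0.5],
    "boy":   [0.1, 0.9, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# "man is to king as woman is to ?"
print(analogy("man", "king", "woman"))  # queen
```

With learned embeddings the result is rarely an exact match, so the nearest remaining vector is taken as the answer, as the `max` over candidates does here.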
Deep learning has made significant progress on challenging natural language processing problems because of this method of representing words and documents.
Applied to words, embedding is the process of representing each word as a real-valued vector in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles training a neural network, which is why the technique is often grouped with deep learning. The approach relies on a dense distributed representation for each word. Each vector typically has tens or hundreds of dimensions, in contrast to the thousands or millions of dimensions required for sparse word representations such as a one-hot encoding.
The distributed representation is learned from how words are used. As a result, words used in similar ways end up with similar representations, which naturally captures their meaning. Compare this with a bag-of-words model where, unless explicitly managed, different words have different representations regardless of how they are used. The underlying idea is that words with similar contexts will have similar meanings. The one-hot vector also remains an integral part of word embedding: it identifies which word’s vector is being looked up, while the vector values themselves are learned by optimizing an objective function.
The embedding matrix is a randomly initialized matrix with dimensions N * (size of the vocabulary + 1), where N is a number we select manually and the size of the vocabulary is the number of unique words in the document. Each column of the embedding matrix represents an individual word in the document.
The embedding matrix is trained over time using gradient descent, so that it learns values that group similar words together. For example, a boy would have a low value for a “Royal” feature, whereas a king or queen would have a high one. Both the king and the boy are male, so both would have a high value for the “Male” feature.
Note that even though these features (Royal, Male, Age, etc.) appear in the picture, we do not explicitly define them. The matrix starts out randomly initialized, and gradient descent learns both the features and their values.
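Here is a minimal sketch of such a matrix before any training, using only the standard library. The sizes are made-up examples, and reserving column 0 for padding or unknown words is an assumption about why the extra column exists (it is a common convention, but the text above does not say). The sketch also shows that reading a column gives the same result as multiplying the matrix by a one-hot vector.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab_size = 8   # number of unique words (assumed)
n = 3            # embedding size N, chosen manually (assumed)

# N rows by (vocab_size + 1) columns, filled with small random values.
# Column 0 is assumed to be reserved for padding or unknown words.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(vocab_size + 1)]
    for _ in range(n)
]

def column(matrix, index):
    """Look up the embedding of a token by reading one column."""
    return [row[index] for row in matrix]

def multiply_by_one_hot(matrix, index):
    """Multiply the matrix by a one-hot vector with a 1 at `index`."""
    one_hot = [1 if j == index else 0 for j in range(len(matrix[0]))]
    return [sum(w * x for w, x in zip(row, one_hot)) for row in matrix]

# Both routes give the same vector, so the lookup is just a shortcut.
print(column(embedding_matrix, 5) == multiply_by_one_hot(embedding_matrix, 5))
```

At this point the values carry no meaning; features only emerge once gradient descent has updated the matrix on a task.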
Pre-Processing for Embedding Matrix
We know that we cannot use non-numerical data for machine learning and guess what, words are of course, non-numerical. So, let’s see how we have to convert them before the forward propagation.
There are a lot of algorithms for this:
- One Hot Encoding
- Term Frequency-Inverse Document Frequency
- Tokenization (Text to Sequence)
But, for this purpose, Tokenization is the most preferred and you will understand why in a few minutes.
Tokenization: Assigning a number to each unique word in the corpus is called tokenization.
Example: Let’s assume that we have a training set with 3 training examples: [“What is your name”, “how are you”, “where are you”]. If we tokenize this data, the result is:
What: 1, is: 2, your: 3, name: 4, how: 5, are: 6, you: 7, where: 8
- Tokenized form of the first sentence: [1, 2, 3, 4]
- Tokenized form of the second sentence: [5, 6, 7]
- Tokenized form of the third sentence: [8, 6, 7]
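The mapping above can be reproduced with a few lines of plain Python. This is only a sketch: the function names are made up, words are lowercased and split on whitespace, and numbers are assigned in order of first appearance, which is an assumption (library tokenizers often order words by frequency instead).

```python
def build_vocabulary(sentences):
    """Assign each unique word a number, starting at 1, in order of first appearance."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(sentences, word_index):
    """Replace every word in every sentence with its number."""
    return [[word_index[w] for w in s.lower().split()] for s in sentences]

training_set = ["What is your name", "how are you", "where are you"]
word_index = build_vocabulary(training_set)
print(word_index)
print(texts_to_sequences(training_set, word_index))
# [[1, 2, 3, 4], [5, 6, 7], [8, 6, 7]]
```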
Now the data is pre-processed. Let’s move on to the forward pass.
In the embedding matrix, each column represents a word in our training set. We manually pick N, which is the size of each word vector. The following example assumes a vocabulary size of 1000 and an N of 15.
Consider the following example:
Whenever we tokenize a word, we assign it a number. The tokenized representation of “The Weather is Nice” might look like this: [123, 554, 792, 205].
With a vocabulary size of 1000, the embedding matrix contains 1000 columns. When this array of tokens is passed into the neural network for the forward pass, the input [123, 554, 792, 205] selects columns 123, 554, 792, and 205 of the embedding matrix.
Each of these columns has 15 rows (N). The 4 columns are then stacked on top of each other, flattening the 4 tensors into a single tensor of size 15*4 = 60.
After being flattened, the tensor is passed to an RNN or Dense layer to generate a prediction.
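The lookup-and-flatten step can be sketched as follows. The matrix is filled with random values as a stand-in for a trained embedding, and `embed_and_flatten` is a hypothetical helper written for this example rather than a library function.

```python
import random

random.seed(0)

VOCAB_SIZE = 1000   # number of columns in the embedding matrix
N = 15              # rows per column, i.e. the size of each word vector

# N rows by VOCAB_SIZE columns of small random values (untrained stand-in).
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(VOCAB_SIZE)]
    for _ in range(N)
]

def embed_and_flatten(tokens):
    """Select one column per token and stack the columns into one flat list."""
    flat = []
    for token in tokens:
        flat.extend(row[token] for row in embedding_matrix)
    return flat

tokens = [123, 554, 792, 205]   # "The Weather is Nice"
features = embed_and_flatten(tokens)
print(len(features))  # 60, i.e. 15 * 4
```

The flat list of 60 values is what the next layer would receive as input.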
Embedding Matrix Values
The values of the embedding matrix are simply parameters that are learned over time with gradient descent, like the parameters of other supervised learning models.
The values are learned so that the cosine similarity between similar words is higher than the cosine similarity between words that are different.
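For reference, the cosine similarity between two word vectors a and b of size N is usually defined as below. The symbols a, b, and the index i are notation introduced here, not taken from the text above.

```latex
\cos(\theta)
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2} \, \sqrt{\sum_{i=1}^{N} b_i^2}}
```

The value ranges from -1 to 1. It is close to 1 when the two vectors point in nearly the same direction, regardless of their lengths.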
Word Embedding Algorithms
The word embedding method learns a real-valued vector representation for a predefined vocabulary from a corpus of text.
The learning process is either performed jointly with a neural network model on some task, such as document classification, or is an unsupervised process that uses document statistics. We can learn word embeddings from text data using three different techniques.
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as document classification or language modeling.
The document text must be cleaned and prepared so that each word is one-hot encoded. The size of the vector space is specified as part of the model, for example 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used at the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.
The one-hot encoded words are mapped to the word vectors. If a multilayer Perceptron model is used, the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, each word can be taken as one input in a sequence.
Learning an embedding layer this way requires a lot of training data and can be slow, but it learns an embedding tailored to both the specific text data and the NLP task.
Bengio’s approach gave NLP researchers an opportunity to modify the technique and architecture in order to create a method that is computationally less expensive. Why was that needed?
Bengio et al. proposed using a feed-forward neural network with an embedding layer, hidden layers, and a softmax function to learn a language model over the vocabulary.
Each word has an associated learnable vector, which is optimized through backpropagation. Because the network is shallow, its first layer yields the word embeddings.
The problem with this approach is that the computation between the projection layer and the hidden layer is expensive. There are two reasons for this:
- The values produced by the projection layer are dense.
- The hidden layer is used to compute a probability distribution over all the words in the vocabulary.
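A rough operation count gives a feel for these costs. The sketch below assumes a context of `n_context` words, an embedding size `d`, a hidden layer size `h`, and a vocabulary size `v`. All four values are made-up examples, and the count ignores biases and activation functions.

```python
def nnlm_multiplications(n_context, d, h, v):
    """Rough count of multiplications per training example in a feed-forward NNLM."""
    projection_to_hidden = n_context * d * h   # dense projection times hidden weights
    hidden_to_output = h * v                   # one score per vocabulary word
    return projection_to_hidden, hidden_to_output

proj_hidden, hidden_out = nnlm_multiplications(n_context=10, d=100, h=500, v=100_000)
print(proj_hidden)  # 500000
print(hidden_out)   # 50000000
```

Both terms scale with the size of the hidden layer, so under these assumptions every training example pays for it twice.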
In 2013, researchers (Tomas Mikolov et al.) came up with a model called ‘Word2Vec’ to address this issue.
Word2Vec addresses the issues raised by Bengio’s NLM.
It does away with the hidden layer entirely, and the projection layer is shared among all words. The disadvantage is that, when there is less data, this simpler model without a hidden layer cannot represent the data as precisely as the full neural network can.
However, with a large dataset, it can represent the data more precisely in the embedding space. Removing the hidden layer also reduces complexity, so the model can be trained on larger datasets.
Using these representations we find that they capture syntactic and semantic regularities in language, and that each relationship is defined by a relation-specific offset vector. This allows vector-based reasoning based on offsets between words.
Word2Vec is a statistical method for efficiently learning standalone word embeddings from a text corpus.
Tomas Mikolov et al. in 2013 proposed two models:
- Continuous Bag-of-Words Model
- Continuous Skip-gram Model
Continuous Bag-of-Words model (CBOW)
CBOW predicts the probability of a word occurring given the words surrounding it; the model outputs a probability distribution over the vocabulary. The context can be a single word or a group of words. For simplicity, we will take a single context word and try to predict a single target word.
By some estimates, the English language contains over a million words, so we cannot include every word in our example. Instead, I’ll consider a small example with only four words: live, home, they, and at. For simplicity, we will assume that the corpus contains only one sentence: ‘They live at home’.
First, we convert each word into one-hot encoded form. We do not consider all the words in the sentence at once, but only the words that fall inside a window. For example, with a window size of three, we consider three words of the sentence at a time. The middle word is the one to be predicted, and the two surrounding words are fed into the neural network as context. The window then slides along the sentence and the process is repeated.
Finally, after training the network repeatedly by sliding the window as shown above, we obtain the weights that we use to get the embeddings, as shown below.
Usually, we take a window size of around 8-10 words and have a vector size of 300.
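The sliding window from the ‘They live at home’ example can be sketched in a few lines. This only generates the training pairs for a window size of three; it does not train a network, and `cbow_pairs` is a hypothetical helper name.

```python
def cbow_pairs(tokens, window_size=3):
    """Slide a window over the tokens; collect (context words, middle word) pairs."""
    half = window_size // 2
    pairs = []
    for center in range(half, len(tokens) - half):
        context = tokens[center - half:center] + tokens[center + 1:center + half + 1]
        pairs.append((context, tokens[center]))
    return pairs

sentence = "they live at home".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# ['they', 'at'] -> live
# ['live', 'home'] -> at
```

In a real model each word in these pairs would be replaced by its one-hot vector before being fed to the network.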
Continuous Skip-gram Model
The Skip-gram model architecture tries to achieve the reverse of what the CBOW model does. It tries to predict the source context words (surrounding words) given a target word (the centre word).
The skip-gram model works much like CBOW; the differences are in the architecture of its neural network and in the way the weight matrix is generated, as shown in the figure below:
After obtaining the weight matrix, the steps to get the word embeddings are the same as for CBOW.
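For comparison, here is a sketch of how skip-gram training pairs could be generated for the same sentence. Each pair goes from the centre word to one surrounding word. The helper name is made up, and unlike a strict three-word window this version also emits pairs for words at the edges of the sentence, which is a design choice of the sketch.

```python
def skipgram_pairs(tokens, window_size=3):
    """For each centre word, collect one (centre word, context word) pair per neighbour."""
    half = window_size // 2
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - half)
        end = min(len(tokens), i + half + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

for center, context in skipgram_pairs("they live at home".split()):
    print(center, "->", context)
# they -> live
# live -> they
# live -> at
# at -> live
# at -> home
# home -> at
```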
So which of the two algorithms should we use to implement a word2vec model? It turns out that for a large corpus with higher dimensions, skip-gram tends to work better, but it is slow to train. CBOW is better suited to a small corpus and is also faster to train.
GloVe
The Global Vectors for Word Representation (GloVe) algorithm is an extension of the word2vec method for efficiently learning word vectors. Whether the resulting vectors are useful as features depends on what you are trying to achieve and on your use case.
Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using global text statistics but are not as good as the learned methods like word2vec at capturing meaning and demonstrating it on tasks like calculating analogies.
GloVe is an approach to marry both the global statistics of matrix factorization techniques like LSA with the local context-based learning in word2vec.
Rather than relying only on a window to define local context, GloVe constructs an explicit word-context, or word co-occurrence, matrix using statistics across the whole text corpus. The resulting learning model may produce generally better word embeddings.
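A word co-occurrence matrix of this kind can be sketched with a counter. This only builds the counts that a GloVe-style method would start from; fitting vectors to those counts is not shown. Counting pairs within a fixed distance of each other and giving every pair equal weight are simplifying assumptions of this sketch, and the toy corpus is made up.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["how are you", "where are you"]  # toy corpus, assumed for illustration
counts = cooccurrence_counts(corpus)
print(counts[("are", "you")])    # 2
print(counts[("how", "where")])  # 0
```

Because the counts are accumulated over every sentence, a pair such as ("are", "you") reflects the whole corpus rather than a single window.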
Neural Language Model
Word embeddings were proposed by Bengio et al. (2001, 2003) to tackle what’s known as the curse of dimensionality, a common problem in statistical language modelling.
Bengio’s method, known as a distributed representation of words, trained a neural network so that each training sentence informed the model about semantically neighboring words. In addition to establishing relationships between different words, the neural network preserved both semantic and syntactic relationships.
It was from this work that a neural network architecture approach was developed, which formed the foundation for many approaches used today.
This neural network has the following components:
- The embedding layer generates word embedding, and the parameters are shared among words.
- A hidden layer, consisting of one or more layers, that introduces non-linearity.
- A softmax function that produces a probability distribution over all the vocabulary words.
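The three components can be put together in a small forward pass. This is a sketch under several assumptions: the sizes are tiny made-up values, the weights are random and untrained, biases are omitted, tanh is used as the non-linearity, and embeddings are stored one per row. It is not a reproduction of the original model.

```python
import math
import random

random.seed(0)

V, D, H, CONTEXT = 6, 4, 5, 2   # vocabulary, embedding, hidden, context sizes (assumed)

def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

embeddings = random_matrix(V, D)           # one row per word, shared by all positions
w_hidden = random_matrix(H, CONTEXT * D)   # hidden layer weights
w_output = random_matrix(V, H)             # output layer weights

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next_word(context_ids):
    """Forward pass: embed, concatenate, apply the tanh hidden layer, then softmax."""
    x = [value for word_id in context_ids for value in embeddings[word_id]]
    hidden = [math.tanh(h) for h in matvec(w_hidden, x)]
    return softmax(matvec(w_output, hidden))

probabilities = predict_next_word([1, 3])
print(len(probabilities), round(sum(probabilities), 6))  # 6 1.0
```

The output has one probability per vocabulary word, which is why the final layer becomes expensive as the vocabulary grows.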
We have understood the following so far –
- The neural network language model (NNLM) or Bengio’s model outperforms earlier statistical models like the n-gram model.
- Through its distributed representation, NNLM overcomes the curse of dimensionality and preserves contextual, linguistic regularities and patterns.
- NNLM is computationally intensive.
- The Word2Vec model reduces computational complexity by removing the hidden layer and sharing the weights.
- Despite lacking a hidden layer, Word2Vec can be trained on a large number of examples and can be used to compute very accurate high-dimensional word vectors.
- CBOW and Skip-gram are the two Word2Vec models. CBOW is faster than Skip-gram.
- There is a technique in Natural Language Processing called latent Dirichlet allocation (LDA) that allows observations to be explained by unobserved “groups”.
Using Word Embeddings
There are several options for using word embeddings in your NLP projects.
Learn an Embedding
- The word embedding you choose may depend on your problem.
- For embeddings to be learned, a large amount of text data is needed, such as millions or billions of words.
- When training your word embedding, there are two main options:
  - Learn it standalone, where a model is trained to learn the embedding, which is saved and later used as part of another model for your task. This is a good approach if you want to use the same embedding in multiple models.
  - Learn it jointly, where the embedding is learned as part of a larger task-specific model. This is a good approach if you only intend to use the embedding on one task.
Reuse an Embedding
- Researchers commonly make pre-trained word vectors available for free, often under a permissive license, so that you can use them in your own academic or commercial projects.
- For example, both Word2Vec and GloVe word vectors are available for free download (as well as pre-trained models).
- You can use these pre-trained embeddings instead of training your own.
- There are two main ways to use pre-trained embeddings:
  - Static, where the embedding is kept static and is used as a component of your model. This is a suitable strategy if the embedding is well suited to your problem and gives good performance.
  - Updated, where the pre-trained embedding is used to seed the model, but it is then updated jointly during the training of the network. This may be a good option if you want to get the most out of the model and the embedding on your task.
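The difference between the two ways of using a pre-trained embedding comes down to whether the vectors receive gradient updates. The sketch below shows this for a single vector and a single gradient step. The vector, the gradient, and the `sgd_step` helper are all made-up stand-ins, not part of any library.

```python
def sgd_step(embedding, gradient, learning_rate=0.1, trainable=True):
    """Return the embedding after one gradient step; unchanged if it is frozen."""
    if not trainable:
        return list(embedding)
    return [e - learning_rate * g for e, g in zip(embedding, gradient)]

pretrained = [0.5, -0.2, 0.1]   # stand-in for a downloaded vector (made-up values)
gradient = [1.0, 1.0, -1.0]     # stand-in for a gradient from the task loss

static = sgd_step(pretrained, gradient, trainable=False)
updated = sgd_step(pretrained, gradient, trainable=True)
print(static == pretrained)   # True
print(updated != pretrained)  # True
```

Deep learning libraries usually expose the same choice as a flag on the embedding layer, so switching between the two strategies rarely requires more than one changed argument.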
Which Option Should You Use?
Consider all the options, and if possible, test them to find out which gives you the best results.
Consider using a pre-trained embedding first, and train new embeddings only if they improve performance. The distribution of the dataset is critical for good-quality results, and computational complexity should be kept in mind when deciding on an approach.
The numerical representation, the vocabulary size, and the neural network architecture all strongly affect performance. Keeping the vocabulary size in check is a popular technique, especially when the dimensionality of the embedding space, the training objective, and computational complexity are constraints.
Word embeddings are an important part of text interpretation. User data privacy and openness are very important here, and the output layer should be the only layer made visible, on a need-to-know basis.
The semantic relationships, dense representations, and parameters a model learns depend on its data, so datasets, including linked datasets, should be cleaned as much as possible for bias and language issues before they are used for training. Accurate word embeddings help you model your data better and reduce expensive computation, and they help keep your approach on the right track.
Word embeddings have opened up new avenues in NLP research and development. Although these models work well, they lack conceptual understanding.