Transformers Is All You Need
In 2017, Ashish Vaswani et al. published a paper that would change the natural language processing (NLP) scene forever, and more recently computer vision as well. The authors proposed an efficient way to solve NLP problems without using recurrent or convolutional neural networks.
The architecture they proposed, the Transformer, would later be used to build state-of-the-art language models that have since taken over the world. This architecture is the precursor of the current wave of generative language models that have become the new world assistants. To understand how we got here, we have to go back to 2017, where it all started, and look at the Transformer architecture in detail.
The Transformer Architecture
The Transformer was a game changer because it requires neither recurrence nor convolutions: it eschews them in favor of a pure attention mechanism, which makes training faster through parallelization.
The Transformer is made up of stacked self-attention and fully-connected layers in both the encoder and the decoder. In the following sections, we discuss the building blocks of the Transformer encoder and decoder as elucidated in the Attention Is All You Need paper. All the figures and formulae are from that paper.
The role of the encoder is to receive information and "encode" it into some numerical representation. The Transformer encoder is made up of 6 identical layers, each with 2 sub-layers:
A multi-head self-attention mechanism, and
A position-wise fully-connected feed-forward network.
Each of the two sub-layers has a residual connection around it, followed by layer normalization.
Before the text can be passed to the Transformer, it has to be in some numerical representation. This is achieved through tokenization, a process that converts the input sentences into tokens.
Each token is represented using a number. However, the model needs a way to infer meaning from these tokens. This is achieved using word embeddings.
A word embedding is a numerical representation of a word in a vector space, where semantically similar words are close to each other.
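To make this concrete, here is a minimal sketch of tokenization and embedding lookup in PyTorch. The toy whitespace tokenizer and hand-built vocabulary are illustrative assumptions; real systems use subword tokenizers such as byte-pair encoding:

```python
import torch
import torch.nn as nn

# Toy vocabulary and whitespace tokenizer (illustrative assumptions).
vocab = {"<unk>": 0, "the": 1, "fox": 2, "jumped": 3, "over": 4, "fence": 5}

def tokenize(sentence):
    # Map each word to its token id, falling back to <unk> for unknown words.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

token_ids = torch.tensor(tokenize("the fox jumped over the fence"))

# The embedding table maps each token id to a dense vector; after training,
# semantically similar words end up close to each other in this space.
d_model = 512  # the embedding dimension used in the paper
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([6, 512])
```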
Unlike previous architectures, the Transformer has no recurrence or convolution layers, meaning that the network would otherwise have a tough time understanding the position of words in a sentence. Failure to solve this would lead to incoherent output. To fix this, the Transformer adds positional encodings in the encoder and decoder to ensure that positional information is not lost.
The positional encodings have the same dimension as the input embeddings so that the two can be summed. The Transformer uses sine and cosine functions to add the positional information:
$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

with:
$d_{model}$ being the model's dimension,
$pos$ being the position, and
$i$ being the dimension index.
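Here is a minimal sketch of the sinusoidal positional encoding following the formulas above; the sequence length is an illustrative choice:

```python
import torch

def positional_encoding(max_len, d_model):
    # Positions 0..max_len-1 as a column, so broadcasting pairs every
    # position with every dimension.
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    # 10000^(2i/d_model) for each even dimension index 2i.
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
    denom = torch.pow(10000.0, two_i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / denom)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos / denom)  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=6, d_model=512)
# The encodings have the same shape as the embeddings, so they are
# simply summed: embedded = embedded + pe
print(pe.shape)  # torch.Size([6, 512])
```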
Scaled Dot-Product Attention
Attention is computed using a set of queries, keys, and values: each query is mapped, together with the key-value pairs, to an output that is a weighted sum of the values, where the weight assigned to each value depends on how well its key matches the query.
In matrix form, attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:
$Q$ is the query matrix,
$K$ and $V$ are the key and value matrices, and
$d_k$ is the dimension of the key vectors (64 in the paper, making the square root 8).

Dividing by the square root of $d_k$ is a scaling factor that stabilizes gradients.
The query, key, and value matrices are obtained by multiplying the input by learned query, key, and value weight matrices.
Because the input sequence can attend to itself, we get self-attention, where each word in the sequence is scored against every other word. For example, consider the sentence, "the fox could not jump over the fence because it was too high." Does "it" refer to the fox or the fence? The self-attention mechanism enables the network to decipher that "it" refers to the fence.
Passing the results through a softmax function ensures that the attention scores are positive and sum to 1, making them easy to interpret.
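Putting the formula together, here is a minimal sketch of scaled dot-product self-attention; the random input and weight matrices are illustrative stand-ins for learned parameters:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Score every query against every key, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns each row into positive weights that sum to 1.
    weights = F.softmax(scores, dim=-1)
    # The output is a weighted sum of the values.
    return weights @ V

# Self-attention: Q, K, and V all come from the same input sequence.
x = torch.randn(6, 64)  # 6 tokens, d_k = 64 as in the paper
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # torch.Size([6, 64])
```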
Running many attention layers in parallel leads to Multi-Head Attention. The results from the different attention heads are then concatenated and passed to the feed-forward layer.
Multi-head attention allows the network to understand the input words and how they relate to the other words in the entire sentence.
After concatenation, the matrix is multiplied by an output weight matrix $W^O$, leading to a single matrix that can be passed to the feed-forward network. In the paper, $h$ is 8, meaning that there are 8 parallel attention heads.
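A minimal sketch of multi-head attention with the paper's dimensions (d_model = 512, h = 8); the single-sequence layout and fused per-role projections are illustrative simplifications:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        # One fused projection per role, equivalent to h per-head projections.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # the output projection W^O

    def forward(self, x):
        seq_len = x.size(0)

        def split(t):
            # Split the last dimension into h heads of size d_k.
            return t.view(seq_len, self.h, self.d_k).transpose(0, 1)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # Scaled dot-product attention, run for all h heads in parallel.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = F.softmax(scores, dim=-1) @ V  # (h, seq_len, d_k)
        # Concatenate the heads and apply the output projection W^O.
        concat = heads.transpose(0, 1).reshape(seq_len, self.h * self.d_k)
        return self.W_o(concat)

mha = MultiHeadAttention()
print(mha(torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```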
In the Transformer, information from earlier stages is passed to later stages through skip (residual) connections.
The Transformer applies layer normalization instead of batch normalization. Generally, normalization makes training neural networks faster by keeping all the values within a certain scale. Batch normalization depends on the mini-batch to compute the mean and variance for normalizing the input. Layer normalization instead computes the mean and variance across the features of each individual example, so it does not depend on the mini-batch size.
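A minimal sketch of the Add & Norm pattern around a sub-layer, using PyTorch's built-in nn.LayerNorm; the linear layer is an illustrative stand-in for an attention or feed-forward sub-layer:

```python
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or FFN

x = torch.randn(6, d_model)
# Add & Norm: the skip connection adds the sub-layer's input to its output,
# and layer normalization rescales each token using statistics computed over
# its own features rather than over the mini-batch.
out = layer_norm(x + sublayer(x))
print(out.shape)  # torch.Size([6, 512])
```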
The encoder and decoder both have a feed-forward network with two linear projections and a ReLU activation function between them.
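A minimal sketch of the position-wise feed-forward network; the inner dimension of 2048 is the value used in the paper:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # Two linear projections with a ReLU between them, applied to
        # every position independently.
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

ffn = FeedForward()
print(ffn(torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```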
Like the encoder, the decoder is also made up of 6 identical layers, but with an extra sub-layer that performs multi-head attention over the encoder's output. It also has residual connections and layer normalization.
Masked Multi-Head Attention
The first self-attention sub-layer in the decoder uses masking to prevent positions from attending to future positions, hence the name Masked Multi-Head Attention. The masking makes the decoder auto-regressive. It is done by setting the values for future positions in the dot-product matrix to a very large negative number (effectively negative infinity) before passing it to the softmax layer, so that their attention weights become zero.
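A minimal sketch of the masking step; the tiny sequence length and random scores are illustrative:

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw query-key dot products

# Upper-triangular mask: True marks the future positions that each
# query must not attend to.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float("-inf"))

# After the softmax, the -inf entries become attention weights of 0,
# so each position attends only to itself and earlier positions.
weights = F.softmax(masked_scores, dim=-1)
print(weights)
```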
Softmax Layer - Final Output Layer
The Transformer uses a softmax in the final layer to produce next-token probabilities: it predicts, for each token in its vocabulary, the probability that it is the next token. A common pattern is greedy decoding, which returns the token with the highest probability.
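A minimal sketch of the final projection followed by greedy decoding; the vocabulary size and the random decoder state are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000  # illustrative vocabulary size
to_logits = nn.Linear(d_model, vocab_size)  # final linear projection

decoder_output = torch.randn(d_model)  # decoder state at the last position
probs = F.softmax(to_logits(decoder_output), dim=-1)

# Greedy decoding: pick the single most probable next token.
next_token = torch.argmax(probs)
print(next_token.item(), probs[next_token].item())
```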
The encoder and decoder parts of the Transformer can also be used independently: the encoder for classification, for example, and the decoder for text generation. Encoder-decoder models can be used for tasks such as translation, where a sentence in one language is passed to the encoder and the decoder outputs it in another.
BERT is an example of an encoder-only model. These models are bidirectional, meaning that information flows in both directions and each token can attend to the tokens on both sides of it. Decoder models such as GPT-2 only have access to previous words, not future ones. For example, to predict the fourth word, the model would only have access to the words in positions 1 to 3.
Check out the Annotated Transformer to learn how to implement the architecture from scratch.
The Transformer is an efficient network for modeling sequence data because:
The network can be parallelized, making training on hardware accelerators faster, because attention over all positions can be computed at the same time
The Transformer generally achieves better results than Recurrent Neural Networks on tasks such as classification
The Transformer has become dominant in the current wave of generative language models, and we can expect future models to keep building on this time-tested architecture.