There's no doubt that generative AI will change every industry. The current state of generative AI, particularly text-to-image and text generation, is the culmination of years of research. For example, current large language models (LLMs) are based on the Transformer architecture that was introduced by Google researchers in 2017. The U-Net model is a critical part of current text-to-image generation models such as Stable Diffusion.
The generative AI space has shown massive potential courtesy of highly performant models such as Stable Diffusion from Stability AI and Llama by Meta AI. In this post, the focus will be on the Llama 2 model, which was trained on trillions of tokens. Llama 2 is particularly revolutionary because of its performance and permissive license.
Llama 2 was trained on 40% more data than Llama 1. Llama 2-Chat is a version of Llama 2 fine-tuned for dialogue. The larger Llama 2 variants were trained with grouped-query attention to improve inference scalability.
Llama 2 training proceeds in three stages:
Pretraining on publicly available data
Supervised fine-tuning to create Llama 2-Chat
Further training using Reinforcement Learning with Human Feedback (RLHF)
Notable technical details about Llama 2 include the use of:
The original transformer architecture
The byte-pair encoding (BPE) algorithm for tokenization as implemented in SentencePiece
The AdamW optimizer with β1 = 0.9, β2 = 0.95, and ε = 10⁻⁵
A cosine learning rate schedule, with a warmup of 2,000 steps
Weight decay of 0.1 and gradient clipping of 1.0
Pre-normalization using RMSNorm
SwiGLU activation function
Rotary positional embeddings
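As a rough illustration of those optimizer settings, here is a minimal PyTorch sketch; the model, the peak learning rate, and the total step count are placeholders rather than values taken from the paper:

```python
import torch

# Placeholder model standing in for the Llama 2 transformer.
model = torch.nn.Linear(4096, 4096)

# AdamW with the betas, epsilon, and weight decay reported for Llama 2.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,  # assumed peak learning rate, for illustration only
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Cosine learning rate schedule with a 2,000-step warmup.
warmup_steps, total_steps = 2_000, 100_000  # total_steps is an assumption
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=1e-3, total_iters=warmup_steps
        ),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_steps - warmup_steps
        ),
    ],
    milestones=[warmup_steps],
)

# Inside the training loop, gradients are clipped to a norm of 1.0:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```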
The transformer architecture was proposed in the paper Attention Is All You Need. The paper introduced the now-popular self-attention mechanism for solving natural language problems, and the architecture has since also been adapted for computer vision.
The transformer architecture consists of the following building blocks:
An encoder with 6 identical layers, each with 2 sub-layers. The first sub-layer is a multi-head self-attention mechanism and the second is a position-wise fully connected feed-forward network. Each sub-layer has a residual connection around it, followed by layer normalization.
A decoder with 6 identical layers, each with the same 2 sub-layers plus a third sub-layer that performs multi-head attention over the encoder's output. Each sub-layer again has a residual connection followed by layer normalization.
The transformer works with queries and sets of key-value pairs. The attention function maps a query and the key-value pairs to an output, computed as a weighted sum of the values, where each weight reflects the compatibility of the query with the corresponding key.
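A minimal PyTorch sketch of this scaled dot-product attention (the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    # Compatibility of each query with every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # Softmax turns the scores into weights that sum to 1 over the keys.
    weights = F.softmax(scores, dim=-1)
    # The output is the weighted sum of the values.
    return weights @ v

q = k = v = torch.randn(1, 8, 16, 64)  # toy tensors
out = scaled_dot_product_attention(q, k, v)  # shape: (1, 8, 16, 64)
```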
The Llama 2 transformer architecture implements grouped-query attention (GQA) instead of standard multi-head attention (MHA) to accelerate inference in the decoder. In the original GQA work, this is done by converting existing MHA model checkpoints to use GQA and then continuing training briefly.
The difference between multi-head, grouped-query, and multi-query attention is as follows:
Multi-head attention uses H query, key, and value heads
Multi-query attention shares a single key and value head across all query heads, an arrangement known to reduce model quality
Grouped-query attention shares one key and value head per group of query heads
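The following PyTorch sketch illustrates the grouping (the head counts are arbitrary). Each key/value head is repeated for the query heads in its group, after which attention proceeds as usual; because only the smaller set of key/value heads has to be cached during decoding, inference gets faster:

```python
import torch

batch, seq, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2  # 8 query heads share 2 key/value heads

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each key/value head for every query head in its group.
group_size = n_q_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=1)  # now (1, 8, 16, 64)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v

# n_kv_heads = 1 recovers multi-query attention;
# n_kv_heads = n_q_heads recovers standard multi-head attention.
```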
Pre-normalization Using Root Mean Square Layer Normalization (RMSNorm)
Layer normalization (LayerNorm) is a common technique in deep learning to stabilize training and enable convergence. However, layer normalization can slow the network down. Enter RMSNorm, which is simpler and more efficient than LayerNorm: it normalizes the summed inputs to a layer using only the root mean square statistic, skipping LayerNorm's mean subtraction and bias.
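A minimal RMSNorm module might look like the following; note that, unlike LayerNorm, there is no mean subtraction and no bias term, which is what makes it cheaper:

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x):
        # Divide by the root mean square of the activations along the
        # feature dimension, then apply the learnable per-feature gain.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```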
Tokenization Using SentencePiece
SentencePiece is a subword tokenizer and detokenizer for natural language processing. It supports multiple subword algorithms such as byte-pair encoding (BPE) and unigram. SentencePiece is fast and language-independent, and the size of the vocabulary is known before model training.
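For illustration, here is how a BPE tokenizer can be trained and used with the sentencepiece Python package; the corpus file is a placeholder, and the 32,000-token vocabulary mirrors the size used by Llama 2:

```python
import sentencepiece as spm

# Train a BPE model on a plain-text corpus ('corpus.txt' is a placeholder).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=32000,
    model_type="bpe",
)

# Load the trained model and tokenize a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Llama 2 is a large language model", out_type=str))
```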
Rotary Positional Embeddings
Rotary Position Embedding (RoPE) is a technique for encoding position information into the learning process of the network. The original transformer added this information through a predefined sinusoidal function; RoPE instead encodes position by rotating the query and key vectors with a position-dependent rotation matrix. Rotary position embeddings have been shown to perform better on long text classification tasks.
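A simplified sketch of the idea is below: pairs of adjacent embedding dimensions are rotated by an angle proportional to the token's position. (In practice the rotation is applied to the query and key vectors inside attention; the base frequency of 10,000 follows the RoFormer paper.)

```python
import torch

def rotary_embedding(x, base=10000):
    # x: (batch, seq_len, dim); dim must be even.
    _, seq_len, dim = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()  # each (seq_len, dim / 2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = torch.randn(1, 16, 64)
q_rot = rotary_embedding(q)  # same shape, position information encoded
```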
SwiGLU Activation Function
SwiGLU is a variant of the Gated Linear Unit (GLU) that uses Swish as the gating activation. This activation function is used in place of the standard ReLU to improve performance in the position-wise feed-forward network of the Llama transformer architecture.
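A minimal sketch of a SwiGLU feed-forward block, following the three-projection layout used in Llama-style implementations (PyTorch's silu is the Swish activation):

```python
import torch
import torch.nn.functional as F

class SwiGLUFeedForward(torch.nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = torch.nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = torch.nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = torch.nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        # Swish-gated linear unit: silu(x @ W1) gates x @ W3 elementwise.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```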
Reinforcement Learning with Human Feedback (RLHF)
RLHF is a technique applied to large language models so they can follow instructions and behave in line with human preferences. This is achieved by collecting human preference data, training a reward model on it, and then using that reward model to automate preference scoring during reinforcement learning. The reward model determines the helpfulness and safety of the model's responses.
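As a toy illustration of the reward-modeling step, the sketch below uses a small stand-in backbone (a GRU rather than an LLM) topped with a scalar head, and a pairwise loss that pushes the human-preferred response to score higher than the rejected one:

```python
import torch

class RewardModel(torch.nn.Module):
    def __init__(self, vocab_size=32000, dim=128):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.backbone = torch.nn.GRU(dim, dim, batch_first=True)  # LLM stand-in
        self.reward_head = torch.nn.Linear(dim, 1)  # scalar preference score

    def forward(self, token_ids):
        hidden, _ = self.backbone(self.embed(token_ids))
        return self.reward_head(hidden[:, -1])  # score from the last position

def preference_loss(chosen_reward, rejected_reward):
    # The chosen (human-preferred) response should outscore the rejected one.
    return -torch.nn.functional.logsigmoid(chosen_reward - rejected_reward).mean()
```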
Difference Between Llama 1 and Llama 2
The major differences between Llama 2 and Llama 1 are that Llama 2 uses grouped-query attention (in the larger variants) and doubles the context length. The context length for Llama 2 is 4,096 tokens, which enables the model to accept more information at once. This is critical for understanding long documents and remembering chat history.
Llama Fine-tuning Strategies
The talk of the town is that fine-tuning is the new training. Unless you have a GPU cluster like the one Meta used to train Llama, it's unlikely that you can create a new language model from scratch. Be that as it may, fine-tuning these large language models is itself challenging because of the computational power required. Parameter-Efficient Fine-Tuning (PEFT) methods are a set of strategies that make it possible to fine-tune large language models on lower compute.
Here are two strategies that have been proposed for efficiently fine-tuning large language models, making it possible to fine-tune them on a single T4 GPU such as the ones provided for free on Google Colab.
Low-Rank Adaptation (LoRA) freezes the weights of the pre-trained large language model and injects trainable rank decomposition matrices into each layer of the model. This reduces the number of trainable parameters for downstream tasks, unlike normal fine-tuning, which updates all the model parameters. LoRA doesn't degrade model quality or introduce inference latency the way some other adaptation methods do.
LoRA reduces storage requirements since the same base model can be shared across tasks while only the low-rank matrices A and B are swapped out. It also requires less compute because gradients don't need to be calculated, and optimizer state doesn't need to be maintained, for the frozen parameters; optimization only touches the injected low-rank matrices, which are much smaller. LoRA can also be combined with other methods such as prefix-tuning.
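As a sketch, here is roughly how LoRA can be applied with the Hugging Face peft library; the rank and target modules below are common illustrative choices, not values prescribed by the LoRA paper:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumes access to the Llama 2 weights on the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the injected A and B matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable
```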
QLoRA fine-tunes a quantized 4-bit model without impacting model quality. It works by quantizing the model to 4-bit precision and injecting a set of learnable low-rank adapter weights, which are then tuned by backpropagating gradients through the quantized weights. QLoRA introduces:
A 4-bit NormalFloat (NF4) data type that produces better results than 4-bit integers and 4-bit floats
Double quantization to further reduce the model's memory footprint
Paged optimizers for reducing memory spikes when processing mini-batches with long sequences
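Putting those pieces together, a QLoRA-style setup with the transformers, bitsandbytes, and peft libraries might look like the sketch below (the model name and hyperparameters are illustrative); a paged optimizer such as paged_adamw_32bit can then be selected in the training configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NormalFloat quantization with double quantization, as in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# Learnable LoRA adapters are trained on top of the frozen 4-bit base model.
model = get_peft_model(
    model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
)
```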
Other parameter-efficient fine-tuning methods are also worth considering.
Closed language models took the LLM space by storm, but open-source models such as Llama 2 have started eating into their pie. Their performance is catching up and will inevitably be on par with, or even surpass, closed models. The advancement of these open-source models is critical both for research and for enabling individuals and companies to fine-tune the models and deploy them with their private data. Open-source models are also important because they can be evaluated for potential biases and readily improved by the open-source community.