Until recently, fine-tuning large language models (LLMs) on a single GPU was a pipe dream because of their colossal memory and storage requirements. For example, fine-tuning a 65B parameter Llama model requires 780 GB of GPU memory. The ongoing GPU shortage, driven by the current wave of generative models, has only made the problem worse. That all changed with the arrival of LoRA, which makes it possible to fine-tune large language models on a single GPU, such as the ones offered for free by Google Colab and Kaggle notebooks.
This deep dive examines the LoRA technique for fine-tuning large language models such as Llama. Later, you'll explore the code and try it yourself.
Why Fine-tune?
There are three main reasons why you'd consider fine-tuning a large language model:
Reduce hallucinations, particularly when you pose questions the model hasn't seen in its training data
Make the model suitable for a particular use case, for example, fine-tuning on private company data
Remove undesirable behavior or add desirable behavior
Fine-tuning vs. Prompt Engineering
Compared to fine-tuning, prompt engineering is less expensive because there is no upfront cost in terms of hardware acceleration. However, you can expose the model to far more data through fine-tuning than you can fit in a prompt. The cost in production is also much lower if fine-tuning results in a smaller, optimized model.
Large Language Model Fine-tuning Strategies
Several methods have been proposed for fine-tuning large language models. One of them is LoRA (Low-Rank Adaptation of Large Language Models).
LoRA allows you to train weights specific to your use case and later merge them with the original model. The fact that you are training fewer weights compared to all the model weights makes it possible to use LoRA to fine-tune large language models on a single GPU.
Fine-tuning Large Language Models With LoRA
LoRA works by freezing the weights of the language model and introducing new trainable matrices into the transformer layers. This reduces the number of trainable parameters and, with it, the memory requirement, making fine-tuning possible with less GPU compute. LoRA is different from prior methods because it doesn't introduce inference latency: the trained matrices can be merged back into the original weights.
LoRA trains dense layers in the neural network indirectly through rank-decomposition matrices of those layers: only the A and B matrices are trained, leaving the pre-trained weights frozen.
LoRA makes it possible to use the same model for different tasks by swapping the LoRA weights, reducing the storage required for maintaining multiple fine-tuned models. Training with LoRA is also faster than full fine-tuning because only the LoRA matrices are being optimized. The method can also be combined with other techniques, such as quantization, as you will see during fine-tuning.
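To make adapter swapping concrete, here is a minimal sketch using the `peft` library; the adapter paths and names below are hypothetical placeholders:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model once
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach LoRA weights for one task (hypothetical adapter paths)
model = PeftModel.from_pretrained(base, "adapters/summarization", adapter_name="summarization")

# Load a second set of LoRA weights for a different task
model.load_adapter("adapters/qa", adapter_name="qa")

# Switch tasks without reloading the multi-gigabyte base model
model.set_adapter("qa")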
The forward pass with LoRA's low-rank decomposition is computed as:

h = W0x + ∆Wx = W0x + BAx

where:

W0 is the pre-trained weight matrix
∆W is the accumulated gradient update during adaptation, decomposed into B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k)
r is the rank of the LoRA module, a number that you can tune during training, with r ≪ min(d, k)

W0 is frozen during training while A and B contain the trainable parameters. A is initialized using a random Gaussian while B is set to zero at the beginning of training. For simplicity, LoRA is only applied to the query and value matrices of the transformer, meaning that the multi-layer perceptron is frozen and only the attention weights are adapted.
In LoRA, a small set of trainable parameters (adapters) is introduced into the model while the pre-trained weights remain frozen. The loss function is optimized by passing gradients through the frozen model into the adapters.
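To make the mechanics concrete, below is a minimal sketch of a LoRA-augmented linear layer in plain PyTorch. This is an illustrative toy rather than the `peft` implementation, and the class name and default hyperparameters are assumptions:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W0
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: A is Gaussian, B starts at zero
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update

Because B starts at zero, the low-rank update is zero at initialization, so training begins from the unmodified pre-trained model.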
Fine-tuning Large Language Models With QLoRA
QLoRA goes further and proposes the use of 4-bit quantized weights. Quantization is a model optimization technique that reduces the precision of the values in the network, for example from 32-bit floats to 8-bit integers, which shrinks the model and speeds up inference. QLoRA enables the fine-tuning of a 4-bit quantized large language model without degrading its performance. It works by quantizing the model and then adding trainable Low-Rank Adapter weights. The adapters are trained by backpropagating gradients through the quantized weights.
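As a rough illustration of what quantization does, the sketch below applies simple absmax 8-bit quantization to a handful of values; this is a simplified scheme for intuition, not the NF4 algorithm QLoRA actually uses:

import torch

weights = torch.tensor([0.5, -1.2, 0.03, 2.4])  # 32-bit floats

# Absmax quantization: scale values into the int8 range [-127, 127]
scale = 127 / weights.abs().max()
q = torch.round(weights * scale).to(torch.int8)  # stored in 8 bits
dequantized = q.float() / scale  # approximate recovery of the originals

print(q)            # e.g. tensor([ 26, -64,   2, 127], dtype=torch.int8)
print(dequantized)  # close to the original values, up to rounding error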
With QLoRA, a 65B parameter model can be fine-tuned with 48GB of VRAM compared to the 780GB needed previously. QLoRA introduces:

A 4-bit NormalFloat (NF4) data type that performs better than 4-bit integers and 4-bit floats
Double quantization to further reduce the model size
Paged optimizers to prevent memory spikes, avoiding the infamous "CUDA out of memory" error
Fine-tuning Llama 2 on a Single GPU
In the last issue, we covered the technical details of Llama 2. Next, we look at how to fine-tune the Llama 2 model on a single GPU. You can follow along using this Kaggle Notebook.
First, install all the required packages:
`transformers` for loading the Llama model
`peft` for parameter-efficient fine-tuning
`trl` for supervised fine-tuning
`accelerate` for placing the model and inputs on the required devices
`bitsandbytes` to quantize the Llama model to 4-bit
pip install peft transformers trl accelerate bitsandbytes
Import the required modules from these packages:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
    BitsAndBytesConfig
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
Preparing Data for Fine-tuning Large Language Models
According to the Llama 2 paper, we can prepare the data by combining a system prompt with the user's instructions.
You therefore have to prepare the data in this format:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message }} [/INST]
</s>
You can choose any appropriate dataset. In this case, we use the Lamini docs dataset, which contains questions and answers about the Lamini documentation.
Load the dataset from Hugging Face:
qa_data = load_dataset('lamini/lamini_docs', split="train")
Prepare a prompt template in the required format:
prompt_template = """ <s>[INST] <<SYS>> You are a honest and helpful assistant who helps users find answers quickly from the given docs about Lamini.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
If the answer can not be found in the text please respond with `Let's keep the discussion relevant to Lamini docs`. <</SYS>>
### Question: {question}
### Answer: {answer}
[/INST] </s>
"""
Populate the prompt with the docs data:
import pandas as pd
from pprint import pprint

# Convert the dataset into a dictionary of columns
df = pd.DataFrame(qa_data)
examples = df.to_dict()
num_examples = len(examples["question"])

# Fill the prompt template with each question-answer pair
qa_finetuning_dataset = []
for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]
    text_with_prompt_template = prompt_template.format(question=question, answer=answer)
    qa_finetuning_dataset.append({"text": text_with_prompt_template})

print("One sample from the data:")
pprint(qa_finetuning_dataset[0])
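If you want to turn this list into a Hugging Face dataset and upload it yourself, a minimal sketch is shown below; the repository name is a hypothetical placeholder and pushing requires logging in to the Hub first:

from datasets import Dataset

# Build a Hugging Face dataset from the list of formatted examples
prepared = Dataset.from_list(qa_finetuning_dataset)

# Hypothetical repository name; run `huggingface-cli login` beforehand
prepared.push_to_hub("your-username/lamini_llama_prompts")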
For simplicity, I have already prepared the dataset and uploaded it to Hugging Face. Load it using the datasets library:
data_name = "mwitiderrick/lamini_llama"
training_data = load_dataset(data_name, split="train")
Quantizing the Llama Model
Loading a quantized model reduces the GPU VRAM requirement, making training possible, and faster, with less GPU RAM. This is achieved by defining a `BitsAndBytesConfig` and passing it to the `from_pretrained` function when loading the model.
In this case, we pass the following:

`load_in_4bit` to load the model in 4-bit
`bnb_4bit_quant_type` as `nf4` (normalized float 4)
`bnb_4bit_compute_dtype`, which can be float16, bfloat16, or float32 because computations are done in either 16 or 32-bit
`bnb_4bit_use_double_quant` to determine whether double quantization should be applied
You can change these parameters and observe how they affect training time and GPU VRAM.
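One way to observe those effects, sketched below under the assumption of a single CUDA device, is to check peak GPU memory with PyTorch's built-in counters around the step you care about:

# Reset the peak-memory counter before the step you want to measure
torch.cuda.reset_peak_memory_stats()

# ... load the model or run a training step here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory allocated: {peak_gb:.2f} GB")

The quantization configuration itself looks like this: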
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the model weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat 4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run computations in 16-bit
    bnb_4bit_use_double_quant=False        # skip double quantization
)
Load the model and tokenizer:
llama_base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Path to save the new model / adapter weights
optimized_llama_model = "llama-2-7b-chat-mwitiderrick-lamini"
llama_tokenizer = AutoTokenizer.from_pretrained(llama_base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"
llama_base_model = AutoModelForCausalLM.from_pretrained(
    llama_base_model_name,
    quantization_config=quant_config,
    device_map={"": 0}  # place the entire model on GPU 0
)
# Disable the KV cache during training and use a single tensor-parallel rank
llama_base_model.config.use_cache = False
llama_base_model.config.pretraining_tp = 1
Define LoRA Configuration
LoRA parameters are set using the `LoraConfig`:

`lora_alpha` is the LoRA scaling factor
`lora_dropout` is the dropout probability of the LoRA layers
`r` is an integer rank that dictates the size of the update matrices; a lower rank means fewer trainable parameters
`bias` determines which biases will be trained; the options are `none`, `all`, or `lora_only`
`task_type` is `CAUSAL_LM` because the model is a causal LLM
You can also tweak these parameters to see how they affect training and the quality of the final model, but these are good defaults to start with:
# LoRA Config
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
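To see how few parameters this configuration actually trains, one option, sketched below, is to wrap the model with `get_peft_model` and print the counts. Note that `SFTTrainer` applies the config for you later, so this step is purely diagnostic:

from peft import get_peft_model

# Wrap the base model with the LoRA config to inspect parameter counts
inspect_model = get_peft_model(llama_base_model, peft_config)
inspect_model.print_trainable_parameters()
# Prints trainable vs. total parameters; with r=8 on a 7B model,
# the trainable fraction is well under 1%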
Llama Fine-tuning Parameters
Define the parameters for training the Llama model. Notice the use of the paged optimizer mentioned earlier.
# Training Params
training_params = TrainingArguments(
    output_dir="./llama_finetuning",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",  # paged optimizer from QLoRA to prevent memory spikes
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,  # -1 derives the number of steps from num_train_epochs
    warmup_ratio=0.03,
    group_by_length=True,  # batch similar-length sequences to reduce padding
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
Train Llama Model on Custom Data
Next, fine-tune the model using `SFTTrainer` while passing the:
Llama model
Training data
PEFT configuration
Column in the dataset to target
Training parameters
Tokenizer
# Trainer
llama_fine_tuning = SFTTrainer(
    model=llama_base_model,
    train_dataset=training_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=training_params
)
# Training
llama_fine_tuning.train()
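The trained LoRA adapter is not written to the adapter path automatically at the end of training (only periodic checkpoints land in the output directory). Assuming you want to reuse the adapter for the merge step below, a minimal sketch is to save it explicitly:

# Save the trained adapter weights and tokenizer to the path defined earlier
llama_fine_tuning.model.save_pretrained(optimized_llama_model)
llama_fine_tuning.tokenizer.save_pretrained(optimized_llama_model)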
Fine-tuning the model for 2 epochs takes roughly 56 minutes on Kaggle notebooks and Colab; 10 epochs require about 6 hours.
Running Inference on Fine-tuned Llama Model
The fine-tuned Llama 2 model can be used for inference immediately:
query = "How can I evaluate the performance and quality of the generated text from Lamini models"
text_gen = pipeline(task="text-generation", model=llama_base_model, tokenizer=llama_tokenizer, max_length=4096)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])
Merge LoRA Weights With Pre-trained Model
To deploy this model, you need to merge the LoRA weights with the original model. You can then save the merged model locally or upload it to cloud storage, such as your Hugging Face account.
Merging requires reloading the base model at higher precision (FP16 here); you will get an error if you try to merge the trained LoRA weights into the 4-bit model.
# Merge and save the fine-tuned model
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    llama_base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
llama_model = PeftModel.from_pretrained(base_model, optimized_llama_model)
llama_model = llama_model.merge_and_unload()
# Reload tokenizer to save it
llama_tokenizer = AutoTokenizer.from_pretrained(llama_base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"
# Save the merged model
llama_model.save_pretrained(optimized_llama_model)
llama_tokenizer.save_pretrained(optimized_llama_model)
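If you'd rather host the merged model on your Hugging Face account, as mentioned above, a minimal sketch is shown below; the repository name is a hypothetical placeholder and you need to be logged in to the Hub:

# Hypothetical repository name; run `huggingface-cli login` first
llama_model.push_to_hub("your-username/llama-2-7b-chat-lamini")
llama_tokenizer.push_to_hub("your-username/llama-2-7b-chat-lamini")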
Despite deleting the previous models and clearing the GPU VRAM, merging the model doesn't work on Google Colab or Kaggle notebooks. If you get it to work, I'd love to know which tricks you used. However, I got it to work by renting a GPU for a couple of minutes on vast.ai, an affordable alternative if you'd like to merge the model successfully. When you land there, choose a GPU with at least 24GB of VRAM; I used one with 48GB.
# Free the memory held by the quantized model and trainer before merging
del llama_base_model
del llama_fine_tuning

import gc
gc.collect()
torch.cuda.empty_cache()  # release cached GPU memory back to the driver
Final Thoughts
Parameter-efficient fine-tuning techniques have democratized large language models. With these methods, it is possible to fine-tune 7-billion-parameter models on a single GPU and end up with an accurate model. This article has discussed these methods mainly as they relate to large language models, but their use isn't limited to LLMs; they can also be used to fine-tune other large models, such as computer vision models.
Have you tried these methods to fine-tune large language or computer vision models? Let me know by replying to this email or leaving a comment below.