Fine-Tuning LLMs: LoRA, QLoRA & Parameter-Efficient Methods
Learn to adapt large language models to your domain without training from scratch. Master LoRA, QLoRA, and the full fine-tuning pipeline from data to deployment.
Why Fine-Tune?
Pretraining gives an LLM broad knowledge; fine-tuning specializes it. Fine-tuning on domain-specific data (medical, legal, code, customer support) dramatically improves performance on targeted tasks. But full fine-tuning of a 70B model needs hundreds of gigabytes of GPU memory once gradients and optimizer state are counted. Parameter-efficient methods like LoRA make adaptation practical on a single GPU.
LoRA: Low-Rank Adaptation
LoRA freezes all original model weights and injects small trainable matrices alongside selected weight matrices (typically the attention projections). Instead of updating a d×d weight matrix W, LoRA learns a low-rank update BA, where B is d×r and A is r×d, with rank r << d (typically 8-64). This cuts trainable parameters by 100-1000x while typically retaining 95%+ of full fine-tuning quality. Only the small LoRA adapters need to be stored and loaded at inference.
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-Rank Adaptation wrapper around a frozen nn.Linear."""

    def __init__(self, original: nn.Linear, rank: int = 16, alpha: float = 32):
        super().__init__()
        self.original = original
        for p in self.original.parameters():  # Freeze weight (and bias, if present)
            p.requires_grad_(False)
        d_in, d_out = original.in_features, original.out_features
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # Small random init
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))        # Zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        original_out = self.original(x)
        lora_out = (x @ self.lora_A @ self.lora_B) * self.scale
        return original_out + lora_out

# Example: a 4096×4096 attention weight has 16.7M params.
# LoRA with rank 16 trains only 131K params (0.78%).

LoRA rank selection: r=8 suits simple tasks (classification, sentiment); r=16-32 suits complex tasks (instruction following, code); r=64+ suits aggressive domain adaptation. Higher rank means more capacity but slower training and larger adapters.
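The trade-off behind rank selection is easy to quantify: for a d_in × d_out projection, a rank-r adapter adds r × (d_in + d_out) trainable parameters. A quick sketch using the 4096×4096 dimensions from the example above (the helper name is ours):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params added by one LoRA adapter: A is d_in×r, B is r×d_out."""
    return rank * (d_in + d_out)

full = 4096 * 4096  # dense 4096×4096 weight: 16.7M params
for r in (8, 16, 32, 64):
    added = lora_params(4096, 4096, r)
    print(f"rank {r:2d}: {added:>7,} trainable ({added / full:.2%} of the frozen weight)")
```

Even at rank 64, the adapter stays around 3% of the size of the weight it modifies, which is why adapters for many domains can be stored cheaply.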
QLoRA: Fine-Tuning on Consumer Hardware
QLoRA combines 4-bit quantization (NormalFloat4) with LoRA. The base model is quantized to 4-bit precision, cutting weight memory roughly 4x relative to 16-bit, while the LoRA adapters train in full precision. This makes it possible to fine-tune a 65B-parameter model on a single 48GB GPU, and QLoRA matches full 16-bit fine-tuning quality on the benchmarks reported in the paper.
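The memory arithmetic is worth making concrete: weights alone cost about 2 bytes per parameter in bf16 but roughly 0.5 bytes per parameter in 4-bit (plus small overhead for quantization constants, which double quantization shrinks further). A back-of-the-envelope sketch, counting weights only (activations, gradients, and optimizer state come on top):

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB (weights only, no activations or optimizer)."""
    return n_params * bits_per_param / 8 / 1024**3

n = 8e9  # an 8B-parameter model
print(f"bf16: {weight_memory_gib(n, 16):.1f} GiB")  # ~14.9 GiB
print(f"nf4 : {weight_memory_gib(n, 4):.1f} GiB")   # ~3.7 GiB
```

This is why an 8B base model in NF4 plus full-precision adapters fits comfortably on a single consumer GPU.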
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization of the quantization constants
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Apply LoRA to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts — well under 1% trainable

Preparing Training Data
Fine-tuning data should be formatted as instruction-response pairs. Quality matters far more than quantity — 1,000 high-quality examples often outperform 100,000 noisy ones. Common formats: Alpaca (instruction/input/output), ShareGPT (conversations), and chat templates specific to each model family.
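As a sketch of how a raw Alpaca-style record maps onto the Llama 3 chat format used below, here is a hypothetical hand-rolled helper; in practice you would let the tokenizer render this via its own chat template rather than building the string yourself:

```python
def alpaca_to_llama3(instruction: str, output: str,
                     system: str = "You are a helpful assistant.") -> str:
    """Render an instruction/output pair in the Llama 3 chat layout (hand-rolled sketch)."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"{system}<|eot_id|>\n"
        "<|start_header_id|>user<|end_header_id|>\n"
        f"{instruction}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
        f"{output}<|eot_id|>"
    )

text = alpaca_to_llama3(
    "What are common symptoms of type 2 diabetes?",
    "Increased thirst, frequent urination, fatigue, and blurred vision.",
    system="You are a helpful medical assistant.",
)
print(text)
```

Getting the template exactly right matters: a model fine-tuned with one delimiter layout and served with another will degrade noticeably, so prefer the model family's official template.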
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

# Prepare instruction dataset (Llama 3 chat format)
data = [
    {
        "text": """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful medical assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What are common symptoms of type 2 diabetes?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Common symptoms include increased thirst, frequent urination,
unexplained weight loss, fatigue, and blurred vision.<|eot_id|>"""
    },
    # ... more examples
]
dataset = Dataset.from_list(data)

# Configure training
training_config = SFTConfig(
    output_dir="./lora-medical",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16 per device
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_config,
)
trainer.train()
model.save_pretrained("./lora-medical")

Merging and Deploying LoRA Adapters
from peft import PeftModel

# Load base model + LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-medical")

# Merge LoRA into base model (creates a standalone model)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-medical-merged")

# Or serve with adapter swapping for multi-tenant deployments:
# different LoRA adapters for different domains, same base model.

Fine-tuning decision tree:
- Need to follow instructions? → SFT.
- Need domain knowledge not in pretraining? → SFT on domain data.
- Need a specific output format? → SFT with format examples.
- Need better alignment? → DPO/RLHF after SFT.
- Just need to use existing knowledge better? → Try prompting first before fine-tuning.
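The merge step above is just arithmetic on the weights: folding the scaled low-rank product into the dense matrix yields a layer that is numerically identical to base-plus-adapter. A minimal sketch with toy dimensions (nonzero B is used here, unlike at training start, so the check is meaningful):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, r, scale = 8, 2, 32 / 2  # toy dims; alpha=32, rank=2
layer = nn.Linear(d, d, bias=False)  # stands in for a frozen projection
A = torch.randn(d, r) * 0.01         # LoRA A: input -> rank
B = torch.randn(r, d)                # LoRA B: rank -> output

x = torch.randn(4, d)
adapter_out = layer(x) + (x @ A @ B) * scale  # base + adapter path

# Merge: fold the low-rank update into the dense weight.
# nn.Linear stores weight as (out, in), so transpose the (in, out) product.
merged = nn.Linear(d, d, bias=False)
with torch.no_grad():
    merged.weight.copy_(layer.weight + scale * (A @ B).T)

print(torch.allclose(adapter_out, merged(x), atol=1e-5))  # True
```

This equivalence is why a merged model has zero inference overhead, while keeping adapters separate (un-merged) is what enables per-request adapter swapping.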