Machine Learning & AI · Advanced

Fine-Tuning LLMs: LoRA, QLoRA & Parameter-Efficient Methods

Learn to adapt large language models to your domain without training from scratch. Master LoRA, QLoRA, and the full fine-tuning pipeline from data to deployment.

22 min read
March 17, 2026
Fine-Tuning · LoRA · QLoRA · PEFT · Hugging Face · Python

Why Fine-Tune?

Pretraining gives an LLM broad knowledge, but fine-tuning specializes it. Fine-tuning on domain-specific data (medical, legal, code, customer support) dramatically improves performance on targeted tasks. But full fine-tuning of a 70B model requires 100+ GB of VRAM. Parameter-efficient methods like LoRA make it practical on a single GPU.

LoRA: Low-Rank Adaptation

LoRA freezes all original model weights and injects small trainable matrices into each attention layer. Instead of updating a d×d weight matrix W, LoRA adds BA where B is d×r and A is r×d, with rank r << d (typically 8-64). This reduces trainable parameters by 100-1000x while maintaining 95%+ of full fine-tuning quality. Only the LoRA adapters are stored and loaded at inference.

python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-Rank Adaptation layer."""

    def __init__(self, original: nn.Linear, rank: int = 16, alpha: float = 32):
        super().__init__()
        self.original = original
        self.original.weight.requires_grad_(False)  # Freeze original weights
        if self.original.bias is not None:
            self.original.bias.requires_grad_(False)  # Freeze the bias too

        d_in, d_out = original.in_features, original.out_features
        # A starts small and random, B starts at zero, so the adapter
        # A @ B contributes nothing at step 0 and training begins from
        # the unmodified base model.
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        original_out = self.original(x)
        lora_out = (x @ self.lora_A @ self.lora_B) * self.scale
        return original_out + lora_out


# Example: 4096×4096 attention weight has 16.7M params
# LoRA with rank 16: only 131K trainable params (0.78%)

LoRA rank selection: r=8 for simple tasks (classification, sentiment). r=16-32 for complex tasks (instruction following, code). r=64+ for aggressive domain adaptation. Higher rank = more capacity but slower training.
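To make the rank/capacity tradeoff concrete, here is a back-of-the-envelope parameter count (plain Python, no libraries; the 4096×4096 layer mirrors the example above):

```python
# LoRA trainable-parameter count for one d_in x d_out weight matrix:
# the full matrix has d_in * d_out params; LoRA adds only r * (d_in + d_out).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

d = 4096  # e.g. an attention projection in an 8B-class model
full = d * d
for r in (8, 16, 32, 64):
    added = lora_params(d, d, r)
    print(f"r={r:>2}: {added:>9,} trainable params ({added / full:.2%} of full)")
# r=16 gives 131,072 params (0.78% of the 16.7M full matrix), matching
# the figure quoted above; doubling the rank doubles the adapter size.
```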

QLoRA: Fine-Tuning on Consumer Hardware

QLoRA combines 4-bit quantization (NormalFloat4) with LoRA. The base model is quantized to 4-bit precision, reducing memory by 4x, while LoRA adapters remain in full precision. This enables fine-tuning a 65B parameter model on a single 48GB GPU. QLoRA matches full 16-bit fine-tuning quality on benchmarks.
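A rough memory estimate shows why 4-bit matters. This simplified sketch counts weight storage only, ignoring activations, adapter optimizer state, and quantization-constant overhead, so real usage is somewhat higher:

```python
def weight_memory_gib(n_params: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3

for n, name in [(8e9, "8B"), (65e9, "65B")]:
    fp16 = weight_memory_gib(n, 16)
    nf4 = weight_memory_gib(n, 4)
    print(f"{name}: fp16 ~{fp16:.0f} GiB, 4-bit ~{nf4:.0f} GiB")
# 65B weights drop from ~121 GiB in fp16 to ~30 GiB at 4-bit,
# leaving headroom on a 48 GB card for adapters and activations.
```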

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output (approximate): trainable params: ~13.6M || all params: ~8.0B || trainable%: ~0.17%

Preparing Training Data

Fine-tuning data should be formatted as instruction-response pairs. Quality matters far more than quantity — 1,000 high-quality examples often outperform 100,000 noisy ones. Common formats: Alpaca (instruction/input/output), ShareGPT (conversations), and chat templates specific to each model family.
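As an illustration, a small helper can render Alpaca-style records into the Llama 3 chat text used in the example below. This is a hand-rolled sketch; in practice the tokenizer's `apply_chat_template` does this for you, and the token layout here simply mirrors the example, not an official specification:

```python
def alpaca_to_llama3(record: dict, system: str = "You are a helpful assistant.") -> str:
    """Render an Alpaca-style {instruction, input, output} record as Llama 3 chat text."""
    user = record["instruction"]
    if record.get("input"):  # Alpaca's optional context field
        user += "\n\n" + record["input"]
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n{record['output']}<|eot_id|>"
    )

example = {
    "instruction": "What are common symptoms of type 2 diabetes?",
    "input": "",
    "output": "Increased thirst, frequent urination, fatigue, and blurred vision.",
}
text = alpaca_to_llama3(example, system="You are a helpful medical assistant.")
```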

python
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

# Prepare instruction dataset
data = [
    {
        "text": """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful medical assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What are common symptoms of type 2 diabetes?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Common symptoms include increased thirst, frequent urination,
unexplained weight loss, fatigue, and blurred vision.<|eot_id|>"""
    },
    # ... more examples
]

dataset = Dataset.from_list(data)

# Configure training
training_config = SFTConfig(
    output_dir="./lora-medical",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_config,
)

trainer.train()
model.save_pretrained("./lora-medical")

Merging and Deploying LoRA Adapters

python
from peft import PeftModel

# Load base model + LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-medical")

# Merge LoRA into base model (creates a standalone model)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-medical-merged")

# Or serve with adapter swapping for multi-tenant deployments
# Different LoRA adapters for different domains — same base model
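Merging works because the adapter is purely additive: folding (α/r)·AB into the frozen weight produces exactly the same outputs as keeping the adapter separate. A small numerical check with toy dimensions (plain torch, following the A/B layout of the LoRALayer code above):

```python
import torch

d, r, alpha = 64, 8, 16
W = torch.randn(d, d)             # frozen base weight
A = torch.randn(d, r) * 0.01      # LoRA factors
B = torch.randn(r, d)
scale = alpha / r

x = torch.randn(4, d)
out_adapter = x @ W + (x @ A @ B) * scale  # base + adapter at runtime
W_merged = W + (A @ B) * scale             # fold the adapter into the weight once
out_merged = x @ W_merged                  # standalone merged model

print(torch.allclose(out_adapter, out_merged, atol=1e-5))  # True
```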

Fine-tuning decision tree:

- Need to follow instructions? → SFT.
- Need domain knowledge not in pretraining? → SFT on domain data.
- Need a specific output format? → SFT with format examples.
- Need better alignment? → DPO/RLHF after SFT.
- Just need to use existing knowledge better? → Try prompting first before fine-tuning.