RLHF & DPO: How LLMs Learn to Follow Instructions
Understand the alignment techniques that turn base models into helpful assistants — Reinforcement Learning from Human Feedback, Direct Preference Optimization, and reward modeling.
The Alignment Problem
A pretrained LLM predicts the next token — it doesn't inherently try to be helpful, harmless, or honest. It can complete text in any style: toxic, misleading, or brilliant. Alignment is the process of training the model to produce outputs humans prefer. Without alignment, ChatGPT would be a text autocomplete, not an assistant.
The RLHF Pipeline
RLHF has three stages: (1) Supervised Fine-Tuning (SFT) — train on high-quality instruction-response pairs to establish basic instruction-following. (2) Reward Model Training — train a separate model to score responses as humans would, using comparison data (response A > response B). (3) RL Optimization — use PPO (Proximal Policy Optimization) to maximize the reward model's score while staying close to the SFT model (KL penalty prevents reward hacking).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A reward model that scores responses on a scalar scale."""
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model  # Same architecture as the LLM
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        # Use the last token's hidden state as the sequence representation
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)  # Scalar reward per sequence

# Training: given pairs (chosen_response, rejected_response)
# Loss = -log(sigmoid(reward_chosen - reward_rejected))
def reward_loss(reward_chosen, reward_rejected):
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

The reward model is the most critical component. If it's miscalibrated, the LLM will exploit its weaknesses (reward hacking). This is why companies invest heavily in human annotation quality and diverse preference data.
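The RL stage then optimizes this reward with PPO while a KL penalty keeps the policy close to the SFT model. A minimal sketch of the shaped reward, assuming per-token log-probabilities of the sampled tokens are available from the policy and the frozen SFT reference (the function name and signature are illustrative, not from a specific library):

```python
import torch

def kl_shaped_reward(reward_model_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Shape the reward used by PPO: the reward model's scalar score minus a
    per-token KL penalty that keeps the policy close to the SFT reference.

    policy_logprobs / ref_logprobs: log-probs of the sampled tokens, shape (seq_len,).
    reward_model_score: scalar score for the full response.
    """
    # Per-token estimate of KL(policy || reference) at the sampled tokens
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -kl_coef * kl_per_token   # penalty at every token
    rewards[-1] += reward_model_score   # RM score is credited at the final token
    return rewards

# Example: tokens where the policy drifts above the reference get penalized
policy_lp = torch.tensor([-1.0, -0.5, -2.0])
ref_lp = torch.tensor([-1.2, -0.5, -1.0])
r = kl_shaped_reward(1.5, policy_lp, ref_lp)
```

This is why reward hacking is bounded: any gain in reward-model score is paid for token-by-token in KL distance from the SFT model.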
DPO: Simpler Alignment Without RL
Direct Preference Optimization (DPO) eliminates the need for a separate reward model and RL loop. It directly optimizes the language model on preference pairs using a classification-like loss. DPO is simpler to implement, more stable, and often matches RLHF performance. It's become the default alignment method for many open-source models.
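Concretely, DPO treats the log-probability ratio between the policy and a frozen reference as an implicit reward, and applies the same pairwise loss a reward model would use. A minimal sketch, assuming you have summed log-probabilities of each response under both models:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of a response (chosen or
    rejected) under the policy or the frozen reference model, shape (batch,).
    """
    # Implicit rewards: beta * log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_rewards = beta * (policy_rejected_lp - ref_rejected_lp)
    # Classification-style loss: push chosen above rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: the policy slightly prefers chosen relative to the reference
loss = dpo_loss(
    policy_chosen_lp=torch.tensor([-10.0]),
    policy_rejected_lp=torch.tensor([-14.0]),
    ref_chosen_lp=torch.tensor([-11.0]),
    ref_rejected_lp=torch.tensor([-13.0]),
)
```

Higher beta pins the policy more tightly to the reference; lower beta lets it move further to satisfy the preferences. In practice you rarely write this by hand, because trl packages it: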
from datasets import Dataset
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load SFT model (already fine-tuned on instructions)
model = AutoModelForCausalLM.from_pretrained("./sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("./sft-model")  # Frozen reference
tokenizer = AutoTokenizer.from_pretrained("./sft-model")

# DPO preference dataset format:
# Each example has: prompt, chosen (preferred response), rejected (worse response)
dpo_dataset = Dataset.from_list([
    {
        "prompt": "Explain quantum computing simply.",
        "chosen": "Quantum computers use qubits that can be 0, 1, or both at once...",
        "rejected": "Quantum computing is a complex field involving Hilbert spaces...",
    },
    # ... thousands of preference pairs
])

training_args = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,  # Very low LR for alignment
    beta=0.1,  # KL penalty strength
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Reference model for KL divergence
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
trainer.train()

RLHF vs DPO: When to Use Which
RLHF is more flexible and can optimize complex reward signals (including multi-objective rewards), but it is notoriously unstable and requires careful tuning of PPO hyperparameters. DPO is simpler, more stable, and works well for standard alignment. Use RLHF when you have a custom reward model or need fine-grained control over the reward signal. Use DPO when you have preference pairs and want straightforward alignment. Most open-source models now use DPO or its variants (IPO, KTO).
The full pipeline in practice: Pretrain (1-3 months, millions of dollars) → SFT (days, instruction data) → DPO/RLHF (days, preference data) → Safety evaluation → Red-teaming → Deployment. Alignment is a continuous process, not a one-time step.