RLHF & DPO: How LLMs Learn to Follow Instructions
Understand the alignment techniques that turn base models into helpful assistants — Reinforcement Learning from Human Feedback, Direct Preference Optimization, and reward modeling.
The Alignment Problem
A pretrained LLM predicts the next token — it doesn't inherently try to be helpful, harmless, or honest. It can complete text in any style: toxic, misleading, or brilliant. Alignment is the process of training the model to produce outputs humans prefer. Without alignment, ChatGPT would be a text autocomplete, not an assistant.
The RLHF Pipeline
RLHF has three stages: (1) Supervised Fine-Tuning (SFT) — train on high-quality instruction-response pairs to establish basic instruction-following. (2) Reward Model Training — train a separate model to score responses as humans would, using comparison data (response A > response B). (3) RL Optimization — use PPO (Proximal Policy Optimization) to maximize the reward model's score while staying close to the SFT model (KL penalty prevents reward hacking).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """A reward model that scores responses on a scalar scale."""
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model  # Same architecture as the LLM
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        # Use the last token's hidden state as the sequence representation
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)  # Scalar reward per sequence

# Training: given pairs (chosen_response, rejected_response)
# Loss = -log(sigmoid(reward_chosen - reward_rejected))
def reward_loss(reward_chosen, reward_rejected):
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

The reward model is the most critical component. If it's miscalibrated, the LLM will exploit its weaknesses (reward hacking). This is why companies invest heavily in human annotation quality and diverse preference data.
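The RL stage then optimizes this reward with PPO while a KL penalty keeps the policy close to the SFT model. A minimal sketch of the shaped reward, assuming per-token log-probabilities of the sampled tokens are available from the policy and the frozen SFT reference (the function name and signature are illustrative, not from a specific library):

```python
import torch

def kl_shaped_reward(reward_model_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Shape the reward used by PPO: the reward model's scalar score minus a
    per-token KL penalty that keeps the policy close to the SFT reference.

    policy_logprobs / ref_logprobs: log-probs of the sampled tokens, shape (seq_len,).
    reward_model_score: scalar score for the full response.
    """
    # Per-token estimate of KL(policy || reference) at the sampled tokens
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -kl_coef * kl_per_token   # penalty at every token
    rewards[-1] += reward_model_score   # RM score is credited at the final token
    return rewards

# Example: tokens where the policy drifts above the reference get penalized
policy_lp = torch.tensor([-1.0, -0.5, -2.0])
ref_lp = torch.tensor([-1.2, -0.5, -1.0])
r = kl_shaped_reward(1.5, policy_lp, ref_lp)
```

This is why reward hacking is bounded: any gain in reward-model score is paid for token-by-token in KL distance from the SFT model.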
DPO: Simpler Alignment Without RL
Direct Preference Optimization (DPO) eliminates the need for a separate reward model and RL loop. It directly optimizes the language model on preference pairs using a classification-like loss. DPO is simpler to implement, more stable, and often matches RLHF performance. It's become the default alignment method for many open-source models.
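Concretely, DPO treats the log-probability ratio between the policy and a frozen reference as an implicit reward, and applies the same pairwise loss a reward model would use. A minimal sketch, assuming you have summed log-probabilities of each response under both models:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of a response (chosen or
    rejected) under the policy or the frozen reference model, shape (batch,).
    """
    # Implicit rewards: beta * log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_rewards = beta * (policy_rejected_lp - ref_rejected_lp)
    # Classification-style loss: push chosen above rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: the policy slightly prefers chosen relative to the reference
loss = dpo_loss(
    policy_chosen_lp=torch.tensor([-10.0]),
    policy_rejected_lp=torch.tensor([-14.0]),
    ref_chosen_lp=torch.tensor([-11.0]),
    ref_rejected_lp=torch.tensor([-13.0]),
)
```

Higher beta pins the policy more tightly to the reference; lower beta lets it move further to satisfy the preferences. In practice you rarely write this by hand, because trl packages it: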
from datasets import Dataset
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load SFT model (already fine-tuned on instructions)
model = AutoModelForCausalLM.from_pretrained("./sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("./sft-model")  # Frozen reference
tokenizer = AutoTokenizer.from_pretrained("./sft-model")

# DPO preference dataset format:
# Each example has: prompt, chosen (preferred response), rejected (worse response)
dpo_dataset = Dataset.from_list([
    {
        "prompt": "Explain quantum computing simply.",
        "chosen": "Quantum computers use qubits that can be 0, 1, or both at once...",
        "rejected": "Quantum computing is a complex field involving Hilbert spaces...",
    },
    # ... thousands of preference pairs
])

training_args = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,  # Very low LR for alignment
    beta=0.1,  # KL penalty strength
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Reference model for KL divergence
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
trainer.train()

RLHF vs DPO: When to Use Which
RLHF is more flexible and can optimize complex reward signals (including multi-objective rewards), but it is notoriously unstable and requires careful tuning of PPO hyperparameters. DPO is simpler, more stable, and works well for standard alignment. Use RLHF when you have a custom reward model or need fine-grained control over the reward signal. Use DPO when you have preference pairs and want straightforward alignment. Most open-source models now use DPO or its variants (IPO, KTO).
The full pipeline in practice: Pretrain (1-3 months, millions of dollars) → SFT (days, instruction data) → DPO/RLHF (days, preference data) → Safety evaluation → Red-teaming → Deployment. Alignment is a continuous process, not a one-time step.