Evaluating LLMs: Benchmarks, Metrics & Real-World Testing
Learn how to properly evaluate language models — MMLU, HumanEval, perplexity, LLM-as-judge, and why benchmarks alone don't tell the full story.
Why Evaluation Is Hard
LLM evaluation is fundamentally challenging because language generation is open-ended. There's no single 'correct' answer to most prompts. A model can be brilliant at coding but terrible at math, or great at English but poor at Japanese. Proper evaluation requires multiple benchmarks across different capability dimensions, plus domain-specific testing for your actual use case.
Key Benchmarks
MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects from elementary to professional level. HumanEval tests code generation (pass@1 rate). GSM8K tests grade-school math reasoning. HellaSwag tests commonsense reasoning. TruthfulQA tests resistance to common misconceptions. MT-Bench evaluates multi-turn conversation quality. No single benchmark captures overall capability — you need a suite.
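HumanEval's pass@1 (and pass@k generally) is usually computed with the unbiased estimator from the original HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k random draws is correct. A minimal sketch (the function name is ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated, c passed the tests.

    P(at least one of k randomly chosen samples is correct)
      = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must succeed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# 5 correct out of 10 samples: pass@1 is exactly 0.5
print(pass_at_k(10, 5, 1))  # → 0.5
```

Averaging this estimate over all problems in the benchmark gives the headline pass@k score.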
Perplexity: The Fundamental Metric
Perplexity measures how 'surprised' a model is by the test text. Lower perplexity = better prediction of the next token. It's the most fundamental language model metric, but it doesn't directly measure usefulness. A model with low perplexity can still give bad advice. Use perplexity for comparing model architectures, not for evaluating assistants.
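Concretely, perplexity is just the exponential of the mean per-token negative log-likelihood. A toy example (pure Python, no model needed) illustrates the identity before the real implementation:

```python
import math

def perplexity_from_nlls(nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# A model that assigns every token probability 1/4 has NLL ln(4) per token,
# so its perplexity is 4: it is "as surprised" as a uniform 4-way guess.
print(perplexity_from_nlls([math.log(4)] * 10))  # ≈ 4.0
```

The snippet below computes the same quantity from a real model's cross-entropy loss.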
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, text: str) -> float:
    """Compute perplexity of a text under the model."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Loss is the average negative log-likelihood per token
    perplexity = torch.exp(outputs.loss).item()
    return perplexity

# Lower = better (the model predicts the text more accurately)
# GPT-2 on WikiText-103: ~29.4 perplexity
# LLaMA-7B: ~7.5 perplexity
# LLaMA-70B: ~4.2 perplexity

LLM-as-Judge Evaluation
For open-ended generation, use a stronger LLM to judge the quality of a weaker model's outputs. This correlates well with human evaluation at a fraction of the cost. The judge rates responses on criteria like helpfulness, accuracy, and safety. GPT-4-class models make reliable judges when given clear rubrics.
import json

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, response_a: str, response_b: str) -> dict:
    """Use GPT-4 to compare two model responses."""
    judgment = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "system",
            "content": """You are an impartial judge. Compare two AI responses to the same question.
Evaluate on: accuracy, helpfulness, clarity, and completeness.
Return JSON: {"winner": "A" or "B" or "tie", "reasoning": "...", "scores": {"A": 1-10, "B": 1-10}}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
        }]
    )
    return json.loads(judgment.choices[0].message.content)
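LLM judges are known to exhibit position bias: a preference for whichever response appears first. A common mitigation, popularized by MT-Bench, is to judge each pair twice with the order swapped and count a win only when both orderings agree. A sketch that wraps any pairwise judge (names are ours; a function like llm_judge above would fit the judge parameter):

```python
from typing import Callable

def debiased_winner(judge: Callable[[str, str, str], dict],
                    question: str, resp_a: str, resp_b: str) -> str:
    """Judge twice with the response order swapped; require agreement to win.

    `judge` returns a dict with a "winner" key ("A", "B", or "tie").
    """
    first = judge(question, resp_a, resp_b)["winner"]
    second = judge(question, resp_b, resp_a)["winner"]
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]  # map back to original labels
    return first if first == flipped else "tie"

# A judge that always picks the first response looks decisive in one pass,
# but the swapped pass exposes it as pure position bias:
always_first = lambda q, a, b: {"winner": "A"}
print(debiased_winner(always_first, "q?", "resp 1", "resp 2"))  # → tie
```

This doubles the judging cost but substantially improves agreement with human raters on close comparisons.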
# Run evaluation suite
def evaluate_model(model_name: str, test_cases: list[dict]) -> dict:
    """Evaluate a model across multiple test cases."""
    scores = {"accuracy": [], "helpfulness": [], "format": []}
    for case in test_cases:
        response = generate(model_name, case["prompt"])  # your own generation helper
        # Automated checks
        if "expected_contains" in case:
            scores["accuracy"].append(
                any(kw in response for kw in case["expected_contains"])
            )
        # LLM judge for quality (the reference answer stands in as response B)
        judgment = llm_judge(case["prompt"], response, case.get("reference", ""))
        scores["helpfulness"].append(judgment["scores"]["A"])
    # Empty score lists (e.g. unused "format" checks) are dropped
    return {k: sum(v)/len(v) for k, v in scores.items() if v}

Test on Your Own Data

The most important evaluation is on YOUR data, for YOUR use case. Public benchmarks tell you about general capability, but a model that scores 90% on MMLU might score 60% on your specific domain. Always build a custom eval set with 50-200 examples representative of your actual workload.
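As a starting point, a custom eval set can be a plain list of dicts matching the schema evaluate_model expects; the prompts, keywords, and helper below are purely illustrative:

```python
# A minimal custom eval set: each case pairs a prompt from your domain with
# cheap automated checks. Field names mirror the evaluate_model example.
test_cases = [
    {
        "prompt": "What is our refund window for annual plans?",
        "expected_contains": ["30 days", "pro-rated"],
        "reference": "Annual plans can be refunded pro-rated within 30 days.",
    },
    {
        "prompt": "Summarize ticket #4521 in two sentences.",
        "expected_contains": ["login", "reset"],
    },
]

def keyword_accuracy(response: str, case: dict) -> bool:
    """True if the response mentions any expected keyword."""
    return any(kw in response for kw in case.get("expected_contains", []))
```

Keyword checks are crude but fast and deterministic; pair them with the LLM judge for quality scoring, and version the eval file alongside your prompts so regressions are caught as the system evolves.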