Machine Learning & AI · Intermediate

Evaluating LLMs: Benchmarks, Metrics & Real-World Testing

Learn how to properly evaluate language models — MMLU, HumanEval, perplexity, LLM-as-judge, and why benchmarks alone don't tell the full story.

16 min read
March 10, 2026
Evaluation · Benchmarks · MMLU · Metrics · LLM · Python

Why Evaluation Is Hard

LLM evaluation is fundamentally challenging because language generation is open-ended. There's no single 'correct' answer to most prompts. A model can be brilliant at coding but terrible at math, or great at English but poor at Japanese. Proper evaluation requires multiple benchmarks across different capability dimensions, plus domain-specific testing for your actual use case.

Key Benchmarks

- **MMLU** (Massive Multitask Language Understanding): knowledge across 57 subjects, from elementary to professional level.
- **HumanEval**: code generation, scored by pass@1 rate.
- **GSM8K**: grade-school math reasoning.
- **HellaSwag**: commonsense reasoning.
- **TruthfulQA**: resistance to common misconceptions.
- **MT-Bench**: multi-turn conversation quality.

No single benchmark captures overall capability; you need a suite.
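As a concrete example of how these scores are computed, HumanEval's pass@k is usually estimated with the unbiased formula from the Codex paper: generate n samples per problem, count the c that pass the unit tests, then compute 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn (without replacement) from n generations passes,
    given that c of the n generations pass the unit tests."""
    if n - c < k:
        return 1.0  # every possible draw of k contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 of which pass the tests
print(pass_at_k(n=10, c=3, k=1))  # 0.3 -- this is pass@1
print(pass_at_k(n=10, c=3, k=5))  # higher: 5 tries per problem
```

Averaging `pass_at_k` over all problems gives the reported benchmark score; the estimator matters because naively taking k samples and checking them directly has much higher variance.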

Perplexity: The Fundamental Metric

Perplexity measures how 'surprised' a model is by the test text. Lower perplexity = better prediction of the next token. It's the most fundamental language model metric, but it doesn't directly measure usefulness. A model with low perplexity can still give bad advice. Use perplexity for comparing model architectures, not for evaluating assistants.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, text: str) -> float:
    """Compute perplexity of a text under the model."""
    # Note: assumes the text fits in the model's context window;
    # longer texts need a sliding-window evaluation.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])

    # Loss is the average negative log-likelihood per token
    perplexity = torch.exp(outputs.loss).item()
    return perplexity

# Lower = better (the model predicts the text more accurately)
# GPT-2 on WikiText-103: ~29.4 perplexity
# LLaMA-7B: ~7.5 perplexity
# LLaMA-70B: ~4.2 perplexity
```
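To make the definition concrete: perplexity is just the exponential of the average negative log-likelihood per token, which you can compute by hand from per-token log probabilities (the numbers below are illustrative, not from a real model):

```python
import math

def perplexity_from_logprobs(logprobs: list[float]) -> float:
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

# A model that assigns every token probability 0.25 has perplexity 4:
# it is effectively "choosing" among 4 equally likely tokens each step.
uniform = [math.log(0.25)] * 8
print(perplexity_from_logprobs(uniform))  # ≈ 4.0
```

This is why perplexity is often described as the effective branching factor: a perplexity of 4 means the model is, on average, as uncertain as if it were picking uniformly among 4 tokens.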

LLM-as-Judge Evaluation

For open-ended generation, use a stronger LLM to judge the quality of a weaker model's outputs. This correlates well with human evaluation at a fraction of the cost. The judge rates responses on criteria like helpfulness, accuracy, and safety. GPT-4-class models make reliable judges when given clear rubrics.

```python
import json

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, response_a: str, response_b: str) -> dict:
    """Use GPT-4 to compare two model responses."""
    judgment = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "system",
            "content": """You are an impartial judge. Compare two AI responses to the same question.
Evaluate on: accuracy, helpfulness, clarity, and completeness.
Return JSON: {"winner": "A" or "B" or "tie", "reasoning": "...", "scores": {"A": 1-10, "B": 1-10}}"""
        }, {
            "role": "user",
            "content": f"Question: {question}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
        }]
    )
    return json.loads(judgment.choices[0].message.content)
```


```python
# Run evaluation suite
def evaluate_model(model_name: str, test_cases: list[dict]) -> dict:
    """Evaluate a model across multiple test cases."""
    scores = {"accuracy": [], "helpfulness": [], "format": []}

    for case in test_cases:
        # `generate` is a placeholder for your own model-call wrapper
        response = generate(model_name, case["prompt"])

        # Automated check: does the response contain an expected keyword?
        if "expected_contains" in case:
            scores["accuracy"].append(
                any(kw in response for kw in case["expected_contains"])
            )

        # LLM judge: compare the response (A) against a reference answer (B)
        judgment = llm_judge(case["prompt"], response, case.get("reference", ""))
        scores["helpfulness"].append(judgment["scores"]["A"])

    # Average each criterion, skipping any with no recorded scores
    return {k: sum(v) / len(v) for k, v in scores.items() if v}
```
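One caveat with pairwise judging: LLM judges tend to favor whichever response appears first (position bias). A common mitigation, used by MT-Bench among others, is to judge each pair twice with the order swapped and only count a win when both orderings agree. A sketch of just the aggregation logic, assuming the judge returns "A", "B", or "tie" as in the rubric above:

```python
def aggregate_swapped(winner_ab: str, winner_ba: str) -> str:
    """Combine two verdicts on the same pair, judged in both orders.

    winner_ab: verdict when shown as (A, B).
    winner_ba: verdict when shown as (B, A) -- so "A" in this call
               actually refers to the original response B.
    """
    # Map the swapped verdict back to the original labels
    unswap = {"A": "B", "B": "A", "tie": "tie"}
    if winner_ab == unswap[winner_ba]:
        return winner_ab  # consistent across both orders
    return "tie"          # disagreement: likely position bias

print(aggregate_swapped("A", "B"))  # "A" -- A won in both orders
print(aggregate_swapped("A", "A"))  # "tie" -- the first slot won both times
```

Only counting consistent wins trades some statistical power for a judge that cannot be gamed by response ordering.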

The most important evaluation is on YOUR data, for YOUR use case. Public benchmarks tell you about general capability, but a model that scores 90% on MMLU might score 60% on your specific domain. Always build a custom eval set with 50-200 examples representative of your actual workload.
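One lightweight way to maintain such a set is a JSONL file of prompts with expected keywords, checked on every model or prompt change against a pass-rate threshold. A minimal sketch (the file layout and field names here are illustrative, not a standard):

```python
import json

def load_eval_set(path: str) -> list[dict]:
    """Each line: {"prompt": ..., "expected_contains": [...], "category": ...}"""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def pass_rate(cases: list[dict], responses: list[str]) -> float:
    """Fraction of cases whose response contains any expected keyword."""
    hits = sum(
        any(kw in resp for kw in case["expected_contains"])
        for case, resp in zip(cases, responses)
    )
    return hits / len(cases)

# Gate deployments on a minimum pass rate, e.g.:
# assert pass_rate(cases, responses) >= 0.85, "regression on custom eval set"
```

Keyword matching is crude, but it catches regressions cheaply; reserve the LLM-as-judge pass for the cases that keyword checks cannot score.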