Machine Learning & AI · Intermediate

Large Language Models: How GPT, Claude & LLaMA Actually Work

Demystify LLMs — tokenization, pretraining objectives, scaling laws, emergent abilities, and the engineering behind training models with hundreds of billions of parameters.

24 min read
March 19, 2026
LLM · GPT · Pretraining · Scaling Laws · Tokenization

What Makes a Language Model 'Large'?

A Large Language Model (LLM) is a transformer-based neural network trained on massive text corpora to predict the next token. 'Large' refers to parameter count — GPT-3 has 175B parameters, LLaMA 3 goes up to 405B, and frontier models are even larger. But size alone isn't the story: training data quality, training compute (FLOPs), and architectural choices matter just as much.
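As a rough sanity check on where those parameter counts come from: for a standard decoder-only transformer, the non-embedding parameters are approximately 12 · n_layers · d_model² (attention plus a 4x-wide MLP per layer). A back-of-the-envelope sketch, plugging in GPT-3's published shape (96 layers, d_model = 12288):

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab_size: int) -> float:
    """Rough parameter count for a decoder-only transformer.

    Per layer: ~4*d^2 for attention (Q, K, V, output projections)
    plus ~8*d^2 for a 4x-wide MLP (up + down projections) = 12*d^2.
    Embeddings add vocab_size * d_model (assuming tied input/output embeddings).
    """
    return 12 * n_layers * d_model**2 + vocab_size * d_model

# GPT-3's published configuration: 96 layers, d_model = 12288, ~50K vocabulary
params = approx_transformer_params(96, 12288, 50257)
print(f"~{params / 1e9:.0f}B parameters")  # lands close to the published 175B
```

The estimate ignores biases, layer norms, and positional embeddings, which contribute well under 1% of the total at this scale.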

Tokenization: How Text Becomes Numbers

LLMs don't see characters or words — they see tokens. Tokenizers like BPE (Byte Pair Encoding), used by GPT, or SentencePiece, used by LLaMA, split text into subword units. Common words become single tokens; rare words are split into pieces. A typical vocabulary is 32K-128K tokens. Tokenization affects everything: context length, multilingual performance, and even math ability.

```python
# BPE tokenization example using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

text = "Transformers revolutionized natural language processing."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded tokens: {[enc.decode([t]) for t in tokens]}")

# Output:
# Text: Transformers revolutionized natural language processing.
# Tokens: [8963, 388, 14110, 1534, 5933, 4221, 8773, 13]
# Token count: 8
# Decoded tokens: ['Transform', 'ers', ' revolution', 'ized', ' natural', ' language', ' processing', '.']
```

Tokenization has quirks: the same characters map to different token sequences depending on spacing and capitalization ('tokenization' and 'token ization' tokenize differently), and numbers are often split into arbitrary multi-digit chunks. This is why LLMs struggle with character-level tasks and arithmetic — they literally don't see individual characters.
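To see the merge mechanics concretely, here is a toy BPE encoder with a hypothetical four-entry merge table (real tokenizers learn tens of thousands of merges from corpus statistics):

```python
def bpe_encode(word: str, merges: dict) -> list:
    """Greedily apply BPE merges: repeatedly merge the adjacent pair
    with the lowest (earliest-learned) rank until none applies."""
    tokens = list(word)
    while len(tokens) > 1:
        # Rank every adjacent pair; pairs not in the merge table get infinite rank
        ranked = [(merges.get((a, b), float("inf")), i)
                  for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        best_rank, i = min(ranked)
        if best_rank == float("inf"):
            break  # no learned merge applies anymore
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens

# Hypothetical merge table (rank = order in which the merge was learned)
merges = {("t", "o"): 0, ("to", "k"): 1, ("tok", "e"): 2, ("toke", "n"): 3}
print(bpe_encode("token", merges))   # ['token']
print(bpe_encode("tokens", merges))  # ['token', 's']
print(bpe_encode("kitten", merges))  # no merge applies: stays character-split
```

A word the merge table covers collapses to one token; a near-miss like "tokens" keeps a leftover piece, which is exactly how rare words end up as multiple subwords.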

Pretraining: Next-Token Prediction at Scale

LLMs are pretrained on a simple objective: predict the next token given all previous tokens. The model sees trillions of tokens from books, websites, code, and conversations. Through gradient descent over this massive dataset, the model learns grammar, facts, reasoning patterns, and even code generation. The loss function is cross-entropy between predicted and actual next tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: self-attention + feed-forward, with residuals."""

    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # nn.MultiheadAttention treats True as "blocked", so invert the 1/0 mask
        attn_mask = mask == 0
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

class CausalLM(nn.Module):
    """Simplified causal language model (GPT-style)."""

    def __init__(self, vocab_size: int, embed_dim: int, num_layers: int, num_heads: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(8192, embed_dim)  # Learned positions
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, embed_dim * 4)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        batch, seq_len = input_ids.shape
        positions = torch.arange(seq_len, device=input_ids.device)

        x = self.token_embed(input_ids) + self.pos_embed(positions)

        # Causal mask: each token can only attend to previous tokens
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))

        for block in self.blocks:
            x = block(x, mask)

        x = self.norm(x)
        logits = self.lm_head(x)  # (batch, seq_len, vocab_size)
        return logits

# Training loop sketch: targets are the input ids shifted left by one position
# loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
#                        input_ids[:, 1:].reshape(-1))
```
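The next-token loss can be made concrete with a small runnable sketch — random logits stand in for a real model's output, and the key detail is the off-by-one shift between predictions and targets:

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 100
batch, seq_len = 2, 6
input_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model(input_ids)

# The target at position t is the token at position t+1, so drop the
# last logit and the first token before computing cross-entropy.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_targets = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_targets)

# A model that knows nothing scores near ln(vocab_size) nats per token
print(f"loss: {loss.item():.2f}, ln(V): {math.log(vocab_size):.2f}")
```

Watching this quantity fall below the ln(V) baseline is the first sign a pretraining run is learning anything at all.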

Scaling Laws: Chinchilla and Beyond

The Chinchilla paper (2022) showed that model size and training data should scale together. For a compute-optimal model, tokens should be ~20x the parameter count. GPT-3 (175B params) was undertrained by this standard. LLaMA demonstrated that smaller models trained on more data can outperform larger undertrained models — LLaMA-13B beat GPT-3 on many benchmarks with 13x fewer parameters.

Key scaling insight: doubling parameters gives diminishing returns without proportionally more data. Modern training runs use 10-15 trillion tokens. Data quality, deduplication, and curriculum matter more than raw dataset size.
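The Chinchilla rule of thumb can be sketched numerically, using the common approximation that training costs about 6 FLOPs per parameter per token:

```python
def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens: roughly 20 tokens per parameter."""
    return 20 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters, actually trained on roughly 300B tokens
n = 175e9
optimal = chinchilla_tokens(n)
print(f"Compute-optimal tokens: {optimal:.2e}")                 # ~3.5 trillion
print(f"Actual GPT-3 tokens:    {3e11:.2e}")                    # ~10x fewer
print(f"FLOPs at the optimum:   {training_flops(n, optimal):.2e}")
```

By this arithmetic GPT-3 saw roughly a tenth of its compute-optimal token budget, which is the sense in which it was undertrained.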

Emergent Abilities

Certain abilities appear only at sufficient scale — they're absent in small models, then suddenly emerge. Chain-of-thought reasoning, in-context learning (few-shot), code generation, and multilingual transfer all exhibit this pattern. The exact threshold varies by task, but generally models below 7B parameters lack many emergent capabilities. This is why model scale matters and why the field has pushed toward ever-larger models.
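In-context learning is worth pausing on, since it requires no weight updates: the "training examples" live entirely in the prompt. A minimal few-shot prompt builder — the format here is illustrative, not any particular model's required template:

```python
def build_few_shot_prompt(examples: list, query: str) -> str:
    """Assemble a few-shot classification prompt from (input, label) pairs."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    # The query follows the same pattern, with the label left for the model
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

examples = [
    ("The movie was fantastic!", "positive"),
    ("Terrible plot, wooden acting.", "negative"),
]
prompt = build_few_shot_prompt(examples, "A delightful surprise.")
print(prompt)
```

A sufficiently large model continues the pattern and emits a label; small models typically fail to pick up the pattern at all, which is the emergence point above.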

The LLM Training Pipeline

Modern LLM training has three phases: (1) Pretraining — next-token prediction on trillions of tokens, creating a base model. (2) Supervised Fine-Tuning (SFT) — training on curated instruction-response pairs. (3) RLHF/DPO — aligning the model with human preferences using reinforcement learning or direct preference optimization. Each phase serves a different purpose: knowledge acquisition, instruction following, and safety/alignment.
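To make the preference phase less abstract, here is a minimal sketch of the DPO loss. It assumes per-sequence log-probabilities have already been computed for the policy and a frozen reference model; beta is the preference temperature:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: push the policy's margin between chosen and rejected
    responses above the frozen reference model's margin."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# When the policy equals the reference, the margin is zero and loss = ln 2
zeros = torch.zeros(4)
print(dpo_loss(zeros, zeros, zeros, zeros))  # tensor(0.6931)
```

Unlike RLHF, this needs no separate reward model or RL loop — the preference signal is baked directly into a supervised-style objective, which is why DPO has become a popular alternative for the alignment phase.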