Tokenization: BPE, SentencePiece & Why It Matters
Deep dive into how LLMs break text into tokens — BPE, WordPiece, SentencePiece algorithms. Understand why tokenization affects model performance, cost, and multilingual ability.
Why Tokenization Is Critical
Tokenization is the first and last step of every LLM interaction. It determines what the model 'sees', how much context fits in the window, how much you pay per API call, and why models struggle with certain tasks. A poor tokenizer can make a model 3x more expensive and significantly worse at non-English languages. Yet it's often overlooked.
Byte Pair Encoding (BPE)
BPE starts with individual characters and iteratively merges the most frequent pair of adjacent tokens. After thousands of merges, common words become single tokens while rare words are split into subwords. This creates a vocabulary that balances coverage (any text can be encoded) with efficiency (common words are compact).
def train_bpe(corpus: str, vocab_size: int) -> dict[tuple, str]:
    """Simplified BPE training algorithm."""
    # Start with character-level tokens
    words = corpus.split()
    token_freqs = {}
    for word in words:
        chars = tuple(word) + ("</w>",)  # End-of-word marker
        token_freqs[chars] = token_freqs.get(chars, 0) + 1

    merges = {}
    current_vocab_size = len(set(c for word in token_freqs for c in word))
    while current_vocab_size < vocab_size:
        # Count all adjacent pairs
        pair_counts = {}
        for word, freq in token_freqs.items():
            for i in range(len(word) - 1):
                pair = (word[i], word[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + freq
        if not pair_counts:
            break

        # Merge the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        merged_token = best_pair[0] + best_pair[1]
        merges[best_pair] = merged_token

        # Apply merge to all words
        new_freqs = {}
        for word, freq in token_freqs.items():
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best_pair:
                    new_word.append(merged_token)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_freqs[tuple(new_word)] = freq
        token_freqs = new_freqs
        current_vocab_size += 1
    return merges
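Training produces an ordered merge table; encoding a new word then replays those merges in the order they were learned. A minimal sketch of that step (the `bpe_encode` helper is hypothetical, not part of the training code above, and relies on Python 3.7+ dicts preserving insertion order):

```python
def bpe_encode(word: str, merges: dict[tuple, str]) -> list[str]:
    """Apply learned merges to a new word, in training order."""
    tokens = list(word) + ["</w>"]
    # Iterating the merges dict replays merges in the order learned
    for pair, merged in merges.items():
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                # Collapse the pair in place; stay at i in case the
                # new token forms the same pair with its neighbor
                tokens[i:i + 2] = [merged]
            else:
                i += 1
    return tokens

# With merges {('l','o'): 'lo', ('lo','w'): 'low'}:
# "low"   -> ['low', '</w>']
# "lower" -> ['low', 'e', 'r', '</w>']
```

Note that a word never seen during training still encodes cleanly: any characters left unmerged simply remain as single-character tokens, which is the coverage guarantee BPE provides.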
# Example: "low" appears 5x, "lower" 2x, "newest" 6x
# BPE might learn merges: ('n','e') -> 'ne', ('ne','w') -> 'new', etc.

Vocabulary Size Trade-offs
GPT-4's tokenizer (cl100k_base) has a vocabulary of roughly 100K tokens; LLaMA 2 uses 32K. A larger vocabulary means more words map to a single token (efficient encoding, shorter sequences) but requires a larger embedding matrix (more parameters). A smaller vocabulary means more subword splitting (longer sequences, more compute) but a smaller model. The sweet spot for most applications is 32K-100K.
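The embedding-matrix cost is easy to make concrete. A quick sketch (the hidden dimension 4096 is an illustrative assumption, not a published model spec):

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the token embedding matrix.

    If the output head is untied from the input embedding,
    the cost is paid twice.
    """
    n = vocab_size * d_model
    return n if tied else 2 * n

# Illustrative: d_model=4096 is an assumed hidden size
print(embedding_params(32_000, 4096))    # 131,072,000 (~131M params)
print(embedding_params(100_000, 4096))   # 409,600,000 (~410M params)
```

At these dimensions, tripling the vocabulary adds hundreds of millions of parameters, which matters far more for a small model than for a large one; this is one reason smaller models tend toward smaller vocabularies.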
Tokenization gotchas: leading whitespace matters, so 'tokenization' and ' tokenization' (with a space) encode to different token sequences. Numbers like '123456' are split into chunks such as '123' + '456'. Non-English text often uses 2-5x more tokens than English for the same meaning. Always count tokens (not characters) for context length planning.
Analyzing Tokenization in Practice
import tiktoken
# Compare tokenizers
def analyze_tokenization(text: str):
    encodings = {
        "cl100k (GPT-4)": tiktoken.get_encoding("cl100k_base"),
        "o200k (GPT-4o)": tiktoken.get_encoding("o200k_base"),
    }
    for name, enc in encodings.items():
        tokens = enc.encode(text)
        decoded = [enc.decode([t]) for t in tokens]
        print(f"\n{name}: {len(tokens)} tokens")
        print(f"Tokens: {decoded[:20]}{'...' if len(decoded) > 20 else ''}")
# English is efficient
analyze_tokenization("The quick brown fox jumps over the lazy dog.")
# cl100k: 10 tokens | o200k: 10 tokens
# Code is moderately efficient
analyze_tokenization("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)")
# cl100k: 25 tokens | o200k: 24 tokens
# Non-English can be expensive
analyze_tokenization("transformerは自然言語処理を革命的に変えました")
# cl100k: 15 tokens | o200k: 11 tokens (improved!)When building cost-sensitive applications, measure token counts on your actual data. A customer support bot handling Japanese queries will cost significantly more than one handling English, purely due to tokenization differences.