Tokenization: BPE, SentencePiece & Why It Matters
Deep dive into how LLMs break text into tokens — BPE, WordPiece, SentencePiece algorithms. Understand why tokenization affects model performance, cost, and multilingual ability.
Why Tokenization Is Critical
Tokenization is the first and last step of every LLM interaction. It determines what the model 'sees', how much context fits in the window, how much you pay per API call, and why models struggle with certain tasks. A poor tokenizer can make a model 3x more expensive and significantly worse at non-English languages. Yet it's often overlooked.
Byte Pair Encoding (BPE)
BPE starts with individual characters and iteratively merges the most frequent pair of adjacent tokens. After thousands of merges, common words become single tokens while rare words are split into subwords. This creates a vocabulary that balances coverage (any text can be encoded) with efficiency (common words are compact).
def train_bpe(corpus: str, vocab_size: int) -> dict[tuple, str]:
    """Simplified BPE training algorithm."""
    # Start with character-level tokens
    words = corpus.split()
    token_freqs = {}
    for word in words:
        chars = tuple(word) + ("</w>",)  # End-of-word marker
        token_freqs[chars] = token_freqs.get(chars, 0) + 1

    merges = {}
    current_vocab_size = len(set(c for word in token_freqs for c in word))
    while current_vocab_size < vocab_size:
        # Count all adjacent pairs
        pair_counts = {}
        for word, freq in token_freqs.items():
            for i in range(len(word) - 1):
                pair = (word[i], word[i + 1])
                pair_counts[pair] = pair_counts.get(pair, 0) + freq
        if not pair_counts:
            break

        # Merge the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        merged_token = best_pair[0] + best_pair[1]
        merges[best_pair] = merged_token

        # Apply merge to all words
        new_freqs = {}
        for word, freq in token_freqs.items():
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best_pair:
                    new_word.append(merged_token)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_freqs[tuple(new_word)] = freq
        token_freqs = new_freqs
        current_vocab_size += 1
    return merges
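Training produces an ordered merge table; encoding a new word then replays those merges in the order they were learned. A minimal sketch of that step (the `bpe_encode` helper is hypothetical, not part of the training code above, and relies on Python 3.7+ dicts preserving insertion order):

```python
def bpe_encode(word: str, merges: dict[tuple, str]) -> list[str]:
    """Apply learned merges to a new word, in training order."""
    tokens = list(word) + ["</w>"]
    # Iterating the merges dict replays merges in the order learned
    for pair, merged in merges.items():
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                # Collapse the pair in place; stay at i in case the
                # new token forms the same pair with its neighbor
                tokens[i:i + 2] = [merged]
            else:
                i += 1
    return tokens

# With merges {('l','o'): 'lo', ('lo','w'): 'low'}:
# "low"   -> ['low', '</w>']
# "lower" -> ['low', 'e', 'r', '</w>']
```

Note that a word never seen during training still encodes cleanly: any characters left unmerged simply remain as single-character tokens, which is the coverage guarantee BPE provides.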
# Example: "low" appears 5x, "lower" 2x, "newest" 6x
# BPE might learn merges: ('n','e') -> 'ne', ('ne','w') -> 'new', etc.

Vocabulary Size Trade-offs
GPT-4's tokenizer (cl100k_base) has a vocabulary of roughly 100K tokens; LLaMA 2 uses 32K. A larger vocabulary means more words map to a single token (efficient encoding, shorter sequences) but requires a larger embedding matrix (more parameters). A smaller vocabulary means more subword splitting (longer sequences, more compute) but a smaller model. The sweet spot for most applications is 32K-100K.
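The embedding-matrix cost is easy to make concrete. A quick sketch (the hidden dimension 4096 is an illustrative assumption, not a published model spec):

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the token embedding matrix.

    If the output head is untied from the input embedding,
    the cost is paid twice.
    """
    n = vocab_size * d_model
    return n if tied else 2 * n

# Illustrative: d_model=4096 is an assumed hidden size
print(embedding_params(32_000, 4096))    # 131,072,000 (~131M params)
print(embedding_params(100_000, 4096))   # 409,600,000 (~410M params)
```

At these dimensions, tripling the vocabulary adds hundreds of millions of parameters, which matters far more for a small model than for a large one; this is one reason smaller models tend toward smaller vocabularies.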
Tokenization gotchas: leading whitespace matters, so 'tokenization' and ' tokenization' (with a space) encode to different token sequences. Numbers like '123456' are split into chunks such as '123' + '456'. Non-English text often uses 2-5x more tokens than English for the same meaning. Always count tokens (not characters) for context length planning.
Analyzing Tokenization in Practice
import tiktoken
# Compare tokenizers
def analyze_tokenization(text: str):
    encodings = {
        "cl100k (GPT-4)": tiktoken.get_encoding("cl100k_base"),
        "o200k (GPT-4o)": tiktoken.get_encoding("o200k_base"),
    }
    for name, enc in encodings.items():
        tokens = enc.encode(text)
        decoded = [enc.decode([t]) for t in tokens]
        print(f"\n{name}: {len(tokens)} tokens")
        print(f"Tokens: {decoded[:20]}{'...' if len(decoded) > 20 else ''}")
# English is efficient
analyze_tokenization("The quick brown fox jumps over the lazy dog.")
# cl100k: 10 tokens | o200k: 10 tokens
# Code is moderately efficient
analyze_tokenization("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)")
# cl100k: 25 tokens | o200k: 24 tokens
# Non-English can be expensive
analyze_tokenization("transformerは自然言語処理を革命的に変えました")
# cl100k: 15 tokens | o200k: 11 tokens (improved!)When building cost-sensitive applications, measure token counts on your actual data. A customer support bot handling Japanese queries will cost significantly more than one handling English, purely due to tokenization differences.