Pretraining an LLM from Scratch: Data, Compute & Architecture
A practical guide to pretraining your own language model — data collection and cleaning, training infrastructure, distributed training with FSDP, and the engineering challenges at scale.
Should You Pretrain?
Pretraining is expensive: a 7B model costs $100K-500K in compute. Before pretraining, consider: (1) Can you fine-tune an existing model? (2) Is there a domain-specific model already available? (3) Do you have enough unique data (100B+ tokens)? Pretraining makes sense when you need a model for a language or domain with limited existing coverage, or when you need full control over the training data for compliance.
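Point (3) can be sanity-checked against the Chinchilla scaling rule of thumb of roughly 20 training tokens per parameter. A quick sketch (the 20× constant is a heuristic, not a hard requirement, and modern models are often trained well past it):

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Chinchilla rule of thumb: compute-optimal token count ~= 20x parameters."""
    return 20 * params

# A 7B model is compute-optimal at roughly 140B tokens; training far past
# this point ("overtraining") still improves quality per parameter at inference.
print(f"{chinchilla_optimal_tokens(7e9) / 1e9:.0f}B tokens")  # → 140B tokens
```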
Data Pipeline: The Foundation
Data quality determines model quality more than anything else. The pipeline: (1) Collect raw data (Common Crawl, books, code, curated sources). (2) Deduplicate (MinHash, exact dedup). (3) Filter (language detection, quality scoring, toxic content removal). (4) Tokenize and shuffle. A good data pipeline takes months to build.
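Stage (2) is worth illustrating concretely. Below is a toy MinHash sketch using only the standard library; `shingles`, `minhash_signature`, and `est_jaccard` are illustrative names, and a production pipeline would add LSH banding so it never has to compare all document pairs:

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate documents share most shingles, so their signatures agree in most slots; documents with similarity above a threshold (e.g. 0.8) get collapsed to one copy.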
# Training configuration for a ~1.4B-parameter model
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_size: str = "1.4B"
    vocab_size: int = 32000
    context_length: int = 2048
    num_layers: int = 24
    hidden_dim: int = 2048
    num_heads: int = 16
    num_kv_heads: int = 4  # grouped-query attention (GQA)
    ff_dim: int = 5632
    learning_rate: float = 3e-4
    min_lr: float = 3e-5
    warmup_steps: int = 2000
    total_tokens: int = 30_000_000_000  # 30B tokens
    batch_size_tokens: int = 524_288  # ~512K tokens per batch (256 seqs × 2048)
    weight_decay: float = 0.1
    grad_clip: float = 1.0
# Data pipeline sketch
def prepare_data(raw_dir: str, output_dir: str):
    """Multi-stage data preparation pipeline."""
    # Stage 1: Language detection + quality filtering
    docs = load_raw_documents(raw_dir)
    docs = [d for d in docs if detect_language(d) == "en"]
    docs = [d for d in docs if quality_score(d) > 0.7]
    # Stage 2: Deduplication (MinHash LSH)
    docs = minhash_dedup(docs, threshold=0.8)
    # Stage 3: Content filtering
    docs = [d for d in docs if not is_toxic(d)]
    docs = [d for d in docs if not contains_pii(d)]
    # Stage 4: Tokenize and pack into sequences
    tokenizer = train_bpe_tokenizer(docs, vocab_size=32000)
    tokens = tokenizer.encode_batch(docs)
    packed = pack_sequences(tokens, max_length=2048)
    # Stage 5: Shuffle and save as memory-mapped arrays
    save_mmap(shuffle(packed), output_dir)

Distributed Training with FSDP
Models larger than a single GPU's memory require distributed training. Fully Sharded Data Parallel (FSDP) shards model parameters, gradients, and optimizer states across GPUs, so each GPU holds only a fraction of the model. During the forward and backward passes, parameters are gathered just-in-time and resharded after use. This makes it possible to train a 7B model on 8× A100 GPUs.
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def setup_training(model, rank, world_size):
    """Set up FSDP distributed training."""
    torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Wrap each transformer block as its own FSDP unit
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={TransformerBlock},  # your transformer block class
    )
    model = FSDP(
        model.cuda(rank),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optim state
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        auto_wrap_policy=wrap_policy,
        device_id=rank,
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1,
    )
    return model, optimizer
# Training loop
import torch.nn.functional as F

def train_step(model, batch, optimizer, grad_clip=1.0):
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    outputs = model(input_ids)
    loss = F.cross_entropy(
        outputs.view(-1, outputs.size(-1)),
        labels.view(-1),
    )
    loss.backward()
    # Use FSDP's sharding-aware gradient clipping (plain clip_grad_norm_
    # would compute per-shard norms rather than the global norm)
    model.clip_grad_norm_(grad_clip)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

Pretraining compute estimate: using the standard C ≈ 6 × params × tokens FLOPs approximation at ~40% MFU (about 4.5e17 useful FLOPs per A100-hour), a 1.4B model on 30B tokens needs roughly 600 A100-hours (~$1K on cloud at $2/GPU-hour); a 7B model on 1T tokens roughly 90K A100-hours (~$190K); a 70B model on 15T tokens roughly 14M A100-hours (tens of millions of dollars). Scaling laws help predict performance before committing resources.
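These estimates follow mechanically from the 6ND rule. A small calculator, where A100 bf16 peak throughput (312 TFLOP/s), 40% MFU, and $2/GPU-hour are assumed constants you should adjust for your hardware and prices:

```python
# Back-of-envelope pretraining cost: C ~= 6 * N * D FLOPs, divided by
# sustained GPU throughput. All constants below are assumptions.
A100_PEAK_FLOPS = 312e12   # bf16 peak, FLOP/s
MFU = 0.40                 # model FLOPs utilization
DOLLARS_PER_GPU_HOUR = 2.0

def gpu_hours(params: float, tokens: float) -> float:
    """Estimated A100-hours to train a `params`-parameter model on `tokens` tokens."""
    total_flops = 6 * params * tokens
    return total_flops / (A100_PEAK_FLOPS * MFU * 3600)

def cost_usd(params: float, tokens: float) -> float:
    return gpu_hours(params, tokens) * DOLLARS_PER_GPU_HOUR

for n, d in [(1.4e9, 30e9), (7e9, 1e12), (70e9, 15e12)]:
    print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens: "
          f"{gpu_hours(n, d):,.0f} A100-hours, ~${cost_usd(n, d):,.0f}")
```

Doubling MFU halves the bill, which is why kernel-level optimization (FlashAttention, fused optimizers) pays for itself quickly at scale.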
Learning Rate Schedule
Pretraining uses a warmup + cosine decay schedule. Warmup prevents early instability when gradients are large and noisy. Cosine decay smoothly reduces learning rate to 10% of peak. The peak learning rate scales with batch size and model size — too high causes loss spikes, too low wastes compute.
import math

def cosine_lr_schedule(step: int, warmup_steps: int, total_steps: int,
                       max_lr: float, min_lr: float) -> float:
    """Cosine learning rate schedule with linear warmup."""
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    elif step < total_steps:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    else:
        return min_lr

# Typical schedule for a 7B model:
# max_lr = 3e-4, min_lr = 3e-5, warmup = 2000 steps, total = 100K steps

Key lessons from pretraining at scale: (1) Data deduplication is critical — duplicates cause memorization and hurt generalization. (2) Loss spikes happen — checkpoint frequently and be ready to roll back. (3) Evaluate continuously during training, not just at the end. (4) Infrastructure reliability matters — a 30-day run that fails at day 25 is catastrophic.
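Lesson (2) can be partly mechanized. A minimal loss-spike detector, sketched here with a rolling-mean heuristic (the window size and spike factor are illustrative, not tuned values); on a flagged step you would restore the last good checkpoint and typically skip or reshuffle the offending data:

```python
from collections import deque

class SpikeDetector:
    """Flag training steps whose loss far exceeds the recent rolling mean,
    signalling that training should roll back to the last good checkpoint."""
    def __init__(self, window: int = 100, factor: float = 2.0):
        self.window = deque(maxlen=window)
        self.factor = factor

    def update(self, loss: float) -> bool:
        """Return True if `loss` is a spike relative to recent history."""
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            if loss > self.factor * mean:
                return True  # spike: do not add it to the history
        self.window.append(loss)
        return False
```

Keeping the spike out of the rolling history prevents one bad step from inflating the baseline and masking subsequent spikes.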