Pretraining an LLM from Scratch: Data, Compute & Architecture
A practical guide to pretraining your own language model — data collection and cleaning, training infrastructure, distributed training with FSDP, and the engineering challenges at scale.
Should You Pretrain?
Pretraining is expensive: a 7B model costs $100K-500K in compute. Before pretraining, consider: (1) Can you fine-tune an existing model? (2) Is there a domain-specific model already available? (3) Do you have enough unique data (100B+ tokens)? Pretraining makes sense when you need a model for a language or domain with limited existing coverage, or when you need full control over the training data for compliance.
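Point (3) can be sanity-checked against the Chinchilla scaling rule of thumb of roughly 20 training tokens per parameter. A quick sketch (the 20× constant is a heuristic, not a hard requirement, and modern models are often trained well past it):

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Chinchilla rule of thumb: compute-optimal token count ~= 20x parameters."""
    return 20 * params

# A 7B model is compute-optimal at roughly 140B tokens; training far past
# this point ("overtraining") still improves quality per parameter at inference.
print(f"{chinchilla_optimal_tokens(7e9) / 1e9:.0f}B tokens")  # → 140B tokens
```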
Data Pipeline: The Foundation
Data quality determines model quality more than anything else. The pipeline: (1) Collect raw data (Common Crawl, books, code, curated sources). (2) Deduplicate (MinHash, exact dedup). (3) Filter (language detection, quality scoring, toxic content removal). (4) Tokenize and shuffle. A good data pipeline takes months to build.
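Stage (2) is worth illustrating concretely. Below is a toy MinHash sketch using only the standard library; `shingles`, `minhash_signature`, and `est_jaccard` are illustrative names, and a production pipeline would add LSH banding so it never has to compare all document pairs:

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate documents share most shingles, so their signatures agree in most slots; documents with similarity above a threshold (e.g. 0.8) get collapsed to one copy.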
# Training configuration for a ~1.4B-parameter model
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_size: str = "1.4B"
    vocab_size: int = 32000
    context_length: int = 2048
    num_layers: int = 24
    hidden_dim: int = 2048
    num_heads: int = 16
    num_kv_heads: int = 4  # grouped-query attention (GQA)
    ff_dim: int = 5632
    learning_rate: float = 3e-4
    min_lr: float = 3e-5
    warmup_steps: int = 2000
    total_tokens: int = 30_000_000_000  # 30B tokens
    batch_size_tokens: int = 524_288  # ~512K tokens per batch (256 seqs × 2048)
    weight_decay: float = 0.1
    grad_clip: float = 1.0
# Data pipeline sketch
def prepare_data(raw_dir: str, output_dir: str):
    """Multi-stage data preparation pipeline."""
    # Stage 1: Language detection + quality filtering
    docs = load_raw_documents(raw_dir)
    docs = [d for d in docs if detect_language(d) == "en"]
    docs = [d for d in docs if quality_score(d) > 0.7]
    # Stage 2: Deduplication (MinHash LSH)
    docs = minhash_dedup(docs, threshold=0.8)
    # Stage 3: Content filtering
    docs = [d for d in docs if not is_toxic(d)]
    docs = [d for d in docs if not contains_pii(d)]
    # Stage 4: Tokenize and pack into sequences
    tokenizer = train_bpe_tokenizer(docs, vocab_size=32000)
    tokens = tokenizer.encode_batch(docs)
    packed = pack_sequences(tokens, max_length=2048)
    # Stage 5: Shuffle and save as memory-mapped arrays
    save_mmap(shuffle(packed), output_dir)

Distributed Training with FSDP
Models larger than a single GPU's memory require distributed training. Fully Sharded Data Parallel (FSDP) shards model parameters, gradients, and optimizer states across GPUs, so each GPU holds only a fraction of the model. During the forward and backward passes, parameters are gathered just-in-time and resharded after use. This makes it possible to train a 7B model on 8× A100 GPUs.
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def setup_training(model, rank, world_size):
    """Set up FSDP distributed training."""
    torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Wrap each transformer block as its own FSDP unit
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={TransformerBlock},  # your transformer block class
    )
    model = FSDP(
        model.cuda(rank),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optim state
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        auto_wrap_policy=wrap_policy,
        device_id=rank,
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1,
    )
    return model, optimizer
# Training loop
import torch.nn.functional as F

def train_step(model, batch, optimizer, grad_clip=1.0):
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    outputs = model(input_ids)
    loss = F.cross_entropy(
        outputs.view(-1, outputs.size(-1)),
        labels.view(-1),
    )
    loss.backward()
    # Use FSDP's sharding-aware gradient clipping (plain clip_grad_norm_
    # would compute per-shard norms rather than the global norm)
    model.clip_grad_norm_(grad_clip)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

Pretraining compute estimate: using the standard C ≈ 6 × params × tokens FLOPs approximation at ~40% MFU (about 4.5e17 useful FLOPs per A100-hour), a 1.4B model on 30B tokens needs roughly 600 A100-hours (~$1K on cloud at $2/GPU-hour); a 7B model on 1T tokens roughly 90K A100-hours (~$190K); a 70B model on 15T tokens roughly 14M A100-hours (tens of millions of dollars). Scaling laws help predict performance before committing resources.
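These estimates follow mechanically from the 6ND rule. A small calculator, where A100 bf16 peak throughput (312 TFLOP/s), 40% MFU, and $2/GPU-hour are assumed constants you should adjust for your hardware and prices:

```python
# Back-of-envelope pretraining cost: C ~= 6 * N * D FLOPs, divided by
# sustained GPU throughput. All constants below are assumptions.
A100_PEAK_FLOPS = 312e12   # bf16 peak, FLOP/s
MFU = 0.40                 # model FLOPs utilization
DOLLARS_PER_GPU_HOUR = 2.0

def gpu_hours(params: float, tokens: float) -> float:
    """Estimated A100-hours to train a `params`-parameter model on `tokens` tokens."""
    total_flops = 6 * params * tokens
    return total_flops / (A100_PEAK_FLOPS * MFU * 3600)

def cost_usd(params: float, tokens: float) -> float:
    return gpu_hours(params, tokens) * DOLLARS_PER_GPU_HOUR

for n, d in [(1.4e9, 30e9), (7e9, 1e12), (70e9, 15e12)]:
    print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens: "
          f"{gpu_hours(n, d):,.0f} A100-hours, ~${cost_usd(n, d):,.0f}")
```

Doubling MFU halves the bill, which is why kernel-level optimization (FlashAttention, fused optimizers) pays for itself quickly at scale.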
Learning Rate Schedule
Pretraining uses a warmup + cosine decay schedule. Warmup prevents early instability when gradients are large and noisy. Cosine decay smoothly reduces learning rate to 10% of peak. The peak learning rate scales with batch size and model size — too high causes loss spikes, too low wastes compute.
import math

def cosine_lr_schedule(step: int, warmup_steps: int, total_steps: int,
                       max_lr: float, min_lr: float) -> float:
    """Cosine learning rate schedule with linear warmup."""
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    elif step < total_steps:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    else:
        return min_lr

# Typical schedule for a 7B model:
# max_lr = 3e-4, min_lr = 3e-5, warmup = 2000 steps, total = 100K steps

Key lessons from pretraining at scale: (1) Data deduplication is critical — duplicates cause memorization and hurt generalization. (2) Loss spikes happen — checkpoint frequently and be ready to roll back. (3) Evaluate continuously during training, not just at the end. (4) Infrastructure reliability matters — a 30-day run that fails at day 25 is catastrophic.
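Lesson (2) can be partly mechanized. A minimal loss-spike detector, sketched here with a rolling-mean heuristic (the window size and spike factor are illustrative, not tuned values); on a flagged step you would restore the last good checkpoint and typically skip or reshuffle the offending data:

```python
from collections import deque

class SpikeDetector:
    """Flag training steps whose loss far exceeds the recent rolling mean,
    signalling that training should roll back to the last good checkpoint."""
    def __init__(self, window: int = 100, factor: float = 2.0):
        self.window = deque(maxlen=window)
        self.factor = factor

    def update(self, loss: float) -> bool:
        """Return True if `loss` is a spike relative to recent history."""
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            if loss > self.factor * mean:
                return True  # spike: do not add it to the history
        self.window.append(loss)
        return False
```

Keeping the spike out of the rolling history prevents one bad step from inflating the baseline and masking subsequent spikes.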