
Pretraining an LLM from Scratch: Data, Compute & Architecture

A practical guide to pretraining your own language model — data collection and cleaning, training infrastructure, distributed training with FSDP, and the engineering challenges at scale.

25 min read
March 7, 2026
Pretraining, Distributed Training, FSDP, Data Pipeline, Python

Should You Pretrain?

Pretraining is expensive: a 7B model costs $100K-500K in compute. Before pretraining, consider: (1) Can you fine-tune an existing model? (2) Is there a domain-specific model already available? (3) Do you have enough unique data (100B+ tokens)? Pretraining makes sense when you need a model for a language or domain with limited existing coverage, or when you need full control over the training data for compliance.

Data Pipeline: The Foundation

Data quality determines model quality more than anything else. The pipeline: (1) Collect raw data (Common Crawl, books, code, curated sources). (2) Deduplicate (MinHash, exact dedup). (3) Filter (language detection, quality scoring, toxic content removal). (4) Tokenize and shuffle. A good data pipeline takes months to build.

```python
# Simplified training configuration and data pipeline stages
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    model_size: str = "1.4B"
    vocab_size: int = 32000
    context_length: int = 2048
    num_layers: int = 24
    hidden_dim: int = 2048
    num_heads: int = 16
    num_kv_heads: int = 4      # grouped-query attention (GQA)
    ff_dim: int = 5632
    learning_rate: float = 3e-4
    min_lr: float = 3e-5
    warmup_steps: int = 2000
    total_tokens: int = 30_000_000_000  # 30B tokens
    batch_size_tokens: int = 524_288    # ~512K tokens per batch
    weight_decay: float = 0.1
    grad_clip: float = 1.0


# Data pipeline sketch. The helpers (load_raw_documents, detect_language,
# quality_score, minhash_dedup, is_toxic, is_pii, train_bpe_tokenizer,
# pack_sequences, save_mmap, shuffle) are stand-ins for real implementations.
def prepare_data(raw_dir: str, output_dir: str):
    """Multi-stage data preparation pipeline."""

    # Stage 1: Language detection + quality filtering
    docs = load_raw_documents(raw_dir)
    docs = [d for d in docs if detect_language(d) == "en"]
    docs = [d for d in docs if quality_score(d) > 0.7]

    # Stage 2: Deduplication (MinHash LSH)
    docs = minhash_dedup(docs, threshold=0.8)

    # Stage 3: Content filtering
    docs = [d for d in docs if not is_toxic(d)]
    docs = [d for d in docs if not is_pii(d)]

    # Stage 4: Tokenize and pack into fixed-length sequences
    tokenizer = train_bpe_tokenizer(docs, vocab_size=32000)
    tokens = tokenizer.encode_batch(docs)
    packed = pack_sequences(tokens, max_length=2048)

    # Stage 5: Shuffle and save as memory-mapped arrays
    save_mmap(shuffle(packed), output_dir)
```
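The `minhash_dedup` stage above is a placeholder. To make it concrete, here is a minimal pure-Python sketch of MinHash-based near-duplicate removal. It compares every pair of signatures (O(n²)), whereas production pipelines use LSH banding to avoid that; the hash count, shingle size, and threshold are illustrative choices, not tuned values.

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """Compute a MinHash signature over word 3-gram shingles."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    sig = []
    for seed in range(num_hashes):
        # Each seed defines one hash function; keep the minimum over the shingle set
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def minhash_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each document only if it is not near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimate_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

At web scale the pairwise loop is replaced by banding the signature into LSH buckets so only documents sharing a bucket are compared.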

Distributed Training with FSDP

Models larger than a single GPU's memory require distributed training. Fully Sharded Data Parallel (FSDP) shards model parameters, gradients, and optimizer states across GPUs, so each GPU holds only a fraction of the model. During the forward and backward passes, each layer's parameters are gathered just-in-time and resharded after use. This makes it possible to train a 7B model on 8× A100 GPUs.

```python
import functools

import torch
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def setup_training(model, rank, world_size):
    """Set up FSDP distributed training."""
    torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Wrap each transformer block as its own FSDP unit.
    # TransformerBlock is whatever class your model uses for its layers.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={TransformerBlock},
    )

    model = FSDP(
        model.cuda(rank),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        auto_wrap_policy=wrap_policy,
        device_id=rank,
    )

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1,
    )

    return model, optimizer


# Training loop
def train_step(model, batch, optimizer, grad_clip=1.0):
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()

    outputs = model(input_ids)
    loss = F.cross_entropy(
        outputs.view(-1, outputs.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip padding positions
    )

    loss.backward()
    # Clip via the FSDP-wrapped module so the norm accounts for sharded gradients
    model.clip_grad_norm_(grad_clip)
    optimizer.step()
    optimizer.zero_grad()

    return loss.item()
```

Pretraining compute estimate: A 1.4B model on 30B tokens takes roughly 1,000 A100-hours (~$2K on cloud). A 7B model on 1T tokens takes roughly 90,000 A100-hours (~$180K). A 70B model on 2T tokens takes roughly 1.7M A100-hours (~$3.4M). Scaling laws help predict performance before committing resources.
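These estimates follow the standard ~6·N·D rule of thumb for training FLOPs (N parameters, D tokens). A quick calculator, assuming bf16 peak throughput of 312 TFLOPS per A100 and 40% model FLOPs utilization (MFU) — both figures you would measure and adjust for your own setup:

```python
def a100_hours(params: float, tokens: float,
               peak_flops: float = 312e12, mfu: float = 0.40) -> float:
    """Estimate training GPU-hours with the ~6*N*D FLOPs approximation."""
    total_flops = 6 * params * tokens
    effective_flops_per_sec = peak_flops * mfu  # peak scaled by utilization
    return total_flops / effective_flops_per_sec / 3600

# 1.4B parameters on 30B tokens at 40% MFU -> roughly 560 A100-hours
hours = a100_hours(1.4e9, 30e9)
```

Real runs land above the raw FLOPs estimate once data loading, checkpointing, evaluation, and restarts after failures are included, which is why the quoted figures carry headroom.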

Learning Rate Schedule

Pretraining uses a warmup + cosine decay schedule. Warmup prevents early instability when gradients are large and noisy. Cosine decay smoothly reduces learning rate to 10% of peak. The peak learning rate scales with batch size and model size — too high causes loss spikes, too low wastes compute.

```python
import math

def cosine_lr_schedule(step: int, warmup_steps: int, total_steps: int,
                       max_lr: float, min_lr: float) -> float:
    """Cosine learning rate schedule with linear warmup."""
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    elif step < total_steps:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    else:
        return min_lr

# Typical schedule for a 7B model:
# max_lr = 3e-4, min_lr = 3e-5, warmup = 2000 steps, total = 100K steps
```
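For the 1.4B configuration earlier in this article, the step count falls straight out of the token budget and batch size, which puts the 2,000-step warmup in context:

```python
total_tokens = 30_000_000_000   # 30B-token budget from TrainingConfig
batch_size_tokens = 524_288     # ~512K tokens per optimizer step
warmup_steps = 2_000

total_steps = total_tokens // batch_size_tokens
warmup_fraction = warmup_steps / total_steps

# total_steps is about 57K, so warmup covers roughly 3.5% of training
```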

Key lessons from pretraining at scale: (1) Data deduplication is critical — duplicates cause memorization and hurt generalization. (2) Loss spikes happen — checkpoint frequently and be ready to roll back. (3) Evaluate continuously during training, not just at the end. (4) Infrastructure reliability matters — a 30-day run that fails at day 25 is catastrophic.
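Lesson (2) — checkpoint frequently and be ready to roll back — can be made concrete with a minimal checkpoint helper. This is a sketch: the filename scheme and retention policy are illustrative, and an FSDP run would save through FSDP's state-dict APIs rather than plain `state_dict()` so that sharded state is gathered correctly.

```python
import os
import torch

def save_checkpoint(model, optimizer, step: int, loss: float, ckpt_dir: str,
                    keep_last: int = 3) -> str:
    """Save a training checkpoint and prune older ones to bound disk usage."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step:08d}.pt")
    torch.save({
        "step": step,
        "loss": loss,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)
    # Zero-padded names sort chronologically; drop all but the newest few
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("step_"))
    for old in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, old))
    return path

def load_checkpoint(model, optimizer, path: str) -> int:
    """Restore model and optimizer state; return the step to resume from."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```

When a loss spike corrupts training, resuming from the last good checkpoint (often with a skipped data shard or a lowered learning rate) is usually cheaper than debugging the spike itself.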