Insights & Field Notes

AI Systems Journal

LLM Training · Data Engineering · AI Infrastructure · Machine Learning

Where Do LLMs Learn From? Understanding Training Data Sources

Behind every capable language model lies billions of carefully curated tokens. This deep dive explores where training data comes from, how it's filtered and cleaned, and why the data pipeline is just as important as the model architecture itself. From Common Crawl's messy HTML to synthetic data generation, understand the complete journey from raw web pages to production-ready training sets.

Published Jan 5, 2026 · 14 minute read · Technical deep dive
[Figure: Visualization of data flowing through training pipelines for large language models]

The foundation of intelligence

When you interact with ChatGPT, Claude, or Gemini, you're experiencing the end result of a sophisticated data pipeline that ingested and processed trillions of tokens. These models didn't learn language from thin air—they learned from the collective written output of humanity, carefully filtered and structured for machine learning.

The quality and diversity of training data directly determines a model's capabilities. A model trained only on news articles won't write code. One trained exclusively on English won't understand Chinese. And one trained on low-quality spam will produce low-quality outputs. The data is the foundation—everything else builds on top.

Modern language models draw from four primary data sources: massive web crawls that provide breadth, professional databases that ensure quality, synthetic data that addresses scarcity, and commercial partnerships that secure competitive advantages. Each source brings different strengths and requires different processing strategies.

Public web crawling: The foundation of scale

  • Common Crawl, a non-profit organization that regularly crawls the global web and publishes the results freely, supplies the bulk of raw training data volume for most models, often cited as more than 80%
  • Raw web data requires extensive filtering to remove advertisements, HTML code, navigation elements, and low-quality content before it becomes useful for training
  • High-quality websites like Wikipedia, Reddit, Quora, and reputable news sources are specifically targeted, with Reddit's upvote mechanism serving as a natural quality filter
  • The sheer volume from web crawling provides broad general knowledge coverage, but quality varies significantly across sources
  • Filtering pipelines use classifiers to identify and extract meaningful long-form human-written text from the noise of raw HTML
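
As a rough sketch of that extraction step, here is what a minimal content extractor might look like in Python, assuming the BeautifulSoup library and a simple length heuristic; production pipelines use trained classifiers and purpose-built extractors, but the shape of the task is the same:

from bs4 import BeautifulSoup

def extract_main_text(html: str, min_chars: int = 80) -> str:
    """Strip markup and boilerplate tags, keep only long text blocks (toy heuristic)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely contain useful prose.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Keep paragraphs long enough to look like sentences rather than menu items.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if len(p) >= min_chars)

Applied to the raw news page shown later in this article, a filter like this would keep only the article paragraph and discard the navigation, ads, and footer.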

Professional domain databases: Quality over quantity

  • Code repositories like GitHub power models' ability to write software, with datasets containing Python, Java, C++, and dozens of other programming languages
  • Books from collections like BookCorpus provide long, logically structured narratives with long-range coherence, which is essential for complex reasoning capabilities
  • Academic papers from arXiv, PubMed, and other scholarly platforms enable models to discuss technical subjects from quantum mechanics to medical research
  • Professional databases are carefully curated and generally require less aggressive filtering than web crawl data
  • The structured nature of these sources helps models learn coherent long-form generation and domain-specific reasoning patterns

Synthetic data: The emerging frontier

  • Synthetic data generation addresses the looming shortage of high-quality human-written content available on the public internet
  • Advanced models like GPT-4 can generate training examples with detailed reasoning chains (Chain-of-Thought) to teach smaller models (a minimal sketch follows this list)
  • Model self-play creates logically rigorous dialogues where models iteratively improve by learning from their own outputs
  • Synthetic data risks 'model collapse' when training exclusively on machine-generated content without proper quality controls
  • Strategic synthetic data generation is becoming essential as available public human-written data approaches exhaustion
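
As a hedged sketch of the teacher-model approach mentioned above, assuming the openai Python client, an API key in the environment, and an illustrative model name and prompt (none of which reflect any lab's actual recipe):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_cot_example(question: str) -> dict:
    """Ask a stronger teacher model for a step-by-step solution to reuse as training data."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of teacher model
        messages=[
            {"role": "system", "content": "Solve the problem step by step, then state the final answer."},
            {"role": "user", "content": question},
        ],
    )
    return {"prompt": question, "completion": response.choices[0].message.content}

Generated examples are typically filtered for correctness and diversity before being mixed into the student model's training set, which is exactly where the model-collapse risk above comes in.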

Commercial licensing and human feedback

  • Leading companies secure exclusive access to premium content through multimillion-dollar licensing deals with content providers such as Reuters and Stack Overflow
  • Reinforcement Learning from Human Feedback (RLHF) uses thousands of annotators to score model responses and align outputs with human preferences (the data format is sketched after this list)
  • Commercial partnerships provide access to proprietary datasets not available in public web crawls, creating competitive advantages
  • Human feedback isn't pre-training data in the strict sense, but it critically shapes model behavior through post-training alignment processes
  • These curated sources help models develop safety guardrails, improved instruction-following, and more natural conversational abilities
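
The preference data collected for RLHF is usually stored as simple comparison records. A minimal sketch of that format, using illustrative field names similar to those in common open-source alignment toolkits:

import json

# One record per comparison: the same prompt with a preferred and a rejected response.
record = {
    "prompt": "Explain what a tokenizer does.",
    "chosen": "A tokenizer splits text into sub-word units that the model maps to integer IDs...",
    "rejected": "idk, just google it",
}

with open("preferences.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

A reward model is trained on these pairs to predict which response the annotators preferred, and that reward signal then steers the policy model during RLHF.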

Understanding Common Crawl's raw data

  • Common Crawl provides raw HTML mirrors of web pages, not clean text—this includes navigation bars, advertisements, JavaScript code, and footer content
  • The 'dirtiness' of raw data means training pipelines must extract meaningful content from surrounding noise like sidebars, pop-ups, and duplicate mirror sites
  • Character encoding errors, special escape characters, and garbled text require sophisticated detection and cleaning algorithms (a cleanup sketch follows this list)
  • Automated filtering is essential because manual curation at the scale of billions of web pages would be economically impossible
  • The raw data preserves the chaotic reality of the internet, requiring multiple cleaning stages before becoming usable training material
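
A minimal sketch of that cleanup stage, assuming the ftfy library for repairing mangled encodings plus a toy ratio check for garbled characters (the 20% threshold is an illustrative value, not a published standard):

import ftfy

def clean_page_text(text: str, max_junk_ratio: float = 0.2) -> str | None:
    """Repair common encoding damage and drop pages that are mostly non-text junk."""
    fixed = ftfy.fix_text(text)
    # Count characters that are neither letters, digits, whitespace, nor common punctuation.
    junk = sum(1 for ch in fixed if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-"))
    if not fixed or junk / len(fixed) > max_junk_ratio:
        return None  # too garbled to be worth keeping
    return fixed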

The packaging and tokenization process

  • Individual web pages are not fed one at a time—instead, documents are concatenated into long sequences to maximize GPU utilization
  • Special separator tokens like <|endoftext|> mark boundaries between different documents in the packed training sequences
  • Token sequences are split into fixed-length chunks (typically 4096 or 8192 tokens) to match model context windows, even if this splits documents mid-sentence
  • A single training sample may contain portions of 3-5 different documents packed together for computational efficiency
  • Final training data is converted to binary formats (for example, .bin token files with accompanying .idx index files) for fast streaming to GPUs during training
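
A minimal sketch of the packing step described in this list, in plain Python, assuming documents are already tokenized and that <|endoftext|> maps to a single token ID (50256 is the GPT-2 value; other tokenizers differ):

EOT_ID = 50256          # <|endoftext|> in the GPT-2 vocabulary (illustrative)
CONTEXT_LENGTH = 4096   # must match the model's context window

def pack_documents(tokenized_docs: list[list[int]], context_length: int = CONTEXT_LENGTH) -> list[list[int]]:
    """Concatenate tokenized documents with separator tokens, then cut into fixed-length chunks."""
    stream: list[int] = []
    for tokens in tokenized_docs:
        stream.extend(tokens)
        stream.append(EOT_ID)  # boundary marker between documents
    # This toy version drops the final partial chunk; real pipelines pad it or carry it over.
    return [stream[i:i + context_length]
            for i in range(0, len(stream) - context_length + 1, context_length)]

Because documents are simply streamed end to end, any given chunk can start or stop mid-document, which is exactly the several-documents-per-sample behavior described above.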

The multi-stage filtering funnel

  • Heuristic filters use simple rules: keyword filtering for profanity, length filtering to remove fragments, and language detection to keep only the target languages
  • Deduplication through MinHash or LSH algorithms identifies near-duplicate content, keeping only unique examples to improve training efficiency (see the sketch after this list)
  • Quality classifiers trained on manually labeled data score billions of pages, retaining only the top 10-20% based on predicted quality
  • FastText and similar classifiers detect each page's language, while separate checks remove pages with excessive garbled characters or mixed languages
  • The filtering process typically reduces raw Common Crawl data by 80-90%, keeping only the highest-quality subset for actual training
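
The deduplication step can be sketched with the datasketch library; the shingle size and similarity threshold below are illustrative choices rather than the values used by any particular lab:

from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles."""
    words = text.split()
    sig = MinHash(num_perm=num_perm)
    for i in range(max(1, len(words) - 4)):
        sig.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return sig

corpus = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank today",  # exact duplicate
    "training pipelines filter, deduplicate, and tokenize raw web text at scale",
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% estimated Jaccard similarity counts as a duplicate
kept = []
for doc_id, text in enumerate(corpus):
    sig = minhash_of(text)
    if not lsh.query(sig):        # nothing sufficiently similar seen so far
        lsh.insert(str(doc_id), sig)
        kept.append(text)
print(len(kept))  # 2: the duplicate line is dropped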

Typical dataset composition

The well-known Pile dataset (approximately 800GB) is a common reference point for how training data is distributed across sources. The breakdown below shows an illustrative, web-heavy composition in that spirit rather than The Pile's exact proportions:

  • Common Crawl (filtered): 52%
  • Web archives & books: 16%
  • GitHub & code repos: 10%
  • Wikipedia: 4%
  • Academic papers (arXiv, PubMed): 8%
  • Other curated sources: 10%

Note: Exact proportions vary by model and training objective, but Common Crawl typically dominates by volume.
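
When sources are mixed like this, data loaders typically sample documents according to target weights rather than simply concatenating corpora. A minimal sketch using the proportions above as sampling weights (the source names and the implementation are illustrative, not how any specific framework does it):

import random

# Target mixture weights, roughly matching the breakdown above.
MIXTURE = {
    "common_crawl": 0.52,
    "web_archives_books": 0.16,
    "github": 0.10,
    "wikipedia": 0.04,
    "academic_papers": 0.08,
    "other": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 5,200 common_crawl draws, 1,600 web_archives_books draws, and so on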

What raw Common Crawl data looks like

Common Crawl doesn't give you clean paragraphs ready for training—it gives you raw HTML mirrors of web pages. Here's what a crawled news article might actually contain before processing:

<nav> Homepage | About Us | Advertising </nav>
<div class="ads"> Click to claim your $999 gift package! </div>
<script> var x = 10; ... </script>
<p> Beijing time, January 5th, the latest research shows ... </p>
<div class="sidebar"> Recommended articles ... </div>
<div class="footer"> Copyright © 2026 XXX Company </div>

Why is this "dirty"?

  • Noise: Navigation bars, footers, ads, and sidebars add no training value
  • Garbled text: Encoding errors and special escape characters create nonsense strings
  • Low quality: Auto-generated spam, duplicate mirror sites, and inappropriate content

The filtering pipeline's job is to extract only that middle paragraph—the actual human-written content—while discarding everything else. At billions of pages, this requires automated classifiers and heuristic rules working in concert.
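
A minimal sketch of the heuristic side of that filtering, loosely in the spirit of the rule sets published for web-scale datasets (the thresholds here are illustrative, not taken from any specific pipeline):

def passes_heuristics(text: str) -> bool:
    """Cheap rule-based checks applied before any learned quality classifier."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):                    # too short or absurdly long
        return False
    if sum(len(w) for w in words) / len(words) > 12:         # very long "words" usually mean markup or junk
        return False
    if sum(w.isalpha() for w in words) / len(words) < 0.7:   # mostly symbols or numbers
        return False
    return True

Documents that survive rules like these are then scored by the learned quality classifiers described earlier, and only the top slice is kept.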

From HTML to training tokens

After filtering and cleaning, the data pipeline converts messy HTML into a standardized format. The most common intermediate format is JSONL (JSON Lines), where each line represents one clean document:

{"text": "Beijing time, January 5th, the latest research shows...", "source": "common_crawl", "date": "2026-01-05"} {"text": "Python is a widely used programming language...", "source": "github_readme", "date": "2025-12-20"}

Pre-training transformation:

  • Tokenization: Convert clean text into numerical token sequences using BPE, WordPiece, or SentencePiece
  • Packing: Concatenate multiple documents end-to-end with special separator tokens to maximize GPU efficiency
  • Binary conversion: Transform token sequences into .bin token files (often with .idx index files) for fast streaming during training

The final training data is optimized for one thing: keeping GPUs fed with a continuous stream of tokens at maximum throughput. Every stage of the pipeline—from crawling to filtering to tokenization—exists to support efficient, high-quality model training at scale.
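
To make those last steps concrete, here is a minimal sketch of tokenizing a JSONL file and writing a flat binary token file, assuming the tiktoken and numpy libraries and the GPT-2 vocabulary; the clean_docs.jsonl filename is a placeholder for the output of the cleaning stage, and real pipelines shard the output and write an index file alongside it:

import json
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # BPE tokenizer; its <|endoftext|> token ID is 50256
eot = enc.eot_token

all_tokens: list[int] = []
with open("clean_docs.jsonl", encoding="utf-8") as f:   # placeholder path
    for line in f:
        doc = json.loads(line)
        all_tokens.extend(enc.encode_ordinary(doc["text"]))
        all_tokens.append(eot)        # document boundary marker

# uint16 is enough for the ~50k-token GPT-2 vocabulary; larger vocabularies need uint32.
np.array(all_tokens, dtype=np.uint16).tofile("train.bin")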

Key takeaways for practitioners

Data diversity matters

Models need exposure to varied writing styles, domains, and formats. Over-indexing on one source creates capability gaps and biases.

Quality over raw volume

Aggressive filtering reduces data by 80-90%, but the remaining high-quality subset trains better models than using everything.

Deduplication is critical

The internet is full of duplicate content. Deduplication prevents models from memorizing repeated text and improves generalization.

Synthetic data is inevitable

As high-quality human data becomes scarce, carefully generated synthetic data will become essential for continued model improvement.

The data pipeline is the model

In many ways, building a capable language model is more about data engineering than it is about model architecture. Transformers are well-understood and widely available. What separates GPT-4 from a mediocre open-source model isn't primarily the neural network—it's the quality, diversity, and scale of the training data.

Teams building custom models or fine-tuning existing ones should invest heavily in their data pipelines. This means robust filtering, careful deduplication, thoughtful source selection, and continuous quality monitoring. The model will only be as good as the data you feed it.

As we move forward, expect training data strategies to evolve rapidly. The easy gains from scraping the public internet are largely exhausted. The next generation of models will likely blend web crawls, licensed content, synthetic generation, and domain-specific corpora in increasingly sophisticated ways. Understanding these data sources and processing pipelines is essential for anyone working seriously with language models.