The Two-File LLM
| 13 min read
I rewatched Andrej Karpathy's hour-long talk on LLMs last week. It's from November 2023, which is ancient in this field. The model names are obsolete. GPT-4 is a baseline now. Reasoning models exist. Local inference on a single GPU is normal.
Most of it still holds up.
That's not nothing. The mental model the talk hands you has aged better than almost anything I've read about LLMs since. So I want to walk through what's still right, what's changed, and why I'd still hand it to a friend who asks "what's actually going on with AI right now?".
Two files
Karpathy opens by saying a 70-billion-parameter Llama 2 is, on your file system, two files:
parameters.bin— a binary blob, ~140 GB at 2 bytes per parameter.run.c— about 500 lines that know how to multiply the parameter matrices against an input sequence and produce the next token.
The neural network is the parameters file. The run code is mundane. His point: the difficulty isn't in running the model, it's in producing the parameters file.
In 2026 that framing has gone from cute analogy to literal architecture. The local-LLM ecosystem — llama.cpp, ollama, mlx, vLLM — really is the run code. A .gguf or .safetensors file really is the parameters. You can download Llama 3.1 70B Instruct as a single 40 GB Q4 GGUF and serve it with a 30 MB binary. The run.c has grown well past 500 lines, mostly because of quantisation kernels for every architecture, but the shape is exactly what Karpathy described.
The deeper point he leaves implicit: those two files are the output of a months-long, multi-million-dollar process you can't reproduce from scratch even with the recipe. The compute has been spent. The parameter values are what they are.
That's why a model-weights leak is a seismic event for the field. And it's why running models locally is a fundamentally different posture from using them via API. With API access you're renting cognition. With local weights you own a frozen-in-time artefact of a process that may never be repeated.
My Incant voice dictation project runs entirely on NVIDIA's Parakeet model via sherpa-onnx. No API, no telemetry, no rate limit. The model on disk is the entire product.
Where the money lives
Karpathy gives the famous numbers for pretraining Llama 2 70B:
About 6,000 GPUs running for about 12 days, costing about $2 million, on roughly 10 terabytes of internet text.
Inference, by comparison, is cheap. A single beefy machine can serve a 70B model. The asymmetry between training and inference cost is the central economic fact of LLMs.
Three years on, the numbers have shifted in instructive ways.
Inference got dramatically cheaper. A Llama 3.1 8B Instruct at Q4_KM runs on a single RTX 4090 at ~80 tokens per second. A 70B at Q4 fits in 40 GB of VRAM (a single H100, or a pair of 4090s) at ~25–35 tok/s. Quantisation trades a small amount of quality for huge memory wins, and the loss is barely measurable on most benchmarks above about Q3. The "is local AI viable?"_ question is settled.
Pretraining ballooned in absolute terms but collapsed in capability-per-dollar. Llama 3 was rumoured at ~$60 million of compute. GPT-4 ~$100 million. Frontier 2025 models routinely cross $200 million. But Karpathy's $2M Llama 2 70B is now beaten on most benchmarks by a free 8B model you can run on a laptop. The story isn't "training got cheaper", it's "the floor got cheaper faster than the ceiling got more expensive".
The "compressed the internet" framing aged spectacularly well. Modern corpora have grown from his quoted ~10 TB to ~15 TB for Llama 3 and ~30 TB+ for the largest 2025 models. The Chinchilla-optimal ratio of about 20 tokens per parameter is the field's default rule of thumb now, with modern recipes overtraining by 5–10× past that because inference is cheap enough to pay for baking more in.
The actual ladder of compute costs in 2026, roughly:
- Frontier pretraining: $100M – $1B. Not for individuals.
- Open-weights pretraining (~30B fresh): $1M – $10M. Within reach for a well-funded startup.
- Fine-tuning on your domain: $50 – $5,000. A weekend project.
- LoRA / QLoRA adaptation: $5 – $200. An afternoon on a 4090.
- Inference: cents per million tokens. Negligible.
Karpathy was making the point that pretraining is what makes LLMs an industrial product rather than a research artefact. That's even more true now, with frontier labs spending more on a single training run than entire academic fields spend in a year.
Predicting the next token
What the model actually does: given a sequence of tokens, predict the probability distribution over the next token, sample one, append it, repeat. That's the whole inference loop.
Karpathy calls out that this shouldn't feel like enough to be intelligent. The magic is in how good the prediction has to get before it generalises into something that feels like reasoning.
Tokenisation. Text gets broken into tokens — sub-word pieces produced by byte-pair encoding (BPE) or its modern variants (SentencePiece, tiktoken, Llama's BBPE). A token is roughly 3–4 characters of English. The model never sees raw characters; it sees integers indexing a vocabulary of ~32k–256k tokens. This is a non-trivial source of weirdness. The model "thinks" in tokens, which is why it's bad at counting letters in a word ("How many Rs in strawberry?"), bad at character-level reasoning, and surprisingly good at compression-aware tasks.
The forward pass. Each token maps to an embedding vector (~4k–16k dimensions). The sequence flows through a stack of transformer blocks, typically 32, 80, or 120 of them. Each block is two operations: multi-head self-attention (lets each token "look at" earlier tokens) and a feed-forward MLP (transforms each token in place). Residual connections and layer normalisation around each. That's literally the whole architecture. It's been more or less unchanged since "Attention Is All You Need" in 2017.
The output. After the final block, the last token's representation is multiplied by the embedding matrix again and softmaxed into a distribution over the vocabulary. You pick a token (greedy, top-k, top-p, temperature-controlled), append it, run the whole thing again.
The loss function during training is just "make the prediction better". That's the entire objective. No notion of correctness, no notion of helpfulness, no notion of safety. Just minimise cross-entropy on next-token prediction across the training corpus. Everything else — the apparent reasoning, the apparent knowledge, the apparent personality — falls out as a side effect of optimising that one number, hard, on enough data.
We expected intelligence to require an engineered architecture for it. It turns out you can get it as an emergent property of compressing the internet by predicting the next word.
Base model vs assistant
The base model is what falls out of pretraining. It's a document completion engine. Give it the start of a Wikipedia article and it continues writing it. Give it "Question: What is the capital of France?" and it might continue "Question: What is the capital of Germany?", because the most likely continuation of one question in its training data was another question, not an answer.
To get something that behaves like ChatGPT or Claude, you need a second stage: supervised fine-tuning (SFT) on ~100,000 human-written question-and-answer pairs. The objective is unchanged (still next-token prediction), but the data is different. You're not teaching the model new facts; you're teaching it which format and tone to use when responding.
A third stage, RLHF, has human labellers compare two model outputs and pick which is better. Comparison labels are cheaper and more reliable than authored answers, so this scales.
What's changed since 2023:
- DPO (Direct Preference Optimisation, late 2023) replaced classic RLHF with a simpler loss that doesn't need a separate reward model. Most current open-source models are DPO-tuned.
- Online RL with verifiable rewards has surged in 2025. DeepSeek-R1 and OpenAI's o-series both use it. Instead of humans labelling which answer is better, automatic graders (does the code pass? does the maths verify?) score the outputs, and the model learns to produce verifiable-correct answers.
- Instruction-tuned base models are common now (Llama 3 Instruct, Qwen3 Instruct etc. ship pre-aligned). Karpathy's clean separation between base model and assistant is blurring.
Most of the model's capability comes from stage 1, not stages 2 or 3. SFT and RLHF polish capabilities pretraining already produced. If a base model is bad at maths, no amount of preference labelling will fix it. This is why frontier-lab spending is overwhelmingly on pretraining, not alignment.
Scaling laws
The most empirically validated claim in the talk:
We have an algorithm we can scale, and we know with extreme confidence that as we throw more compute and more data at it, we will get a better model.
That sounds modest. It is not. What it says is: there is no immediately visible wall. Performance scales as a power law in compute and data, and the curve hasn't bent.
The scaling-laws story has three chapters:
- Kaplan et al, 2020. OpenAI showed that loss scales as a power law in parameters, data, and compute, independently. If you have a budget, scale all three together.
- Chinchilla, 2022. DeepMind refined Kaplan's recipe. For a fixed compute budget, the optimal ratio is roughly 20 tokens of data per parameter. Most pre-Chinchilla models were undertrained. This is why a Chinchilla-trained 70B beats a Kaplan-trained 175B.
- Modern overtraining. Inference compute is so much cheaper than training compute that everyone now trains past Chinchilla optimal. Llama 3 8B saw ~15 trillion tokens (~1,800 tokens per parameter, 90× Chinchilla optimal). The model is "overtrained" in the technical sense; in practice it's just exceptional for its size.
What's happened since 2023:
- The pure scaling curve held. GPT-4, Claude 3 Opus, Gemini Ultra, Llama 3.1 405B all sit roughly where you'd extrapolate from 2023.
- Pretraining is hitting data walls, not technical ones. The internet is finite. Most frontier labs have exhausted the freely scrapable English internet. The 2024–25 pivot to synthetic data (have a model generate training data for the next model) and multimodal pretraining (audio and video tokens, vastly more plentiful) is partly a response.
- Sparse mixture-of-experts has emerged as a scaling cheat code. A 200B-parameter MoE with 20B "active" parameters per forward pass costs 20B to run but learns like 200B. DeepSeek-V3, Mixtral, GPT-4 (rumoured) all use this. Karpathy didn't cover MoE; it's the most important architectural change since the talk.
Tools, multimodality, System 2
Karpathy spends a third of the talk on where things were going. Three threads.
Tools. LLMs are bad at things humans use tools for. They can't reliably do arithmetic, look up today's news, or run code. Karpathy's gesture: let them call external tools. He demos ChatGPT using a calculator and a Python interpreter mid-response.
This exploded. Tool calling is now a core capability of every frontier model. OpenAI shipped function calling in mid-2023; Anthropic shipped tool use shortly after. The 2024 Model Context Protocol (MCP) standardised it — tools are declared in a model-agnostic JSON schema, models call them, results come back as structured data. Claude Code, Cursor, and every agentic coding tool I've seen this year are tool-calling loops wrapped in good UX.
Multimodality. Karpathy shows a slide where he talks to ChatGPT through voice, and where it sees a hand-drawn website mock-up and generates the HTML. In late 2023 this was barely shipping. Today every frontier model is natively multimodal. Gemini 1.5/2 ingests video. GPT-4o handles real-time voice. Claude 3.5+ does vision. Image generation in the same model as text generation (rather than via a separate diffusion model) is now common (GPT-4o, Gemini). The "speak to it, show it pictures" UX he gestured at is the default in 2026.
System 1 vs System 2. The prediction Karpathy is most excited about, and the one I think aged best. He invokes Kahneman: System 1 is instinctive, fast, often wrong. System 2 is deliberate, slow, reasoned. LLMs in late 2023 were pure System 1 — they emit a stream of tokens with no internal "let me think about that" stage. The aspiration he names is trees of thought, deliberation, verification, taking longer to produce better answers.
This came true on schedule. OpenAI's o1 (Sept 2024) was the first frontier model to expose "thinking time" as a deliberate axis. Give it more inference compute, get better answers, at the cost of latency. DeepSeek-R1 (Jan 2025) replicated it openly. Claude 3.7 Sonnet shipped "extended thinking". Gemini 2.5 has "thinking" tokens. The whole field has pivoted from "scale pretraining" to "also scale inference-time compute on reasoning". The plot of "accuracy vs inference time" Karpathy sketched on a whiteboard is now the canonical y-axis of every reasoning-model benchmark.
The deep implication for anyone running models locally: inference is no longer trivially cheap. If your local model thinks for 30 seconds before answering, that's 30 seconds of full GPU utilisation. The "inference is free" assumption of 2023 breaks down for the most useful workloads. olbench and tools like it have to start measuring reasoning-mode throughput, not just raw tok/s.
The LLM is an operating system
The single most prescient slide in the talk. Karpathy puts up an OS schematic and re-labels every box:
- The LLM is the kernel. Central process. Takes inputs, does computation, produces outputs.
- The context window is RAM. Working memory the kernel has at any moment.
- Model weights are the disk. Long-term, slow-to-update memory of the world.
- Tools are peripherals / syscalls. Calculators, browsers, file systems, code interpreters. The kernel orchestrates them.
- The user is a process issuing requests through a shell.
- Multimodality is I/O. Vision, audio, generated images.
In late 2023 this was an analogy. By 2026 it's a literal description of how Claude Code, Cursor, and Aider work. You sit in a terminal (the shell), give an LLM (the kernel) a goal, it calls tools (file reads, edits, bash, git, web fetches) until the goal is done. MCP is exactly syscalls-for-LLMs.
The analogy is also a frame for thinking about what's missing. We have RAM (context windows), but they're tiny by OS standards — a 200k-token context is about 800 KB of text. We have disk (weights), but it's read-only at runtime: no fine-tuning during inference, no continual learning. We have syscalls (tools), but no process scheduling, no IPC, no robust permission model.
Every gap is an open research direction. The first lab to ship a real fork() for the LLM-OS — parallel agent branches with shared state — wins the next plateau.
The talk is an hour long. Most of it is still right. The mental model is the most accurate primer I've seen of what an LLM is on disk, where training time goes, why scaling works, and why the whole stack will start to look like an operating system.
Watch it.