The Two-File LLM

February 23, 2026 | 16 min read

Andrej Karpathy's hour-long talk on LLMs from November 2023, is what I consider to be one of the warmest introductions into the space.

The mental model the talk hands you has aged better than almost anything I've read about LLMs since. So I want to walk through the crux of it all in a more digestible format, why it's important and why I'd still hand it to a friend who asks "what's actually going on with AI right now?".

Two files

Karpathy opens by saying a 70-billion-parameter Llama 2 is, on your file system, two files:

a parameters file — a binary blob, ~140 GB at 2 bytes per parameter (it's stored as float16).
run.c — about 500 lines of C, with no dependencies, that know how to multiply the parameter matrices against an input sequence and produce the next token.

The neural network is the parameters file. The run code is mundane. The thing he keeps coming back to is that you can take those two files, a MacBook, and no internet connection at all, and you have a fully self-contained, talkable-to language model. The hard part isn't running it. As he puts it, "the magic really is in the parameters and how do we obtain them".

In 2026 that framing has gone from analogy to architecture. The local-LLM ecosystem — llama.cpp, ollama, mlx, vLLM — is the run code. A .gguf or .safetensors file is the parameters. You can download Llama 3.1 70B Instruct as a single 40 GB Q4 GGUF and serve it with a 30 MB binary. The run.c has grown well past 500 lines, mostly because of quantisation kernels for every architecture, but the shape is exactly what Karpathy described.

The deeper point he leaves implicit: those two files are the output of a months-long, multi-million-dollar process you can't reproduce from scratch even with the recipe. The compute has been spent. The parameter values are what they are.

That's why a model-weights leak is a seismic event for the field. And it's why running models locally is a fundamentally different posture from using them via API. With API access you're renting cognition. With local weights you own a frozen-in-time artefact of a process that may never be repeated.

My Incant voice dictation project runs entirely on NVIDIA's Parakeet model via sherpa-onnx. No API, no telemetry, no rate limit. The model on disk is the entire product.

Where the money lives

Karpathy gives the numbers for pretraining Llama 2 70B: roughly 10 terabytes of internet text, about 6,000 GPUs running for about 12 days, costing about $2 million. (He spreads those figures across a couple of minutes — I've stitched them together here.)

Inference, by comparison, is cheap. A single beefy machine can serve a 70B model. The asymmetry between training and inference cost is the central economic fact of LLMs — and he's blunt that by today's standards even those training numbers are "rookie numbers," off by 10× or more for frontier models, which run into the tens or hundreds of millions of dollars.

Three years on, the numbers have shifted in instructive ways.

Inference got dramatically cheaper. A Llama 3.1 8B Instruct at Q4_KM runs on a single RTX 4090 at ~80 tokens per second. A 70B at Q4 fits in 40 GB of VRAM (a single H100, or a pair of 4090s) at ~25–35 tok/s. Quantisation trades a small amount of quality for huge memory wins, and the loss is barely measurable on most benchmarks above about Q3. The "is local AI viable?"_ question has been settled.

Pretraining ballooned in absolute terms but collapsed in capability-per-dollar. Llama 3 was rumoured at ~$60 million of compute. GPT-4 ~$100 million. Current Frontier models routinely cross $200 million. But Karpathy's $2M Llama 2 70B is now beaten on most benchmarks by a free 8B model you can run on a laptop.

The "compressed the internet" framing aged spectacularly well — and it's worth slowing down on, because Karpathy explains it really well. The parameters, he says, are best thought of as a kind of "zip file of the internet": ~140 GB of weights standing in for ~10 TB of text, a compression ratio of roughly 100×. But — crucially — it's lossy, not lossless. There's no verbatim copy of the training data inside the model; it keeps "the gestalt."

Karpathy's broader point: pretraining is what makes LLMs an industrial product rather than a research artefact.

Predicting the next token

What the model actually does: given a sequence of tokens, predict the probability distribution over the next token, sample one, append it, repeat. That's the whole inference loop. His toy example is "cat sat on a" → the network predicts "mat" with ~97% probability.

Karpathy calls out that this shouldn't feel like enough to be intelligent. His argument for why it is: next-token prediction looks trivial, but to get good at it the network is forced to learn about the world. He pulls up a Wikipedia paragraph about Ruth Handler and points out that to predict the next word — her birth year, what she founded, when she died — the parameters have to actually encode that knowledge. There's a deep, provable relationship between prediction and compression: predict the next word well enough and you have, in effect, compressed the dataset.

The loss function during training is just "make the next-word prediction better". That's the entire objective. No notion of correctness, no notion of helpfulness, no notion of safety. Everything else — the apparent reasoning, the apparent knowledge, the apparent personality — falls out as a side effect of optimising that one number, hard, on enough data.

We expected intelligence to require an architecture engineered for it. It turns out you can get a lot of it as an emergent property of compressing the internet by predicting the next word.

Dreaming, hallucination, and inscrutability

Run a freshly pretrained base model with no prompt and it "dreams" internet documents. Karpathy shows the output: a chunk of plausible-looking Java code, a fake Amazon product listing — complete with an ISBN number the model simply invented because it has learned that "ISBN:" is followed by digits of about that length — and a Wikipedia-style article about a fish (the "black-nose dace") whose details turn out to be roughly correct even though that exact text appears nowhere in training.

This is where lossy pays off as a concept. The model is reconstructing the form of a document and filling it with whatever it half-remembers. Sometimes that's accurate, sometimes it's confabulated, and — his key line — you can't tell from the outside which is which. Hallucination isn't a bug bolted into an otherwise truthful system; it's the same machinery that makes the model work at all.

He pairs this with a second uncomfortable truth: we don't really understand the thing we've built. We know the architecture exactly, but the 100-billion-odd parameters are inscrutable — "we don't actually really know what these parameters are doing." His favourite illustration is the reversal curse: ask GPT-4 "who is Tom Cruise's mother?" and it correctly says Mary Lee Pfeiffer; ask "who is Mary Lee Pfeiffer's son?" and it draws a blank. The knowledge is stored directionally, not as a clean fact you can query from any angle. His conclusion is that LLMs are "mostly inscrutable artifacts" — closer to grown organisms than to engineered machines like a car — which is why a whole field (mechanistic interpretability) exists to reverse-engineer them, and why we evaluate them empirically by poking at behaviour rather than reading the spec.

In 2026 this has only become more true. Interpretability has had real wins, reasoning models hallucinate differently rather than less, and "the model is confidently wrong" remains the single most important thing to teach a non-technical user.

Base model vs assistant

The base model is what falls out of pretraining. It's a document completion engine. Give it the start of a Wikipedia article and it continues writing it. Give it a question and it might just continue with more questions, because in its training data one question is often followed by another, not by an answer.

To get something that behaves like ChatGPT or Claude, you need a second stage: supervised fine-tuning (SFT). The optimisation is identical — still next-token prediction — but you swap the dataset. Instead of raw internet text you train on ~100,000 hand-written question-and-answer conversations, produced by human labellers following detailed labelling instructions (OpenAI's InstructGPT brief boils down to "helpful, truthful, harmless", but in practice runs to tens of pages). The shift is quality over quantity: far fewer documents, but every one a high-grade exemplar. You're not teaching new facts; you're teaching the model which format and tone to adopt. Karpathy's framing: pretraining is about knowledge, fine-tuning is about alignment.

A third, optional stage: Reinforcement Learning from Human Feedback (RLHF) — uses comparison labels. His example: ask a human to write a haiku about paperclips and it's hard; ask them to pick the better of two model-written haikus and it's easy. Comparisons are cheaper and more reliable than authored answers, so the approach scales, and it's how OpenAI squeezed extra performance out of the InstructGPT line.

The improvement loop is iterative and cheap. You deploy, watch for misbehaviours, and when the assistant gets something wrong a labeller writes the corrected response, which goes straight back into the SFT set. Because fine-tuning costs ~a day rather than ~months, labs can do this weekly or daily — pretraining happens maybe once a year, but alignment is a fast feedback loop. Karpathy also flags that the "humans do all the labelling" picture is already dated: increasingly it's human–machine collaboration, with the model drafting and the human supervising.

So, what's changed since this talk from 2023:

DPO (Direct Preference Optimisation, late 2023) replaced classic RLHF with a simpler loss that doesn't need a separate reward model. Most current open-source models are DPO-tuned.
Online RL with verifiable rewards has surged in 2025. DeepSeek-R1 and OpenAI's o-series both use it. Instead of humans labelling which answer is better, automatic graders (does the code pass? does the maths verify?) score the outputs — which, as we'll see, is exactly the "reward function" Karpathy said the field was missing.
Instruction-tuned base models are common now (Llama 3 Instruct, Qwen3 Instruct etc. ship pre-aligned). Karpathy's clean separation between base model and assistant is blurring.

Most of the model's capability still traces back to stage 1, not stages 2 or 3 — SFT and RLHF polish capabilities pretraining already produced. If a base model can't do maths, no amount of preference labelling will fix it. Hence why frontier-lab spending is overwhelmingly on pretraining.

He also grounds all this with the Chatbot Arena leaderboard, which ranks models by ELO the same way you'd rank chess players: blind A/B votes on real prompts. In late 2023 the picture was a clean tier split — proprietary models (GPT-4, Claude) on top, open-weights models (Llama 2, Mistral-based Zephyr) chasing from below. That gap has narrowed dramatically since.

Scaling laws

The most empirically validated claim in the talk: the accuracy of next-token prediction is a "remarkably smooth, well-behaved, predictable" function of just two numbers — N, the number of parameters, and D, the amount of training data. And, in his words, these trends "do not seem to show signs of topping out."

It means there is no immediately visible wall. Better still, he points out that next-token accuracy correlates tightly with the downstream evals we actually care about — going from GPT-3.5 to GPT-4 lifts a whole battery of unrelated tests at once. So you can get "more powerful models for free" by buying a bigger cluster and more data. Scaling is, as he puts it, "one guaranteed path to success." This is the bet underwriting the entire GPU gold rush.

The scaling-laws story has three chapters. (Karpathy only gestures at the first; the rest is context I've added.)

Kaplan et al, 2020. OpenAI showed that loss scales as a power law in parameters, data, and compute, independently. If you have a budget, scale all three together.
Chinchilla, 2022. DeepMind refined Kaplan's recipe. For a fixed compute budget, the optimal ratio is roughly 20 tokens of data per parameter. Most pre-Chinchilla models were undertrained. This is why a Chinchilla-trained 70B beats a Kaplan-trained 175B.
Modern overtraining. Inference compute is much cheaper than training compute that everyone now trains past Chinchilla optimal. Llama 3 8B saw ~15T tokens (~1,800 tokens per parameter, 90× Chinchilla optimal). The model is "overtrained" in the technical sense; in practice it's just exceptional for its size.

What's happened since 2023:

The pure scaling curve held. GPT-4, Claude 3 Opus, Gemini Ultra, Llama 3.1 405B all sit roughly where you'd extrapolate from 2023.
Pretraining is hitting data walls, not technical ones. The internet is finite. Most frontier labs have exhausted the freely scrapable English internet. The 2024–25 pivot to synthetic data (have a model generate training data for the next model) and multimodal pretraining (audio and video tokens, vastly more plentiful) is partly a response.
Sparse mixture-of-experts has emerged as a scaling cheat code. A 200B-parameter MoE with 20B "active" parameters per forward pass costs 20B to run but learns like 200B. DeepSeek-V3, Mixtral, GPT-4 (rumoured) all use this. Karpathy didn't cover MoE; it's the most important architectural change since the talk.

Where he said it was going

Karpathy spends roughly the back third of the talk on future directions.

Tools. LLMs are bad at things humans reach for tools to do: arithmetic, looking up fresh information, running code. His demo walks through ChatGPT researching Scale AI's funding rounds — it issues a browser search (Bing), uses a calculator to impute missing valuations from ratios, writes Python (matplotlib) to plot the data, fits a trend line and extrapolates (cheerfully predicting Scale would be a "$2 trillion company"), and finally calls DALL·E to generate an image.

Tool calling is now a core capability of every frontier model. OpenAI shipped function calling in mid-2023; Anthropic shipped tool use shortly after. The 2024 Model Context Protocol (MCP) standardised it — tools declared in a model-agnostic JSON schema, called by the model, results returned as structured data. Claude Code, Cursor, and every agentic coding tool I've used this year are tool-calling loops wrapped in good UX.

Multimodality. Karpathy shows Greg Brockman's famous demo of ChatGPT turning a hand-drawn pencil sketch of a "MyJoke" website into working HTML and JavaScript, and talks about voice — speaking to the model and having it speak back, "like the movie Her." Today every frontier model is natively multimodal. Gemini ingests video. GPT-4o handles real-time voice. Claude does vision. Image generation now happens inside the same model as text (GPT-4o, Gemini) rather than via a bolted-on diffusion model. The "speak to it, show it pictures" UX he gestured at is simply the default in 2026.

System 1 vs System 2. The prediction I think aged best. He invokes Kahneman: System 1 is fast, instinctive, automatic (you don't compute 2+2, you just know it); System 2 is slow, deliberate, effortful (you have to actually work through 17×24). LLMs in late 2023 were pure System 1 — they emit one token after another, each taking about the same time, with no internal "let me think about this" stage. What he wanted was the ability to "convert time into accuracy": let the model take 30 seconds, lay out a tree of possibilities, reflect, and come back more confident. He literally sketches accuracy-vs-time as a curve we'd like to bend upward.

This came true on schedule. OpenAI's o1 (Sept 2024) was the first frontier model to expose "thinking time" as a deliberate axis. DeepSeek-R1 (Jan 2025) replicated it openly. Claude shipped extended thinking; Gemini 2.5 has thinking tokens. The whole field pivoted from "scale pretraining" to "also scale inference-time compute on reasoning," and that accuracy-vs-time plot is now the canonical axis of every reasoning benchmark.

Self-improvement. Karpathy points at AlphaGo, which had two stages: first imitate strong human players, then — crucially — surpass them through self-play against a simple, automatic reward (did you win the game?). Forty days of self-play and it was beyond the best humans. His open question: what's the stage-two equivalent for language? Today's assistants are stuck in stage one — imitating human labellers, and therefore capped by them. The blocker, he says, is the "lack of a reward criterion in the general case": there's no cheap automatic check for whether an arbitrary paragraph is "good." But in narrow domains with a verifiable answer, you can build that reward. That's exactly what RL-with-verifiable-rewards models (R1, o-series) did — maths and code, where correctness is checkable, became the first places language models started to self-improve past their teachers.

Customisation. He points to OpenAI's then-new GPTs App Store as a first stab at specialised models — custom instructions plus uploaded files, where retrieval-augmented generation (RAG) lets the model "browse" your documents instead of the open web and cite chunks of them. He imagines a future of many task-specialised LLM "experts" rather than one model for everything, and full fine-tuning becoming a customisation lever too. RAG in particular has become arguably the single most production-relevant idea in the talk — it's the backbone of most enterprise LLM deployments today.

The LLM is an operating system

The single most prescient slide in the talk. Karpathy reframes the LLM not as a chatbot but as "the kernel process of an emerging operating system" — something that coordinates memory, compute, and tools to solve a problem. He re-labels every box:

The LLM is the kernel. Central process. Takes inputs, does computation, orchestrates everything else.
The context window is RAM. The model's working memory — and, in his words, a "finite, precious resource" that the kernel pages information in and out of to do your task.
Disk / long-term memory is the knowledge it reaches for — the internet via browsing, files via RAG. (Karpathy ties "disk" mainly to browsable/external knowledge; I find it cleaner to also think of the frozen weights themselves as a read-only disk, but that's my extension of his picture.)
Tools are peripherals / syscalls. Calculators, browsers, code interpreters.
Multimodality is I/O. Vision, audio, generated images.

He even carries the analogy into the ecosystem: just as desktop computing has proprietary operating systems (Windows, macOS) alongside an open-source Linux world, LLMs have proprietary systems (GPT, Claude, Gemini) alongside a fast-maturing open-source stack then anchored by Llama.

There are some caveats to this however. We have RAM (context windows), but they're tiny by OS standards — a 200k-token context is about 800 KB of text. We have disk (weights), but it's read-only at runtime: no fine-tuning during inference, no continual learning. We have syscalls (tools), but no real process scheduling, no IPC, no robust permission model.

Every gap is an open research direction. The first lab to ship a real fork() for the LLM-OS — parallel agent branches with shared state — wins the next plateau.

The talk is still the best primer I've seen on what an LLM actually is and why I'd still hand it to a friend today