1 Month into Local LLM - What I've learned.
One Month Into Local LLMs: What I've Learned
It's been about a month since I went down the local-LLM rabbit hole.
In that time I've:
- Built llama.cpp from source
- Written my own benchmarking software to see how my machine actually performs
- Attempted to write a custom CUDA kernel — mostly as an excuse to understand what's happening underneath the abstractions. I'm still firmly in the "confused but curious" phase here.
Most of what follows is the kind of stuff I wish someone had laid out for me on day one: what all those cryptic model names mean, what my hardware can actually run, and where a model comes from before it ever lands on my disk.
That last paragraph was a lie.
In an ideal world, maybe that's how I learn. In reality, I tend to learn through experimentation rather than reading. I like to build things, break things, and then figure out why they broke. Most of the understanding in this post came from pulling on threads until something failed and then spending far too long figuring out what happened.
Contents
- How did this all happen?
- My machine
- My goals
- Ollama, then llama.cpp
- Building llama.cpp from source
- Why I built from source
- The GGUF question
- Quality vs size: the quantization dial
- What my hardware can actually run
- Where a model actually comes from
- Closing thoughts
How did this all happen?
Let's begin with how we got here.
The latest Path of Exile season had just come out and my old RTX 2060 was struggling to play my new build. I needed a GPU upgrade.
I managed to find a 5060 Ti 16 GB for around £400, which felt like a steal for both gaming and local AI.
After burning out on the PoE season, I flew to Amsterdam to catch up with a friend, attend KubeCon, and spend some time being a tourist in the Netherlands again.
While I was there I worked on a couple of side projects:
- RAIVIZ — a 3D visualization of the RAI Centre and where talks would be held
- IAIKit — a playbook for building AI-powered influencers
Naturally after speaking with other builders, and in particular my friend Jans, the conversation shifted towards AI.
He explained that he was using hosted inference because he'd rather pay directly for compute than pay per token. I was at the other end of the spectrum, paying a token premium for frontier models. Both approaches have their trade-offs.
What bothered me was the sustainability of either approach: hosted inference has issues with latency, and frontier APIs can charge exorbitant amounts for token usage.
This trip really got me thinking...
So, when I got back to London, I started digging into local models.
My machine
Before we begin, here's the hardware everything below was tested on:
- GPU: NVIDIA RTX 5060 Ti (16 GB)
- CPU: Intel Core i9-12900K
- RAM: 32 GB
- OS: Arch Linux
My goals
With the hardware sorted, here's what I actually wanted:
- Run capable models locally
- Avoid paying per N tokens
- Ship local AI to production
- Understand how the stack actually works
Ollama, then llama.cpp
Like most people, I started with Ollama. It worked amazingly well out of the box, with no friction. Excellent product.
Then one day at work I overheard someone mention that Ollama was just a wrapper over llama.cpp. I hate wrappers. This was where my building from source journey began.
Building llama.cpp from source
The RTX 5060 Ti is a Blackwell card with compute capability 12.0 (sm_120).
That means building specifically for Blackwell requires a recent CUDA toolkit (12.8 or newer, since that is the first release that knows about sm_120).
The configuration I landed on looked like this:
    cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES=120 \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLAMA_CURL=ON

    cmake --build build --config Release -j
The important flags:
- GGML_CUDA=ON enables GPU inference
- CMAKE_CUDA_ARCHITECTURES=120 targets Blackwell directly
- CMAKE_BUILD_TYPE=Release enables compiler optimizations
- LLAMA_CURL=ON allows downloading models directly from Hugging Face
Why I built from source
The biggest reason was control. Prebuilt binaries need to work across a huge range of machines, GPUs and operating systems. Most of the time that's perfectly fine. But newer hardware tends to move faster than packaged releases. Building from source gave me confidence that I was targeting the architecture I actually owned, and it gave me access to improvements before they made their way into packaged distributions.
The GGUF question
Once I had llama.cpp running, I opened Hugging Face and immediately ran into another question.
What is GGUF? I saw it everywhere.
At a high level, GGUF is a file format designed for efficient local inference. It stores model weights, metadata, tokenizer information and quantized tensors in a format that tools like llama.cpp can load efficiently.
Most of the GGUF files you'll find on Hugging Face aren't necessarily new models. They're usually existing models that have been converted into a format optimized for local inference.
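To make "a file format" a bit more concrete, here is a minimal sketch that reads the first few fields of a GGUF file. It assumes the version 2 or later header layout as I understand it (a 4-byte `GGUF` magic, then a little-endian uint32 version, a uint64 tensor count and a uint64 metadata entry count). The function name `read_gguf_header` is my own, not part of any library.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size header at the start of a GGUF file.

    Assumes the GGUF v2+ layout: magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic was {magic!r})")
        # "<IQQ" = little-endian uint32 followed by two uint64 values (20 bytes).
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_entries": n_kv}
```

Calling `read_gguf_header("model.gguf")` (a placeholder path) on a downloaded model should report how many tensors and metadata entries the file declares, without loading any weights.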
And why does every model have twenty different versions of itself?
Eventually I realized they're all trying to answer the same question:
How much quality are you willing to trade for size?
Quality vs Size: The Quantization Dial
The analogy that finally made this click for me is image compression.
Think of a BF16 model as a RAW image from a professional camera.
Quantization is converting that RAW image into a JPEG. Naturally this means you are indirectly sacrificing information. The trick is the kind of information you are sacrificing.
| Format | Approx. bits per weight | Quality |
|---|---|---|
| BF16/F16 | 16 | Reference quality |
| Q8 | 8 | Extremely close to BF16 |
| Q6_K | ~6.5 | Near-lossless |
| Q5_K_M | ~5.5 | High quality |
| Q4_K_M | ~4.8 | Common sweet spot |
For many local-LLM users, Q4_K_M ends up sitting at the knee of the curve.
You save a huge amount of memory while preserving most of the quality.
That's why it's usually the first quantization people recommend.
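The memory saving is simple arithmetic: parameters times bits per weight, divided by eight. Here is a rough sketch using the approximate figures from the table above. The 12B parameter count is a hypothetical example, and real GGUF files come out slightly different because not every tensor uses the same quantization.

```python
# Approximate bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8": 8.0,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
}


def weights_size_gb(n_params, bits_per_weight):
    """Size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 12e9  # hypothetical 12B-parameter model
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:7s} ~{weights_size_gb(n_params, bits):5.1f} GB")
```

For that hypothetical 12B model, the weights work out to roughly 24 GB at BF16 and about 7.2 GB at Q4_K_M, which is the difference between not fitting on a 16 GB card at all and fitting with room to spare.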
What My Hardware Can Actually Run
Let's start with the constraints:
- VRAM
- Context length
- Quantization level
- Desired inference speed
A 16 GB card can comfortably run many 12B-class models entirely on the GPU.
The exact limit depends heavily on architecture, context size and quantization.
There isn't a universal rule that says:
X GB of VRAM = Y billion parameters
But 12B models turned out to be a very comfortable place to start.
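A back-of-the-envelope budget helps show why context length matters as much as the weights. The sketch below estimates the KV cache for a standard attention layout and adds it to the weights. The layer count, KV head count and head dimension are hypothetical values for a 12B-class model, and the 1 GB overhead is a guess for compute buffers and everything else on the card. It ignores things like sliding-window attention or a quantized KV cache.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size in GB for a plain attention layout.

    The leading 2 accounts for storing both keys and values;
    bytes_per_elem=2 assumes an f16 cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


def fits_in_vram(weights_gb, kv_gb, vram_gb, overhead_gb=1.0):
    """Crude check: weights + KV cache + a guessed overhead must fit in VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical 12B-class model: 40 layers, 8 KV heads, head dimension 128.
    kv = kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128, context_len=8192)
    print(f"KV cache at 8k context: ~{kv:.2f} GB")
    print("Q4_K_M weights (7.2 GB) fit:", fits_in_vram(7.2, kv, vram_gb=16))
    print("BF16 weights (24 GB) fit:  ", fits_in_vram(24.0, kv, vram_gb=16))
```

With these assumed numbers the cache costs about 1.34 GB at an 8k context, and it grows linearly from there, so doubling the context doubles that part of the bill.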
Where a Model Actually Comes From
One of the biggest mental shifts for me was realizing that the model I download isn't really the model. By the time a GGUF file lands on my disk, it's usually several transformations removed from whatever the original research team actually trained.
It helped to think of it as a pipeline, where each stage hands its output to the next:
Architecture → Pre-training → Post-training → Quantization → Inference
Stage 1 — Architecture
Before any learning happens, someone has to write the model definition: how many layers it has, how the attention blocks are wired, the size of the embedding dimensions, and increasingly whether it routes tokens through a mixture of experts. At this point no weights exist at all — it's pure blueprint. You could instantiate the architecture and it would happily produce complete gibberish, because nothing in it has learned anything yet.
Stage 2 — Pre-training
This is the expensive part, and it's where the "intelligence" really comes from. The model starts with random weights and is fed an enormous amount of text, predicting the next token over and over. Each time it's wrong the error is measured, gradients are computed, and the weights get nudged in a slightly better direction. Repeat that loop over trillions of tokens and the patterns in language slowly get baked into the weights.
This is also the stage that makes headlines, because it's where the money goes — depending on the scale of the run, pre-training can cost anywhere from a few million to hundreds of millions of dollars in compute. What comes out the other end is the base model: something that's extremely good at continuing text, but not yet good at being talked to.
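The predict, measure, nudge loop is easier to believe once you have seen it at toy scale. This is a deliberately tiny sketch, not how real pre-training is implemented: a character-level bigram model (it only looks at the previous character) trained with plain stochastic gradient descent. The function name, the learning rate and the step count are all my own choices for illustration.

```python
import math
import random


def train_bigram(text, steps=2000, lr=0.5, seed=0):
    """Train a character-level bigram model; returns (weights, vocab, losses)."""
    rng = random.Random(seed)
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    n = len(vocab)
    # Random starting weights: W[prev][nxt] is the logit for nxt following prev.
    W = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in range(n)]
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    losses = []
    for _ in range(steps):
        prev, target = rng.choice(pairs)
        # Predict: softmax over the logits for this previous character.
        logits = W[prev]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Measure: cross-entropy loss on the character that actually came next.
        losses.append(-math.log(probs[target]))
        # Nudge: the gradient of the loss w.r.t. each logit is prob - one_hot.
        for j in range(n):
            grad = probs[j] - (1.0 if j == target else 0.0)
            W[prev][j] -= lr * grad
    return W, vocab, losses


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog " * 5
    _, _, losses = train_bigram(corpus)
    print("mean loss, first 100 steps:", sum(losses[:100]) / 100)
    print("mean loss, last 100 steps: ", sum(losses[-100:]) / 100)
```

The loss should fall as training goes on. A real model swaps the lookup table for a transformer and the one-sentence corpus for a large slice of the internet, but the shape of the loop is the same.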
Stage 3 — Post-training
The mental model that stuck with me is this: pre-training teaches the model language, post-training teaches it behaviour.
A raw base model will happily complete your sentence rather than answer your question. Post-training is what turns it into something that follows instructions, holds a conversation, and behaves like an assistant. Tool use, dialogue, and instruction-following all emerge here, through techniques like supervised fine-tuning and reinforcement learning from human feedback. LoRA fine-tuning — the lightweight approach most hobbyists use to nudge a model towards a particular task or personality — also lives in this stage.
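The reason LoRA is cheap enough for hobbyists comes down to parameter counts. Instead of updating a full weight matrix W, it trains two thin matrices B and A and adds their product on top of the frozen W. The sketch below uses plain nested lists rather than a real tensor library. The 4096 dimension and rank 8 are hypothetical, and `scale` stands in for the alpha/rank factor used in the usual formulation.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tune of one matrix vs. its LoRA adapter."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


def apply_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A).

    W is d_out x d_in (frozen), B is d_out x rank, A is rank x d_in.
    """
    rank = len(A)
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]


if __name__ == "__main__":
    full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
    print(f"full: {full:,} params, LoRA: {lora:,} params ({100 * lora / full:.2f}%)")
```

For a single 4096 by 4096 matrix at rank 8, that is 65,536 trainable parameters instead of 16,777,216, well under one percent of the original.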
Stage 4 — Quantization
This is the stage that produces all those Q4, Q5 and Q8 files I was confused about earlier. Crucially, no retraining happens here. Quantization simply stores the existing weights using fewer bits — the same model, just compressed. Logically it's still the network the researchers trained; it's just been squeezed down to something my 16 GB card can actually hold. It's also why you'll see one base model re-uploaded in a dozen quantized variants by the community: they're all the same weights at different levels of compression.
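To see what "fewer bits, no retraining" means in practice, here is a simplified sketch of block-wise symmetric quantization: split the weights into blocks, store one scale per block, and round each weight to a small integer. It loosely resembles the simplest 8-bit scheme in spirit, but it is not llama.cpp's actual code, and the K-quants in the table earlier are considerably more elaborate. The block size of 32 is an assumption for this example.

```python
import random


def quantize_block(values, bits=8):
    """Symmetric quantization of one block: returns (scale, list of ints)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the stored scale and integers."""
    return [scale * x for x in q]


def roundtrip_error(values, bits, block_size=32):
    """Worst absolute error after quantizing in blocks and dequantizing."""
    worst = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, q = quantize_block(block, bits)
        restored = dequantize_block(scale, q)
        worst = max(worst, max(abs(a - b) for a, b in zip(block, restored)))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0, 1) for _ in range(256)]  # stand-in for real weights
    for bits in (8, 6, 4):
        print(f"{bits}-bit worst error: {roundtrip_error(weights, bits):.4f}")
```

The worst-case error grows as the bit count drops, which is the quality-versus-size trade in miniature. Nothing about the network's structure changes; only the precision of each stored number does.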
Stage 5 — Inference
Finally, the part you actually interact with. The model loads into memory, your prompt gets turned into tokens, those tokens become embeddings, and the embeddings flow through every layer until the model produces a probability distribution over the next token. One token gets sampled, appended to the sequence, and the whole thing runs again — over and over — until you've got a full response. Everything in the previous four stages exists to make this loop produce something coherent.
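The sample, append, repeat loop described above can be sketched in a few lines. The `toy_model` function here is a hypothetical stand-in for the real forward pass (tokens to embeddings to layers to logits): it just strongly prefers "previous token plus one". The loop around it is the part that carries over to a real model.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution; lower temperature is sharper."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, prompt_tokens, max_new_tokens, eos_token,
             temperature=1.0, seed=0):
    """Autoregressive loop: predict, sample one token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(tokens), temperature)
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == eos_token:  # stop once the model emits end-of-sequence
            break
    return tokens


def toy_model(tokens, vocab_size=5):
    """Hypothetical forward pass: strongly prefers (last token + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 10.0
    return logits


if __name__ == "__main__":
    print(generate(toy_model, [0], max_new_tokens=10, eos_token=4, temperature=0.1))
```

With a low temperature the toy model counts upwards from the prompt until it hits the end-of-sequence token. Raising the temperature flattens the distribution and lets the less likely tokens through, which is the same knob the temperature setting controls in a real runtime.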
Once I could see these five stages clearly, it became much easier to picture how an LLM works. "Base" versus "instruct" is just the line between stages 2 and 3. The alphabet soup of quant names is all stage 4. And llama.cpp lives almost entirely in the last stage, although it also ships the conversion and quantization tools that handle stage 4.
Closing Thoughts
Going in, I assumed local AI was mostly a hardware problem: get enough VRAM, download a model, done. A month later I think I had it almost backwards.
Nearly all of the questions that actually mattered turned out to be software questions. How was the model trained? What post-training shaped its behaviour? Which quantization should I run? And the most interesting of them all - why does a modern 9B model sometimes outperform an older 14B one? The hardware sets the ceiling, but it was the software stack that decided where I actually landed underneath it.
So that's roughly a month's worth of learning compressed into one post.
Next I want to dig deeper into the CUDA side of things, understand what actually makes inference fast, and maybe finally figure out whether writing a custom kernel is brilliance, madness, or some combination of the two.