One Month Into Local LLMs: What I've Learned

March 31, 2026 | 10 min read

It's been about a month since I went down the local-LLM rabbit hole.

In that time I've:

Built llama.cpp from source
Written my own benchmarking software to see how my machine actually performs
Attempted to write a custom CUDA kernel — mostly as an excuse to understand what's happening underneath the abstractions. I'm still firmly in the "confused but curious" phase here.

Most of what follows is the kind of stuff I wish someone had laid out for me on day one: what all those cryptic model names mean, what my hardware can actually run, and where a model comes from before it ever lands on my disk.

That last paragraph was a lie.

In an ideal world, maybe that's how I learn. In reality, I tend to learn through experimentation rather than reading. I like to build things, break things, and then figure out why they broke. Most of the understanding in this post came from pulling on threads until something failed and then spending far too long figuring out what happened.

How did this all happen?
My machine
My goals
Ollama, then llama.cpp
Building llama.cpp from source
Why I built from source
The GGUF question
Quality vs size: the quantization dial
What my hardware can actually run
Where a model actually comes from
Closing thoughts

How did this all happen?

Let's begin with how we got here.

The latest Path of Exile season had just come out and my old RTX 2060 was struggling to play my new build. I needed a GPU upgrade.

I managed to find a 5060 Ti 16 GB for around £400, which felt like a steal for both gaming and local AI.

After burning out on the PoE season, I flew to Amsterdam to catch up with a friend, attend KubeCon, and spend some time being a tourist in the Netherlands again.

While I was there I worked on a couple of side projects:

RAIVIZ — a 3D visualization of the RAI Centre and where talks would be held
IAIKit — a playbook for building AI-powered influencers

Naturally, while speaking with other builders, and in particular my friend Jans, the conversation shifted towards AI.

He explained that he was using hosted inference because he'd rather pay directly for compute than pay per token. I was at the other end of the spectrum, paying a token premium for frontier models. Both approaches have their trade-offs.

What bothered me was the sustainability of either approach: hosted inference has latency issues, and frontier APIs can charge exorbitant amounts for token usage.

This trip really got me thinking...

So, when I got back to London, I started digging into local models.

My machine

Before we begin, here's the hardware everything below was tested on:

GPU: NVIDIA RTX 5060 Ti (16 GB)
CPU: Intel Core i9-12900K
RAM: 32 GB
OS: Arch Linux

My goals

With the hardware sorted, here's what I actually wanted:

Run capable models locally
Avoid paying per token
Ship local AI to production
Understand how the stack actually works

Ollama, then llama.cpp

Like most people, I started with Ollama. It worked amazingly well out of the box, with no friction. Excellent product.

Then one day at work I overheard someone mention that Ollama was just a wrapper over llama.cpp. I hate wrappers. This was where my building-from-source journey began.

Building llama.cpp from source

The RTX 5060 Ti is a Blackwell card with compute capability 12.0 (sm_120).

That means a Blackwell-specific build needs a recent CUDA toolkit. As far as I can tell, sm_120 support arrived in CUDA 12.8, so anything older won't recognise the architecture.

The configuration I landed on looked like this:

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON

cmake --build build --config Release -j

The important flags:

GGML_CUDA=ON enables GPU inference
CMAKE_CUDA_ARCHITECTURES=120 targets Blackwell directly
CMAKE_BUILD_TYPE=Release enables compiler optimizations
LLAMA_CURL=ON allows downloading models directly from Hugging Face
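
The 120 in CMAKE_CUDA_ARCHITECTURES is just the compute capability with the dot removed. Here is a minimal Python sketch of that mapping, in case you are building for a different card. The helper name is my own invention for illustration; it is not part of CMake or llama.cpp.

```python
def cuda_arch_from_compute_cap(compute_cap: str) -> str:
    """Turn a compute capability like '12.0' into a CMake arch value like '120'."""
    major, minor = compute_cap.strip().split(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"unexpected compute capability: {compute_cap!r}")
    return f"{int(major)}{int(minor)}"


if __name__ == "__main__":
    # 12.0 is the RTX 5060 Ti from this post; 7.5 is the RTX 2060 it replaced.
    for cap in ("12.0", "7.5"):
        print(cap, "->", cuda_arch_from_compute_cap(cap))
```

The result is what you would pass as -DCMAKE_CUDA_ARCHITECTURES=... in the configure step.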

Why I built from source

The biggest reason was control. Prebuilt binaries need to work across a huge range of machines, GPUs and operating systems. Most of the time that's perfectly fine, but newer hardware tends to move faster than packaged releases. Building from source gave me confidence that I was targeting the architecture I actually owned, and it gave me access to improvements before they made their way into packaged distributions.

The GGUF question

Once I had llama.cpp running, I opened Hugging Face and immediately ran into another question.

What is GGUF? I saw it everywhere.

At a high level, GGUF is a file format designed for efficient local inference. It packs the model's tensors (usually quantized), the tokenizer information and the metadata into a single file that tools like llama.cpp can load efficiently.

Most of the GGUF files you'll find on Hugging Face aren't necessarily new models. They're usually existing models that have been converted into a format optimized for local inference.
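
To make "file format" a bit more concrete, here is a small Python sketch that reads the fixed-size start of a GGUF file. It assumes the layout as I understand it from the GGUF spec for version 2 and later: a 4-byte magic of GGUF, then a little-endian uint32 version, a uint64 tensor count and a uint64 metadata key/value count. The function name and the returned dictionary keys are my own, and the example bytes are synthetic rather than taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumed v2+ layout)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file, magic was {magic!r}")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


if __name__ == "__main__":
    # Synthetic header for illustration only; the counts are made up.
    fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 25)
    print(read_gguf_header(fake))
```

For a real file you would pass in the first 24 bytes, for example from open(path, "rb").read(24). Everything after that header (the metadata entries and tensor descriptions) is variable-length and not covered here.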

And why does every model have twenty different versions of itself?

Eventually I realized they're all trying to answer the same question:

How much quality are you willing to trade for size?

Quality vs Size: The Quantization Dial

The analogy that finally made this click for me is image compression.

Think of a BF16 model as a RAW image from a professional camera.

Quantization is converting that RAW image into a JPEG. You are necessarily throwing information away; the trick is choosing which information to throw away.

Format	Approximate bits per weight	Quality
BF16/F16	16	Reference quality
Q8_0	~8.5	Extremely close to BF16
Q6_K	~6.5	Near-lossless
Q5_K_M	~5.5	High quality
Q4_K_M	~4.8	Common sweet spot

For many local-LLM users, Q4_K_M ends up sitting at the knee of the curve.

You save a huge amount of memory while preserving most of the quality.

That's why it's usually the first quantization people recommend.
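
The memory saving is easy to estimate: size is roughly parameters × bits per weight ÷ 8. Here is a rough Python sketch using three of the bit widths from the table and a 12B-parameter model as the example. It is a back-of-the-envelope figure only; real GGUF files mix tensor types and carry metadata, so actual sizes will differ somewhat.

```python
def approx_file_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size in GB (10^9 bytes): parameters * bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Approximate bits per weight, taken from the table above.
BITS = {"BF16": 16.0, "Q6_K": 6.5, "Q4_K_M": 4.8}

if __name__ == "__main__":
    n_params = 12e9  # a 12B-class model, as an example
    for name, bits in BITS.items():
        print(f"{name:8s} ~{approx_file_size_gb(n_params, bits):5.1f} GB")
```

By this estimate a 12B model goes from about 24 GB at BF16 to about 7.2 GB at Q4_K_M, which is the difference between not fitting on a 16 GB card at all and fitting with room to spare.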

What My Hardware Can Actually Run

Let's start with the constraints:

VRAM
Context length
Quantization level
Desired inference speed

A 16 GB card can comfortably run many 12B-class models entirely on the GPU.

The exact limit depends heavily on architecture, context size and quantization.

There isn't a universal rule that says:

X GB of VRAM = Y billion parameters

But 12B models turned out to be a very comfortable place to start.
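
Here is the kind of rough arithmetic I use to sanity-check whether something will fit. It is a sketch under stated assumptions, not a precise predictor: the model dimensions below (40 layers, 8 KV heads, head size 128) are hypothetical values for a 12B-class model, the KV cache is assumed to be stored in 16-bit, and the 1 GB overhead allowance is a guess.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Keys and values (hence the factor of 2) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def fits_in_vram(n_params, bits_per_weight, kv_bytes, vram_gb, overhead_gb=1.0):
    """Return (estimated GB needed, whether it fits). overhead_gb is a guess."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    needed_gb = weights_gb + kv_bytes / 1e9 + overhead_gb
    return needed_gb, needed_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical 12B-class dimensions with an 8192-token context.
    kv = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=8192)
    for name, bits in (("Q4_K_M", 4.8), ("BF16", 16.0)):
        needed, ok = fits_in_vram(12e9, bits, kv, vram_gb=16)
        print(f"{name}: ~{needed:.1f} GB needed, fits in 16 GB: {ok}")
```

The point is less the exact numbers and more that context length shows up as its own term: doubling n_ctx doubles the KV cache, independently of which quantization you picked for the weights.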

Where a Model Actually Comes From

One of the biggest mental shifts for me was realizing that the model I download isn't really the model. By the time a GGUF file lands on my disk, it's usually several transformations removed from whatever the original research team actually trained.

It helped to think of it as a pipeline, where each stage hands its output to the next:

Architecture → Pre-training → Post-training → Quantization → Inference

Stage 1 — Architecture

Before any learning happens, someone has to write the model definition: how many layers it has, how the attention blocks are wired, the size of the embedding dimensions, and increasingly whether it routes tokens through a mixture of experts. At this point no weights exist at all — it's pure blueprint. You could instantiate the architecture and it would happily produce complete gibberish, because nothing in it has learned anything yet.

Stage 2 — Pre-training

This is the expensive part, and it's where the "intelligence" really comes from. The model starts with random weights and is fed an enormous amount of text, predicting the next token over and over. Each time it's wrong the error is measured, gradients are computed, and the weights get nudged in a slightly better direction. Repeat that loop over trillions of tokens and the patterns in language slowly get baked into the weights.

This is also the stage that makes headlines, because it's where the money goes — depending on the scale of the run, pre-training can cost anywhere from a few million to hundreds of millions of dollars in compute. What comes out the other end is the base model: something that's extremely good at continuing text, but not yet good at being talked to.
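
The predict, measure, nudge loop can be shown at toy scale. The sketch below is not how real pre-training is implemented; it is a character-level bigram model in plain Python (a table of next-character logits, trained with full-batch gradient descent on cross-entropy), and every name in it is mine. The shape of the loop is the same, though: random weights, a next-token prediction, a loss, a gradient, an update.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(text, steps=200, lr=0.5, seed=0):
    """Train a table of next-character logits; returns (vocab, weights, losses)."""
    rng = random.Random(seed)
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    n = len(vocab)
    # Start from small random weights: weights[prev][next] is a logit.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    losses = []
    for _ in range(steps):
        total_loss = 0.0
        grads = [[0.0] * n for _ in range(n)]
        for prev, nxt in pairs:
            probs = softmax(weights[prev])
            total_loss += -math.log(probs[nxt])  # how wrong was the prediction?
            # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
            for j in range(n):
                grads[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
        # Nudge every weight a little in the direction that lowers the loss.
        for i in range(n):
            for j in range(n):
                weights[i][j] -= lr * grads[i][j] / len(pairs)
        losses.append(total_loss / len(pairs))
    return vocab, weights, losses


if __name__ == "__main__":
    vocab, weights, losses = train_bigram("hello world hello world")
    print(f"loss at first step: {losses[0]:.3f}, at last step: {losses[-1]:.3f}")
```

A real LLM swaps the lookup table for a transformer and the 23-character string for trillions of tokens, but the loss going down over repeated updates is the same mechanism.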

Stage 3 — Post-training

The mental model that stuck with me is this: pre-training teaches the model language, post-training teaches it behaviour.

A raw base model will happily complete your sentence rather than answer your question. Post-training is what turns it into something that follows instructions, holds a conversation, and behaves like an assistant. Tool use, dialogue, and instruction-following all emerge here, through techniques like supervised fine-tuning and reinforcement learning from human feedback. LoRA fine-tuning — the lightweight approach most hobbyists use to nudge a model towards a particular task or personality — also lives in this stage.
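
LoRA is the easiest of these to show in a few lines. As I understand it, the original weight matrix W stays frozen and training only touches two small matrices, B and A, whose product is added back on top: W' = W + (alpha / r) · B·A, where r is the rank. The plain-Python sketch below only shows the merge step with tiny made-up matrices; it does not train anything, and the function names are mine.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    delta = matmul(b, a)  # (out x rank) @ (rank x in) -> (out x in)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]          # frozen base weights, 2 x 3
    b = [[1.0], [2.0]]             # 2 x 1 (rank 1)
    a = [[0.5, 0.0, 1.0]]          # 1 x 3
    print(lora_merge(w, a, b, alpha=2.0, rank=1))
```

The appeal for hobbyists is the parameter count: here W has 6 entries while B and A have 5 between them, but for a real layer with thousands of rows and columns and a rank of, say, 16, the adapter is a tiny fraction of the full matrix.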

Stage 4 — Quantization

This is the stage that produces all those Q4, Q5 and Q8 files I was confused about earlier. Crucially, no retraining happens here. Quantization simply stores the existing weights using fewer bits — the same model, just compressed. Logically it's still the network the researchers trained; it's just been squeezed down to something my 16 GB card can actually hold. It's also why you'll see one base model re-uploaded in a dozen quantized variants by the community: they're all the same weights at different levels of compression.
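
To show what "fewer bits" means mechanically, here is a simplified Python sketch of block quantization: take a small block of weights, store one floating-point scale for the block, and store each weight as a small integer multiple of that scale. This is loosely in the spirit of the simpler llama.cpp formats; the K-quants used for Q4_K_M and friends are more elaborate, and nothing below is llama.cpp's actual code.

```python
import random


def quantize_block(values, bits=8):
    """Symmetric quantization of one block: one float scale plus small integers."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate weights: each one is just scale * integer."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    rng = random.Random(0)
    block = [rng.uniform(-1.0, 1.0) for _ in range(32)]  # made-up weights
    for bits in (8, 4):
        scale, quants = quantize_block(block, bits)
        restored = dequantize_block(scale, quants)
        worst = max(abs(a - b) for a, b in zip(block, restored))
        print(f"{bits}-bit: worst round-trip error {worst:.4f}")
```

No gradients, no data, no training: the only inputs are the weights that already exist. The rounding error per weight is at most half the scale, which is why fewer bits (a coarser grid) means more error.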

Stage 5 — Inference

Finally, the part you actually interact with. The model loads into memory, your prompt gets turned into tokens, those tokens become embeddings, and the embeddings flow through every layer until the model produces a probability distribution over the next token. One token gets sampled, appended to the sequence, and the whole thing runs again — over and over — until you've got a full response. Everything in the previous four stages exists to make this loop produce something coherent.
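
The sample, append, repeat loop is short enough to sketch in Python. The "model" here is a hypothetical stand-in (toy_next_logits just prefers the previous token id plus one), so this shows only the shape of the loop, not tokenization, embeddings or the layers themselves. All names are my own.

```python
import math
import random


def sample_next(logits, temperature, rng):
    """Pick a token id from logits; temperature 0 means greedy (argmax)."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalised softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens, eos_id, temperature=0.0, seed=0):
    """Autoregressive loop: score, sample, append, repeat."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = sample_next(next_logits(ids), temperature, rng)
        ids.append(token)
        if token == eos_id:  # stop once the end-of-sequence token appears
            break
    return ids


def toy_next_logits(ids, vocab_size=5):
    """Hypothetical stand-in for a model: strongly prefers (last token + 1)."""
    target = (ids[-1] + 1) % vocab_size
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    print(generate(toy_next_logits, [0], max_new_tokens=10, eos_id=4))
    print(generate(toy_next_logits, [0], max_new_tokens=10, eos_id=4, temperature=1.0))
```

With temperature 0 the loop always takes the highest-scoring token, so the output is deterministic; with a positive temperature it samples from the distribution, which is where the variation between runs comes from.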

Once I could see these five stages clearly, how an LLM works became much easier to reason about. "Base" versus "instruct" is just the line between stages 2 and 3. The alphabet soup of quant names is all stage 4. And when I run a model, llama.cpp is only touching the last stage (although the project also ships the conversion and quantization tools that perform stage 4).

Closing Thoughts

Going in, I assumed local AI was mostly a hardware problem: get enough VRAM, download a model, done. A month later I think I had it almost backwards.

Nearly all of the questions that actually mattered turned out to be software questions. How was the model trained? What post-training shaped its behaviour? Which quantization should I run? And the most interesting of them all - why does a modern 9B model sometimes outperform an older 14B one? The hardware sets the ceiling, but it was the software stack that decided where I actually landed underneath it.

So that's roughly a month's worth of learning compressed into one post.

Next I want to dig deeper into the CUDA side of things, understand what actually makes inference fast, and maybe finally figure out whether writing a custom kernel is brilliance, madness, or some combination of the two.

zaakir.io | blog