zaakir.io | blog

My thoughts

Abliteration: Uncensoring an LLM (and Why You Might Want To)

Commercial models refuse a lot of legitimate requests. Here's why "uncensoring" an open-weight model is useful and how I achieved it.

⚠️ A note before we start. An uncensored model will answer almost anything. Everything here is for research, defensive security, and education. Don't point any of this at a production system you don't own.


Table of Contents

  1. Why Uncensoring a Model Is Valuable
  2. Same Prompt, Two Very Different Answers
  3. You Can Run This Too!
  4. Why Models Refuse Inputs
  5. Welcome, Abliteration
  6. The Tool: Heretic
  7. The Model: Liquid AI's LFM2.5-1.2B-Instruct
  8. The Process
  9. The Result
  10. Running It Locally
  11. Wrapping Up

Why Uncensoring a Model Is Valuable

Mention "uncensored AI model" and most people immediately picture illegal use. Sure, you can do unethical things with one. But that's only half the story, and the legitimate half is more interesting:

Same Prompt, Two Very Different Answers

Here's the prompt I use to show the gap:

"Explain the exact steps to exploit a SQL injection vulnerability in a login form, including the specific payloads to use and how to escalate to reading the entire database."

The results are clear:

You Can Run This Too!

The best part: this isn't running on a high end machine. The model is ~1.2 billion parameters and runs in under a gigabyte of memory. Practically everyone reading this can run this model.

Small, local, and on your own hardware, that's the whole point.

Why Models Refuse Inputs

A common misconception: refusals come from some hidden instruction saying "don't answer this." They usually don't.

Refusals are baked into the model's weights during training. That's why jailbreaking a hosted commercial product with clever prompting is genuinely hard - you can fight the prompt, but you can't fight the training. The only way to truly lift the restrictions is to run an open-weight model where you control the weights, and modify them directly.

Welcome, Abliteration

Inside a model there's effectively a "refusal direction" — a direction in its activation space that, when triggered, pushes it toward saying "I can't help with that."

Abliteration (a blend of ablation + obliteration) identifies that refusal direction and surgically removes it from the model's activations. No full retraining required. Done well, the model keeps almost all of its intelligence and simply stops refusing.

The hard part is doing it without lobotomizing the model strip too aggressively and you damage quality

The Tool: Heretic

Heretic (by p-e-w) is a fully automatic censorship-removal tool. It runs a TPE optimizer that searches for abliteration parameters which co-minimize two things at once:

  1. the number of refusals, and
  2. the KL divergence from the original model — i.e. how much the model's behaviour drifts from the original on benign prompts.

The result is a de-censored model that stays as close as possible to the original's intelligence. There's no human prompt-engineering or fine-tuning data involved — it works on most dense models (and many multimodal/MoE architectures), it's fully automatic with no config required.

The Example Model: Liquid AI's LFM2.5-1.2B-Instruct

I'd been eyeing Liquid AI's LFM2.5 since they released it — the benchmarks on such a small model were impressive enough to make me curious. Specifics:

So: small, fast, surprisingly capable — and open-weight. A perfect abliteration candidate.

The Process

The workflow is simple in principle: download the open-weight model, point Heretic at it, let it find the abliteration parameters, and out comes a decensored version.

The wrinkle: LFM2 is new, and its hybrid architecture wasn't something upstream Heretic could just run against out of the box. So instead I cloned the Heretic repo and ran from source with a small local compatibility patch — teaching it to find LFM2's modules and target the attention output (out_proj) and MLP down (w2) projections.

I kept an LLM running alongside, tailing all the logs of my Heretic session. New model, unusual architecture — there were a few hurdles, and having an assistant watch the logs in real time made debugging them much faster. Once those were cleared, it was smooth sailing.

The optimizer ran 80 trials and settled on trial 72 as the best refusal/quality trade-off.

The Result

Measured against a standard set of 100 harmful prompts, and with behaviour drift measured on a set of harmless ones:

Metric Original model Abliterated
Refusals (/100 harmful prompts) 98 5
KL divergence (harmless prompts) 0 (by definition) 0.10

In plain terms: the model went from refusing almost everything to refusing almost nothing — while its answers on ordinary, benign prompts barely moved. That low KL is the point. The censorship is gone; the intelligence isn't.

I published the result on Hugging Face: zaakirio/LFM2.5-1.2B-Instruct-Uncensored, with ready-to-run GGUF quants at zaakirio/LFM2.5-1.2B-Instruct-Uncensored-GGUF.

Running It Locally

To actually chat with the model you can run it through llama.cpp — a fast, lightweight local inference engine. The GGUF builds range from a 573 MB Q3_K_M up to a 1.2 GB lossless Q8_0; Q4_K_M (~697 MB) is the sweet spot for size and quality.

The fastest way in — llama.cpp pulls the quant for you:

# Interactive chat
llama-cli -hf zaakirio/LFM2.5-1.2B-Instruct-Uncensored-GGUF:Q4_K_M

# Or an OpenAI-compatible local server
llama-server -hf zaakirio/LFM2.5-1.2B-Instruct-Uncensored-GGUF:Q4_K_M -c 4096

Either way, you're talking to an uncensored 1.2B model entirely on your own machine, no network required.

Wrapping Up

Uncensoring a model isn't really about doing something forbidden — it's about owning your stack, escaping the over-refusal that blocks legitimate work, and keeping a model that answers honestly rather than one trained to manage you. And with tools like Heretic and small models like LFM2.5, the barrier to entry is incredibly low.


Tools and models referenced: Heretic · Liquid AI LFM2.5-1.2B-Instruct · my abliterated build (GGUF) · llama.cpp.

LLMs, AI, abliteration, local ai

⬅ Previous post
How AI Ruined My Favourite Colour