Atomic Llama: Running Local LLMs at Full Speed with MTP + TurboQuant

Atomic Llama: MTP + TurboQuant for 50% More Speed

If you've run a 7B–70B parameter model locally, you know the pain: generation speeds measured in single-digit to low-teens tokens per second. The bottleneck isn't your GPU's compute — it's memory bandwidth. Two techniques have emerged independently to attack both problems: Multi-Token Prediction (MTP), which drafts multiple tokens in parallel, and TurboQuant, which squishes the KV cache to 3 bits per element. But running them together required juggling separate forks, patching, and broken builds. Enter the Atomic Llama ecosystem — community-driven forks of llama.cpp that combine these innovations into a single, unified build. The result: up to 50% more throughput, 6× smaller KV caches, and drop-in compatibility with the GGUF ecosystem.

What's in the Box?

Atomic Llama combines three major innovations into one llama.cpp fork:

1. Multi-Token Prediction (MTP)

MTP is a form of speculative decoding. In standard autoregressive generation, the model produces one token at a time, each requiring a full forward pass. MTP changes the game by giving the model assistant heads — extra prediction heads trained during the GGUF quantization process that can predict multiple next tokens in a single pass.

Here's the flow:

The model's main head generates token 1.
The assistant heads simultaneously predict tokens 2, 3, 4, and 5.
The main model verifies all drafted tokens in a single forward pass.
Accepted tokens are emitted; the sequence is extended from where verification succeeded.

For Qwen 3.6 models, this yields approximately 1.5–2× faster generation with no accuracy loss. For Gemma 4, the built-in MTP heads deliver similar gains. The speedup comes purely from doing more work per forward pass — no extra VRAM, no extra compute, just smarter use of what you already have.

Key detail: MTP requires GGUF files built with DCGM (Deep Contextual Generation Mode) or equivalent assistant-head compilation. The standard HuggingFace GGUF files for Qwen 3.6 and Gemma 4 come with these baked in.

2. TurboQuant KV Cache Compression

The KV cache is the silent killer of local LLM inference. Every token processed gets stored as Key and Value vectors, and for a 70B model with a 32K context window, that's tens of gigabytes of KV cache eating VRAM that should be going to model weights.

TurboQuant solves this with a two-stage compression pipeline:

Stage 1 — PolarQuant: The KV embeddings are rotated using a Walsh-Hadamard transform, then converted to polar coordinates. The angles are quantized to ultra-low bit widths. This eliminates the normalization overhead that plagues traditional quantization methods.
Stage 2 — QJL (Quantization via Joint Lookup): A residual encoding layer that eliminates hidden bias from the PolarQuant stage, achieving near-lossless reconstruction.

The result: three quantization levels with dramatically different compression ratios:

Format	Bit Width	Compression Ratio	Best For
`turbo2`	2-bit	6.4×	Extreme long-context, tightest memory budgets
`turbo3`	3-bit	4.6–5.1×	Balanced quality/speed, sweet spot
`turbo4`	4-bit	3.8×	Maximum fidelity with meaningful savings

turbo3 is the workhorse — it compresses the KV cache by roughly 5× while maintaining near-q8_0 prefill speed and ~0.9× decode throughput at long contexts. On Apple Silicon it's been benchmarked at essentially lossless quality at 3-bit.

The real-world impact: running a 70B model with a 100K token context on a single RTX 3090 is now practical. Without TurboQuant, that context window would require over 200GB of VRAM. With turbo3, it fits comfortably on a 24GB card.

3. TriAttention — Smart KV Cache Pruning

TurboQuant compresses what stays. TriAttention decides what stays at all. It's a GPU-accelerated KV cache eviction strategy that:

Scores each token's importance using RoPE-inverted key vectors.
Evicts low-value tokens from the cache during inference.
Works alongside TurboQuant for multiplicative savings.

This is particularly useful for multi-turn conversations or document analysis, where early tokens in a long context window contribute less to subsequent attention calculations.

Building Atomic Llama

The AtomicBot-ai fork is available in three flavors:

Option A: Pre-built Binaries (Easiest)

Head to the releases page and grab the binary for your platform. macOS builds include Metal + BF16 with embedded shader libraries. CUDA builds ship with hardware-optimized kernels.

Option B: Docker (Recommended for Reproducibility)

docker pull dexogen/atomic-llama-cpp-turboquant
docker run -it --gpus all -p 8080:8080 \
  -v ~/models:/models \
  dexogen/atomic-llama-cpp-turboquant \
  llama-server --model /models/Qwen3.6-27B-Instruct-Q4_K_M.gguf \
    --cache-type-k turbo3 --tensor-split 1.0 \
    --mlock -ngl 99

Option C: Build from Source

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git
cd atomic-llama-cpp-turboquant
mkdir build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="80;86;89" \
  -DGGML_CUBLAS=ON
make -j$(nproc) llama-server llama-cli

The key build flags: -DGGML_CUDA=ON enables CUDA kernels (including the TurboQuant-optimized ones), and the CUDA architecture flag should match your GPU. The fork integrates MTP support from upstream PR #22673, so you don't need a separate MTP branch — it's already merged.

Running a Model: Quick Start

Once built, running a model with both MTP and TurboQuant is straightforward:

./llama-server \
  --model Qwen3.6-27B-Instruct-Q4_K_M.gguf \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  -ngl 99 \
  --mlock \
  -t 8 \
  -c 32768 \
  --host 0.0.0.0 \
  --port 8080

Key flags explained:

--cache-type-k turbo3 — KV cache quantization at 3-bit per element.
--cache-type-v turbo3 — Value cache quantization (separate from K).
-ngl 99 — Offload all layers to GPU (maximizes MTP parallelism).
-t 8 — Number of threads for CPU-side work.
-c 32768 — Context window size in tokens.
--mlock — Lock model weights in memory (reduces page faults).

CLI Usage

./llama-cli \
  --model Qwen3.6-27B-Instruct-Q4_K_M.gguf \
  --cache-type-k turbo3 \
  -ngl 99 \
  -p "Write a short story about a robot who discovers poetry." \
  -n 512

Performance Numbers

Here's what the benchmarks look like across different configurations (from community reports and the fork's release notes):

Model	Quant	Standard llama.cpp	Atomic + MTP only	Atomic + MTP + Turbo3
Qwen 3.6 27B	Q4_K_M	~75 t/s	~110 t/s	~105 t/s + 5× context
Qwen 3.6 35B A3B	Q4_K_M	~79 t/s	~92-95 t/s	~88 t/s + 5× context
Gemma 4 72B	Q4_K_M	~35 t/s	~50 t/s	~48 t/s + 5× context
Qwen 3.6 27B	Q4_K_M	~12K context limit	~18K context limit	~100K context (on 24GB)

Notes: throughput numbers are on an RTX 3090 (24GB) or equivalent. The MTP + TurboQuant combination trades a small amount of raw throughput for a massive increase in usable context window — which is often the more valuable optimization in real-world applications.

How It All Fits Together

MTP and TurboQuant attack different bottlenecks, which is why they're complementary:

MTP reduces the number of forward passes needed by drafting multiple tokens per pass. It's a compute-efficiency optimization.
TurboQuant reduces the memory footprint of the KV cache, enabling longer contexts on the same hardware. It's a memory-efficiency optimization.
TriAttention further prunes the cache by evicting low-value tokens, multiplying TurboQuant's benefits.

Together, they turn a GPU that could barely handle a 4K context into one that can comfortably process 100K-token documents — with faster generation to boot. The AtomicBot-ai fork just means you don't have to maintain two separate builds and hope the patches don't conflict.

Gotchas and Caveats

Model compatibility: MTP only works with GGUF files that have assistant heads baked in. Not all quantized models support this — you need Qwen 3.6 or Gemma 4 GGUF files with MTP support. The standard quantization tools on HuggingFace already include assistant heads for these models.

Short context overhead: Below ~1K tokens, TurboQuant's compression savings are negligible, and the rotation + quantization overhead can actually make things slightly slower. TurboQuant really shines at 4K+ context and beyond.

Build matters: If building from source, make sure you're on the atomic-llama-cpp-turboquant main branch — not upstream llama.cpp master. The MTP + TurboQuant integration lives in the AtomicBot fork. Mixing and matching patches from different forks is the #1 source of build failures.

Hardware requirements: TurboQuant's CUDA kernels require NVIDIA GPUs. On Apple Silicon, the Metal builds are solid but the performance gap between turbo2/turbo3/turbo4 narrows slightly compared to NVIDIA. AMD ROCm support is still maturing.

Why This Matters

For the past year, running large language models locally has been a tradeoff: either accept slow generation with short contexts, or buy enterprise hardware. The Atomic Llama forks change the equation by stacking multiple independent optimizations — speculative decoding for speed, low-bit KV compression for memory — into a single drop-in replacement for llama.cpp.

What's particularly impressive about this ecosystem is the community-driven development. After Google published the TurboQuant paper in March 2026, independent implementations appeared within two weeks. The AtomicBot-ai fork took the additional step of combining MTP (merged into upstream llama.cpp via PR #22673) with TurboQuant and TriAttention in a single, well-maintained build that ships Docker images, pre-built binaries for macOS/Linux/CUDA, and even Pinokio integration for one-click deployment.

The result isn't just incremental improvement. It's the difference between "I can run this model" and "I can run this model with a 100K context window at usable speed." That's not a feature upgrade. That's a capability upgrade.

And since it's still llama.cpp at its core — same CLI flags, same GGUF format, same API — you can drop it in without rewriting any of your tooling.

Links

Tags: llama.cpp MTP TurboQuant speculative decoding local LLM KV cache