Wiki / Runtime

Quantization for Home Inference

Quantization shrinks a model's weights from 16-bit floats to lower-precision integers, cutting memory footprint several-fold so large models fit on consumer hardware. 4-bit is the workhorse precision for sovereign home setups.

What quantization does

A model ships its weights as numbers. By default those numbers are 16-bit floating point (FP16/BF16), so a 7-billion-parameter model needs roughly 14 GB just to hold the weights, and a 27B needs over 50 GB. Quantization rewrites those weights at lower numerical precision — most commonly 4-bit integers — so each weight occupies a quarter of the space.

The arithmetic is the whole reason home inference is viable. A 27B model at FP16 will not fit comfortably on a consumer machine; the same 27B quantized to 4-bit drops to roughly 14–16 GB and sits resident on an M4 Max with headroom for the OS and surrounding services. Quantization is the lever that converts "datacenter model" into "runs on hardware you own."

Precision levels and the quality trade-off

Lower precision means a smaller, faster model but a small loss of fidelity, measured as a rise in perplexity. The practical ladder:

  • 8-bit (Q8): near-lossless, but roughly double the footprint of 4-bit — used when you have memory to spare and want maximum quality.
  • 4-bit (Q4): the workhorse. The sweet spot where the footprint drops 4× against FP16 while quality degradation stays small enough to be invisible on most everyday tasks. This is what most sovereign setups run.
  • 3-bit and below: noticeably degraded; reserved for squeezing oversized models onto undersized hardware, accepting real quality loss.

For a 4-bit checkpoint, K-quant variants (e.g. Q4_K_M) are preferred over the older flat Q4_0 — they allocate slightly more precision to the layers that matter most, recovering quality at almost no extra size.

Formats: GGUF and MLX

Two checkpoint ecosystems matter on a Mac:

  • GGUF is the format used by llama.cpp and LM Studio. A filename like Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf encodes the model, the size, and the quantization scheme directly in the name. GGUF is the broadest ecosystem and the easiest to load through an LM Studio server.
  • MLX-quantized checkpoints are quantized specifically for Apple's MLX runtime and served via mlx-lm/mlx-vlm. These extract the most out of Apple Silicon's unified-memory path.

A mixed sovereign stack often runs both: an MLX-quantized vision model on the primary Mac and a GGUF chat model under LM Studio on a second box, each pointed at by the same OpenAI-compatible client code.

Choosing a quantization

The decision is a three-way fit between the model size you want, the memory you have, and the quality floor your tasks tolerate. Start at 4-bit K-quant; it is correct for the overwhelming majority of sovereign workloads. Move to 8-bit only if you have surplus memory and notice quality issues on a specific task. Drop below 4-bit only when a model you really want will not otherwise fit — and verify the output on your real prompts before trusting it, because sub-4-bit failures tend to show up on edge cases rather than in casual testing.