Wiki / Runtime

Local Inference on Apple Silicon (MLX)

MLX is Apple's array framework for machine learning on Apple Silicon, exploiting unified memory so a single M-series Mac can hold and serve large language and vision models with no discrete GPU. It is the runtime backbone of a Mac-based sovereign stack.

What MLX is

MLX is Apple's open-source array framework purpose-built for machine learning on Apple Silicon (M-series chips). It plays the role that CUDA + PyTorch play on NVIDIA hardware, but it is designed from the ground up around the unified memory architecture of the M-series SoC, where the CPU and GPU share one physical pool of RAM rather than copying tensors across a PCIe bus.

That single design choice is why a Mac is a credible sovereign inference box. On a discrete-GPU machine, the model has to fit in VRAM, which is small and expensive. On an M4 Max, the model fits in unified memory — the same wide pool the rest of the machine uses — so a checkpoint in the tens of billions of parameters can sit resident with headroom, served straight off the Metal GPU cores.

The unified-memory advantage

On conventional inference hardware, the practical ceiling is the size of the GPU's dedicated VRAM, and getting more of it is disproportionately expensive. Apple Silicon collapses that ceiling: a high-memory M-series Mac can dedicate most of its unified pool to model weights.

In a real sovereign deployment this means an M4 Max can hold a 27-billion-parameter model resident, quantized to 4-bit, with room left for the OS and the surrounding services. There is no VRAM-to-system-RAM swap penalty, no second card to manage, and no driver stack to fight. The model is just a process, and the GPU is just part of the chip it already runs on.

Serving the model

Two patterns dominate on a Mac:

  • MLX directly, via the mlx-lm (text) and mlx-vlm (vision) libraries, typically wrapped behind a small FastAPI shim so agents speak the same OpenAI-style API they used against the cloud. This is the leanest path and the one to reach for when you want full control.
  • LM Studio, which provides an OpenAI-compatible HTTP server (commonly on a local port such as :1234) with a GUI for loading GGUF checkpoints. It is the fastest way to get a compatible endpoint running and is well suited to a second machine in the stack.

Both converge on the same contract: a local http://localhost:<port>/v1/chat/completions endpoint that existing tooling can point at by changing one base URL. That compatibility is what lets a codebase migrate from cloud to sovereign without rewriting its callers.

Performance characteristics

The headline win of local inference is latency: first tokens return in tens of milliseconds because there is no network round-trip and no queue in front of a shared multi-tenant endpoint. Interactive surfaces feel physically faster, which tends to increase usage, which compounds the value of having paid for the box once.

The headline cost is power, not money: a saturated M4 Max under inference draws on the order of a hundred watts. At residential electricity rates a full day of agent traffic costs dimes. The trade-off against the cloud is honest — the frontier hosted models are still smarter than a 27B you run at home — so the durable pattern is to serve the high-volume daily mesh on MLX and reserve a frontier API for the occasional hard reasoning task.