Sovereign AI Glossary — Local Inference & Self-Hosting Terms

Abliterated Model

aka uncensored model, refusal-ablated model, abliterationModels

An open-weight model that has had its built-in refusal behavior surgically removed by identifying and zeroing the internal 'refusal direction' in activation space, rather than by retraining. The result complies with prompts the base model would decline, while otherwise preserving capabilities. It is a sovereignty-side technique: because you hold the weights, you can modify the model's behavior directly. Use is the operator's responsibility under local law.

Apple Silicon

aka M-series, Apple M4 Max, ARM MacLocal Inference

Apple's ARM-based system-on-chip line (M1 through M4 and their Pro/Max/Ultra variants) that integrates CPU, GPU, and Neural Engine on one die with a shared memory controller. For local AI it is significant because its unified memory and high memory bandwidth let large models run on the GPU without a dedicated graphics card. An M4 Max with ample unified memory is a credible single-machine inference server.

Cloudflare Named Tunnel

aka cloudflared tunnel, CF tunnel, named tunnelSelf-hosting & Ops

A persistent, credentialed Cloudflare Tunnel (run by the cloudflared daemon) that connects a local service to Cloudflare's edge through an outbound-only connection, publishing it on a real domain without opening any inbound firewall port. 'Named' means it has a stable UUID and credentials file so it survives restarts and is referenced by config rather than a throwaway URL. It is the standard way to put a self-hosted Mac service on the public internet safely.

Context Window

aka context length, ctxLocal Inference

The maximum number of tokens a model can attend to at once, spanning the prompt plus the generated output. Exceed it and the oldest tokens are truncated or the request is rejected. Larger windows enable longer documents and conversations but cost memory and time, because the KV cache and attention compute scale with window length.

Related:kv cache tokens per sec tokens

Dark Origin

aka hidden origin, unexposed originSelf-hosting & Ops

An origin server whose real IP address is never exposed to the public internet because all traffic is forced through a tunnel or proxy, leaving the machine unreachable directly. With a Cloudflare named tunnel the origin makes only outbound connections, so port scans and direct attacks have nothing to hit. It is the security payoff of the tunnel model: the server is online but not addressable.

Embedding Dimension

aka vector dimension, dimModels

The fixed length of an embedding vector, e.g. 384 for MiniLM or 768/1024 for larger models, which sets both the memory each vector occupies and the resolution of semantic distinctions it can encode. Higher dimensions can capture more nuance but cost more storage and slower comparison. All vectors in one index must share the same dimension to be comparable.

Embeddings

aka vector embeddings, text embeddingsModels

Dense numeric vectors that encode the meaning of text (or images) so that semantic similarity becomes geometric closeness, enabling search, clustering, and retrieval. A local embedding model turns a sentence into a fixed-length vector (e.g. 384 or 768 dimensions) you store and compare with cosine similarity. They are the backbone of semantic search and RAG on a self-hosted stack.

Fine-Tuning

aka finetuning, model adaptationModels

Fine-tuning is the process of continuing a pretrained model's training on a smaller, task-specific dataset so it adapts to a particular domain, tone, or behavior. Full fine-tuning updates every weight and is expensive, which is why home setups usually reach for parameter-efficient methods like LoRA that adjust only a small add-on. The result is a model that answers in your voice or specializes in your data while retaining the general knowledge it learned in pretraining. For a sovereign stack, fine-tuning on hardware you own means the customized weights stay entirely private and permanently yours.

Flash Attention

aka flash-attn, fused attentionLocal Inference

Flash Attention is an optimized way of computing a transformer's attention step that avoids writing the huge intermediate attention matrix to memory, instead calculating it in small tiles kept in fast on-chip memory. The result is the same mathematical output using far less memory bandwidth, which speeds up inference and, crucially, lets a machine hold a longer context window without running out of memory. On a home rig it is often a simple flag that improves both throughput and the practical size of the prompt you can process. It pairs naturally with a quantized KV cache to stretch limited memory further.

GGUF

aka GPT-Generated Unified FormatLocal Inference

A single-file binary format for storing quantized LLM weights plus all metadata (tokenizer, architecture, hyperparameters) needed to load and run them. It is the format consumed by llama.cpp and LM Studio, and the de facto distribution package for self-hosted models. The name succeeded the older GGML format; one .gguf file is everything the runtime needs.

GGUF vs MLX

aka MLX vs GGUFLocal Inference

The two dominant ways to run quantized models on a Mac: GGUF via llama.cpp/LM Studio is cross-platform and format-portable, while MLX is Apple-native and often faster on Apple Silicon because it is built directly on the unified-memory architecture. GGUF wins on portability and tooling breadth; MLX wins on raw Apple-Silicon throughput for supported models. Many sovereign stacks keep both, choosing per model.

GPU Offload

aka layer offload, GPU layers, nglLocal Inference

GPU offload is the practice of loading some or all of a model's layers onto the graphics processor's memory so that inference runs on the fast GPU instead of the slower CPU. In tools like llama.cpp this is set by the number of layers to offload, and fitting the entire model on the GPU eliminates the CPU-layer bottleneck that otherwise caps speed. When a GPU has limited memory, choosing a smaller quantization that lets every layer fit on-device can more than double tokens per second. Balancing quantization level against how many layers fit is the core tuning move for fast home inference.

Host-Header Routing

aka host-based routing, multi-tenant routingSelf-hosting & Ops

Serving multiple distinct websites from one server process or one machine by branching on the HTTP Host header of each request, so the same codebase responds as different sites depending on the domain requested. It is the mechanism behind multi-tenant single-codebase deployments. One Next.js app, many domains, decided per request.

Idempotent Sweep

aka idempotent job, safe re-runSelf-hosting & Ops

A scheduled batch job written so that running it twice produces the same result as running it once, by checking what work is already done before acting. On a sovereign stack it is how a launchd-driven file or asset sweep can re-run every few hours safely, skipping items already tagged or downloaded. Idempotence is what makes 'just run it again' a safe recovery strategy rather than a source of duplicates.

Related:launchd local first

Imatrix Quantization

aka importance matrix quantization, imatrix, IQ quantLocal Inference

Imatrix quantization uses an importance matrix, computed by observing which weights matter most on a sample of real text, to decide where to spend precision when compressing a model to very low bit rates. Guided by this map, aggressive quants such as IQ3 preserve the weights that carry the most meaning and squeeze the rest, so a model can shrink dramatically while holding onto surprising quality. The practical win is fitting a whole model onto a small GPU at a size a naive quant could not match without heavy quality loss. It is what makes ultra-low-bit local inference viable on modest hardware.

KV Cache

aka KV-cache, attention cacheLocal Inference

The key-value cache stores the attention keys and values computed for every token already in the context so they are never recomputed during generation. It is what makes autoregressive decoding fast, but it grows linearly with context length and consumes real memory, often the dominant non-weight consumer of VRAM during a long session. Drop the cache and generation speed collapses; let it grow unbounded and you run out of memory.

launchd

aka LaunchAgent, LaunchDaemon, plist jobSelf-hosting & Ops

macOS's native service manager that starts, supervises, and restarts background processes and scheduled jobs via plist definitions in LaunchAgents/LaunchDaemons. It is the Mac equivalent of systemd and the correct way to keep a sovereign stack's inference servers, tunnels, and sweeps running across reboots and crashes. Jobs can run on a schedule (StartInterval/StartCalendarInterval) or be kept alive (KeepAlive).

llama.cpp

aka llamacppLocal Inference

The C/C++ inference engine that pioneered efficient CPU-and-GPU LLM serving from quantized GGUF weights, and the runtime under many local AI tools including LM Studio. It introduced the GGUF format and the K-quant schemes, and supports Metal acceleration on Apple Silicon. If you are running a GGUF model locally, llama.cpp or a wrapper around it is usually doing the work.

LM Studio

aka LMStudioLocal Inference

A desktop application for downloading, managing, and serving GGUF models locally, exposing them through an OpenAI-compatible HTTP API so existing tooling can point at localhost instead of a cloud provider. It is a common way to run Qwen and other models on a self-hosted box without writing serving code. The OpenAI-compatible endpoint is what makes it a drop-in replacement for hosted inference.

Local Inference

aka on-device inference, edge inferenceLocal Inference

Running a model's forward pass on local hardware instead of calling a remote endpoint. The prompt is tokenized, pushed through the loaded weights on your CPU/GPU/NPU, and decoded into output without a network round trip. Latency becomes a function of your own silicon rather than a provider's queue, and cost drops to electricity.

Local-First

aka local-first software, offline-firstSovereignty

A design philosophy in which the primary copy of data and the primary compute live on the user's own machine, with the network treated as optional rather than required. For AI it means defaulting to local models and falling back to remote services only when necessary, keeping data private and operation possible offline. The stack works first on hardware you own, and reaches out only by choice.

LoRA (Low-Rank Adaptation)

aka low-rank adaptation, LoRA adapterModels

LoRA, short for Low-Rank Adaptation, is a technique for customizing a large model by training a small pair of add-on matrices while leaving the original billions of weights frozen. Because only these compact adapters are updated, fine-tuning becomes feasible on a single consumer GPU and produces a file of megabytes rather than gigabytes. The adapter can be merged into the base model or loaded and swapped at inference time, letting one base model wear many specialized personalities. It is the standard way to teach a local model a specific voice, style, or task without the cost of full fine-tuning.

Mixture of Experts (MoE)

aka MoE, sparse expert modelModels

Mixture of Experts is a model architecture that contains many specialized sub-networks, called experts, but activates only a small subset of them for any given token. This means a model can hold a very large total parameter count while only paying the compute cost of the few experts a router selects at each step. The payoff for local inference is strong quality at a fraction of the runtime of a comparably sized dense model. A small routing network learns which experts to consult, so the model effectively specializes on the fly without running its full weight count every time.

MLX

aka Apple MLX, mlx-lmLocal Inference

Apple's open-source array framework for machine learning on Apple Silicon, designed to exploit the unified memory architecture so models share one memory pool across CPU and GPU with no copy. It powers fast local LLM and vision-model inference on M-series Macs and is the backbone of running Qwen2.5-VL natively on an M4 Max. NumPy-like API, lazy evaluation, and first-class Metal acceleration.

Model Distillation

aka knowledge distillation, teacher-student training, distilled modelModels

Model distillation is a training method in which a smaller student model learns to imitate the outputs, and often the reasoning traces, of a larger and more capable teacher model. The goal is to compress much of the big model's ability into a form that runs fast and cheap on modest hardware. Distilled reasoning models, for example, capture a frontier model's step-by-step problem solving in a 7-billion-parameter body that fits on a home GPU. It is a key reason a sovereign setup can run genuinely strong local models without frontier-scale infrastructure.

MPS

aka Metal Performance Shaders, mps deviceLocal Inference

Metal Performance Shaders, Apple's GPU-acceleration backend that PyTorch and other frameworks target to run tensor operations on Apple Silicon GPUs instead of CPU or CUDA. Selecting the 'mps' device is how you move local PyTorch inference onto the Mac GPU. It is the bridge that lets cross-platform ML code, not just MLX-native code, exploit Apple hardware.

One Mac, Three Domains

aka single-Mac multi-domain, one-box hostingSelf-hosting & Ops

The sovereign deployment pattern where a single Mac runs one application bound to several ports under launchd, and a Cloudflare named tunnel host-routes multiple apex domains to those ports, so several independent sites live on one machine with no VPS. It is a concrete instance of host-header routing plus tunneling plus process supervision. The whole public footprint is one computer you can hold in your hands.

OpenAI-Compatible API

aka OpenAI-compatible endpoint, drop-in APILocal Inference

A local HTTP endpoint that speaks the same request/response shape as OpenAI's chat-completions API, so any client or SDK written for OpenAI can be repointed at your own server by changing the base URL. LM Studio and llama.cpp both expose one, which is what lets a sovereign stack slot into existing tooling with zero code changes. The compatibility is the migration path off hosted inference.

Own Your Weights

aka own-your-model, open weightsSovereignty

The principle that the model parameters themselves sit on your disk, so the model cannot be deprecated, rate-limited, price-changed, or silently altered by a vendor. Open-weight releases make this possible; you can run, fine-tune, quantize, or ablate the exact checkpoint indefinitely. It is the durability guarantee that hosted APIs cannot offer.

Q4_K_M

aka Q4_K_M GGUF, 4-bit K-quant mediumLocal Inference

A specific 4-bit quantization scheme in the GGUF/llama.cpp K-quant family, where 'M' denotes a medium mix that keeps certain sensitive tensors (like attention and feed-forward layers) at higher precision while quantizing the rest to 4 bits. It is the most common default for self-hosted 7B-class models because it hits a strong balance of size, speed, and retained quality. Roughly 4-5GB for a 7B model.

Related:quantization gguf qwen25 vl

Quantization

aka weight quantization, model compressionLocal Inference

Compressing model weights from high-precision floats (FP16/BF16) down to lower-bit integers (8-, 5-, 4-, even 2-bit) so the model fits in less memory and runs faster. A 7B model at FP16 needs ~14GB; quantized to 4-bit it drops to ~4GB, making it loadable on consumer hardware. The trade is a small, usually acceptable loss in output quality for a large gain in footprint and speed.

Related:q4 k m gguf vram unified memory

Qwen2.5-VL

aka Qwen2.5-VL-7B, Qwen VLModels

Alibaba's open-weight vision-language model family that accepts both images and text, used for captioning, tagging, OCR, and visual question answering on a local stack. The 7B variant runs comfortably on Apple Silicon via MLX or as a GGUF in LM Studio, making it a practical sovereign alternative to hosted multimodal APIs. 'VL' marks the vision-language multimodal variant as distinct from the text-only Qwen models.

Reasoning Model

aka chain-of-thought model, thinking model, CoT modelModels

A reasoning model is a language model trained to work through a problem in explicit intermediate steps, a chain of thought, before committing to a final answer. This visible deliberation, often emitted into a separate reasoning field, sharply improves accuracy on math, logic, and multi-step tasks at the cost of generating more tokens and taking longer. Consumers typically read only the clean final answer while the model's scratch work stays in the reasoning stream. Running a distilled reasoning model locally gives a sovereign stack frontier-style rigor without sending any prompt to a cloud API.

Retrieval-Augmented Generation (RAG)

aka RAG, retrieval augmentation, grounded generationLocal Inference

Retrieval-augmented generation, or RAG, is a pattern that gives a model access to an external knowledge base at query time instead of relying only on what it memorized during training. The user's question is embedded, the most relevant documents are pulled from a vector store, and those snippets are inserted into the prompt so the model answers from real, current source material. This grounds responses in your own data, reduces hallucination, and lets a smaller local model punch above its weight on specialized topics. Built entirely on a home stack with local embeddings and a local model, RAG keeps both the knowledge and the queries private.

Reverse Proxy

aka reverse-proxy, host-based routingSelf-hosting & Ops

A server that sits in front of one or more backend services and routes incoming requests to the right one, typically by hostname or path, while terminating TLS and adding headers. In a one-machine, many-sites setup it is what lets a single box serve multiple domains by inspecting the Host header. Cloudflare's edge plus a local proxy is a common sovereign pattern for host-based routing.

Sentence-Transformers

aka SBERT, all-MiniLM-L6-v2Models

A Python library and model family (such as all-MiniLM-L6-v2) purpose-built to produce sentence and paragraph embeddings rather than per-token outputs. It is the standard way to stand up a fast, lightweight local embedding service; MiniLM emits 384-dimensional vectors and runs hundreds of items per second on modest hardware. Pairs naturally with a local vector store for sovereign semantic search.

Sovereign AI

aka self-hosted AI, own-your-stack AISovereignty

AI that runs entirely on hardware you own and control, with no dependency on a third-party API, cloud GPU rental, or per-token billing. Weights live on your disk, inference happens in your own process, and no prompt or output leaves the machine. The sovereignty is operational, not ideological: you can pull the network cable and the model still works.

Temperature

aka sampling temperature, tempLocal Inference

Temperature is a sampling parameter that controls how random a model's next-token choices are when it generates text. A low temperature makes the model favor its most probable tokens, producing focused, deterministic, repeatable output that suits factual and structured tasks; a higher temperature flattens the probabilities so less likely tokens get picked, yielding more varied and creative but less predictable results. Setting it to zero makes generation effectively greedy and reproducible. It is the primary dial an operator turns to trade precision against creativity in a local model.

Tokens

aka subword tokensLocal Inference

The sub-word units a model reads and writes; text is split by a tokenizer into tokens before inference, where one token is roughly 0.75 English words. Model limits, throughput, and (in hosted services) pricing are all denominated in tokens, not words or characters. On a sovereign stack tokens cost nothing but compute, but they still determine how much fits in the context window.

Tokens per Second

aka tok/s, t/s, throughputLocal Inference

The throughput metric for local inference, measuring how many tokens the model decodes per second. It is split into prompt-processing speed (how fast it ingests your input) and generation speed (how fast it writes the answer). On Apple Silicon a 7B Q4 model commonly runs tens of tokens/sec; it is the single number that tells you whether a model is usable on your hardware.

Unified Memory

aka UMA, unified memory architectureLocal Inference

A single physical memory pool shared by the CPU, GPU, and Neural Engine on Apple Silicon, eliminating the copy-across-the-PCIe-bus penalty of discrete GPUs. Because the GPU can address the full pool, a 64GB or 128GB Mac can load models that would need an expensive multi-GPU rig on the PC side. It is the architectural reason Macs are punchy local-inference machines.

Related:apple silicon vram mlx mps

Vision-Language Model

aka VLM, multimodal modelModels

A multimodal model that takes images alongside text and reasons over both, enabling captioning, visual Q&A, document understanding, and image tagging. Architecturally it pairs a vision encoder with a language model so pixels and tokens share one reasoning space. Run locally, it replaces cloud vision APIs for tagging and OCR work with zero per-call cost.

VRAM

aka GPU memory, graphics memoryLocal Inference

Video RAM, the memory directly accessible to the GPU, which on a discrete-GPU machine is the hard ceiling on what model and context you can hold. A model that does not fit in VRAM must spill to system RAM or disk and slows dramatically. On Apple Silicon there is no separate VRAM pool; the GPU draws from unified memory instead, which is why a high-RAM Mac punches above its weight for local inference.

Zero Open Ports

aka no inbound ports, outbound-onlySelf-hosting & Ops

A network posture in which the host accepts no inbound connections at all, with no listening port exposed to the internet, because public traffic arrives only through an outbound-initiated tunnel. The firewall can deny all inbound and the service is still reachable via the edge. It collapses the attack surface to nearly nothing while keeping the site live.

Zero-Marginal Inference

aka $0-marginal inference, zero marginal costSovereignty

The economic property of a self-hosted stack where, after the hardware is bought, each additional inference costs only electricity rather than a per-token fee. There is no metered API bill, so high-volume workloads like batch tagging or embedding a whole corpus become effectively free. It changes which automations are worth building, because volume no longer maps to spend.