Sovereign AGI/ASI · Glossary

Glossary

The vocabulary of the sovereign stack — local inference, quantization, and self-hosting, defined.

35 terms
Abliterated Model
aka uncensored model, refusal-ablated model, abliterationModels
An open-weight model that has had its built-in refusal behavior surgically removed by identifying and zeroing the internal 'refusal direction' in activation space, rather than by retraining. The result complies with prompts the base model would decline, while otherwise preserving capabilities. It is a sovereignty-side technique: because you hold the weights, you can modify the model's behavior directly. Use is the operator's responsibility under local law.
Apple Silicon
aka M-series, Apple M4 Max, ARM MacLocal Inference
Apple's ARM-based system-on-chip line (M1 through M4 and their Pro/Max/Ultra variants) that integrates CPU, GPU, and Neural Engine on one die with a shared memory controller. For local AI it is significant because its unified memory and high memory bandwidth let large models run on the GPU without a dedicated graphics card. An M4 Max with ample unified memory is a credible single-machine inference server.
Cloudflare Named Tunnel
aka cloudflared tunnel, CF tunnel, named tunnelSelf-hosting & Ops
A persistent, credentialed Cloudflare Tunnel (run by the cloudflared daemon) that connects a local service to Cloudflare's edge through an outbound-only connection, publishing it on a real domain without opening any inbound firewall port. 'Named' means it has a stable UUID and credentials file so it survives restarts and is referenced by config rather than a throwaway URL. It is the standard way to put a self-hosted Mac service on the public internet safely.
Context Window
aka context length, ctxLocal Inference
The maximum number of tokens a model can attend to at once, spanning the prompt plus the generated output. Exceed it and the oldest tokens are truncated or the request is rejected. Larger windows enable longer documents and conversations but cost memory and time, because the KV cache and attention compute scale with window length.
Dark Origin
aka hidden origin, unexposed originSelf-hosting & Ops
An origin server whose real IP address is never exposed to the public internet because all traffic is forced through a tunnel or proxy, leaving the machine unreachable directly. With a Cloudflare named tunnel the origin makes only outbound connections, so port scans and direct attacks have nothing to hit. It is the security payoff of the tunnel model: the server is online but not addressable.
Embedding Dimension
aka vector dimension, dimModels
The fixed length of an embedding vector, e.g. 384 for MiniLM or 768/1024 for larger models, which sets both the memory each vector occupies and the resolution of semantic distinctions it can encode. Higher dimensions can capture more nuance but cost more storage and slower comparison. All vectors in one index must share the same dimension to be comparable.
Embeddings
aka vector embeddings, text embeddingsModels
Dense numeric vectors that encode the meaning of text (or images) so that semantic similarity becomes geometric closeness, enabling search, clustering, and retrieval. A local embedding model turns a sentence into a fixed-length vector (e.g. 384 or 768 dimensions) you store and compare with cosine similarity. They are the backbone of semantic search and RAG on a self-hosted stack.
GGUF
aka GPT-Generated Unified FormatLocal Inference
A single-file binary format for storing quantized LLM weights plus all metadata (tokenizer, architecture, hyperparameters) needed to load and run them. It is the format consumed by llama.cpp and LM Studio, and the de facto distribution package for self-hosted models. The name succeeded the older GGML format; one .gguf file is everything the runtime needs.
GGUF vs MLX
aka MLX vs GGUFLocal Inference
The two dominant ways to run quantized models on a Mac: GGUF via llama.cpp/LM Studio is cross-platform and format-portable, while MLX is Apple-native and often faster on Apple Silicon because it is built directly on the unified-memory architecture. GGUF wins on portability and tooling breadth; MLX wins on raw Apple-Silicon throughput for supported models. Many sovereign stacks keep both, choosing per model.
Host-Header Routing
aka host-based routing, multi-tenant routingSelf-hosting & Ops
Serving multiple distinct websites from one server process or one machine by branching on the HTTP Host header of each request, so the same codebase responds as different sites depending on the domain requested. It is the mechanism behind multi-tenant single-codebase deployments. One Next.js app, many domains, decided per request.
Idempotent Sweep
aka idempotent job, safe re-runSelf-hosting & Ops
A scheduled batch job written so that running it twice produces the same result as running it once, by checking what work is already done before acting. On a sovereign stack it is how a launchd-driven file or asset sweep can re-run every few hours safely, skipping items already tagged or downloaded. Idempotence is what makes 'just run it again' a safe recovery strategy rather than a source of duplicates.
KV Cache
aka KV-cache, attention cacheLocal Inference
The key-value cache stores the attention keys and values computed for every token already in the context so they are never recomputed during generation. It is what makes autoregressive decoding fast, but it grows linearly with context length and consumes real memory, often the dominant non-weight consumer of VRAM during a long session. Drop the cache and generation speed collapses; let it grow unbounded and you run out of memory.
launchd
aka LaunchAgent, LaunchDaemon, plist jobSelf-hosting & Ops
macOS's native service manager that starts, supervises, and restarts background processes and scheduled jobs via plist definitions in LaunchAgents/LaunchDaemons. It is the Mac equivalent of systemd and the correct way to keep a sovereign stack's inference servers, tunnels, and sweeps running across reboots and crashes. Jobs can run on a schedule (StartInterval/StartCalendarInterval) or be kept alive (KeepAlive).
llama.cpp
aka llamacppLocal Inference
The C/C++ inference engine that pioneered efficient CPU-and-GPU LLM serving from quantized GGUF weights, and the runtime under many local AI tools including LM Studio. It introduced the GGUF format and the K-quant schemes, and supports Metal acceleration on Apple Silicon. If you are running a GGUF model locally, llama.cpp or a wrapper around it is usually doing the work.
LM Studio
aka LMStudioLocal Inference
A desktop application for downloading, managing, and serving GGUF models locally, exposing them through an OpenAI-compatible HTTP API so existing tooling can point at localhost instead of a cloud provider. It is a common way to run Qwen and other models on a self-hosted box without writing serving code. The OpenAI-compatible endpoint is what makes it a drop-in replacement for hosted inference.
Local Inference
aka on-device inference, edge inferenceLocal Inference
Running a model's forward pass on local hardware instead of calling a remote endpoint. The prompt is tokenized, pushed through the loaded weights on your CPU/GPU/NPU, and decoded into output without a network round trip. Latency becomes a function of your own silicon rather than a provider's queue, and cost drops to electricity.
Local-First
aka local-first software, offline-firstSovereignty
A design philosophy in which the primary copy of data and the primary compute live on the user's own machine, with the network treated as optional rather than required. For AI it means defaulting to local models and falling back to remote services only when necessary, keeping data private and operation possible offline. The stack works first on hardware you own, and reaches out only by choice.
MLX
aka Apple MLX, mlx-lmLocal Inference
Apple's open-source array framework for machine learning on Apple Silicon, designed to exploit the unified memory architecture so models share one memory pool across CPU and GPU with no copy. It powers fast local LLM and vision-model inference on M-series Macs and is the backbone of running Qwen2.5-VL natively on an M4 Max. NumPy-like API, lazy evaluation, and first-class Metal acceleration.
MPS
aka Metal Performance Shaders, mps deviceLocal Inference
Metal Performance Shaders, Apple's GPU-acceleration backend that PyTorch and other frameworks target to run tensor operations on Apple Silicon GPUs instead of CPU or CUDA. Selecting the 'mps' device is how you move local PyTorch inference onto the Mac GPU. It is the bridge that lets cross-platform ML code, not just MLX-native code, exploit Apple hardware.
One Mac, Three Domains
aka single-Mac multi-domain, one-box hostingSelf-hosting & Ops
The sovereign deployment pattern where a single Mac runs one application bound to several ports under launchd, and a Cloudflare named tunnel host-routes multiple apex domains to those ports, so several independent sites live on one machine with no VPS. It is a concrete instance of host-header routing plus tunneling plus process supervision. The whole public footprint is one computer you can hold in your hands.
OpenAI-Compatible API
aka OpenAI-compatible endpoint, drop-in APILocal Inference
A local HTTP endpoint that speaks the same request/response shape as OpenAI's chat-completions API, so any client or SDK written for OpenAI can be repointed at your own server by changing the base URL. LM Studio and llama.cpp both expose one, which is what lets a sovereign stack slot into existing tooling with zero code changes. The compatibility is the migration path off hosted inference.
Own Your Weights
aka own-your-model, open weightsSovereignty
The principle that the model parameters themselves sit on your disk, so the model cannot be deprecated, rate-limited, price-changed, or silently altered by a vendor. Open-weight releases make this possible; you can run, fine-tune, quantize, or ablate the exact checkpoint indefinitely. It is the durability guarantee that hosted APIs cannot offer.
Q4_K_M
aka Q4_K_M GGUF, 4-bit K-quant mediumLocal Inference
A specific 4-bit quantization scheme in the GGUF/llama.cpp K-quant family, where 'M' denotes a medium mix that keeps certain sensitive tensors (like attention and feed-forward layers) at higher precision while quantizing the rest to 4 bits. It is the most common default for self-hosted 7B-class models because it hits a strong balance of size, speed, and retained quality. Roughly 4-5GB for a 7B model.
Quantization
aka weight quantization, model compressionLocal Inference
Compressing model weights from high-precision floats (FP16/BF16) down to lower-bit integers (8-, 5-, 4-, even 2-bit) so the model fits in less memory and runs faster. A 7B model at FP16 needs ~14GB; quantized to 4-bit it drops to ~4GB, making it loadable on consumer hardware. The trade is a small, usually acceptable loss in output quality for a large gain in footprint and speed.
Qwen2.5-VL
aka Qwen2.5-VL-7B, Qwen VLModels
Alibaba's open-weight vision-language model family that accepts both images and text, used for captioning, tagging, OCR, and visual question answering on a local stack. The 7B variant runs comfortably on Apple Silicon via MLX or as a GGUF in LM Studio, making it a practical sovereign alternative to hosted multimodal APIs. 'VL' marks the vision-language multimodal variant as distinct from the text-only Qwen models.
Reverse Proxy
aka reverse-proxy, host-based routingSelf-hosting & Ops
A server that sits in front of one or more backend services and routes incoming requests to the right one, typically by hostname or path, while terminating TLS and adding headers. In a one-machine, many-sites setup it is what lets a single box serve multiple domains by inspecting the Host header. Cloudflare's edge plus a local proxy is a common sovereign pattern for host-based routing.
Sentence-Transformers
aka SBERT, all-MiniLM-L6-v2Models
A Python library and model family (such as all-MiniLM-L6-v2) purpose-built to produce sentence and paragraph embeddings rather than per-token outputs. It is the standard way to stand up a fast, lightweight local embedding service; MiniLM emits 384-dimensional vectors and runs hundreds of items per second on modest hardware. Pairs naturally with a local vector store for sovereign semantic search.
Sovereign AI
aka self-hosted AI, own-your-stack AISovereignty
AI that runs entirely on hardware you own and control, with no dependency on a third-party API, cloud GPU rental, or per-token billing. Weights live on your disk, inference happens in your own process, and no prompt or output leaves the machine. The sovereignty is operational, not ideological: you can pull the network cable and the model still works.
Tokens
aka subword tokensLocal Inference
The sub-word units a model reads and writes; text is split by a tokenizer into tokens before inference, where one token is roughly 0.75 English words. Model limits, throughput, and (in hosted services) pricing are all denominated in tokens, not words or characters. On a sovereign stack tokens cost nothing but compute, but they still determine how much fits in the context window.
Tokens per Second
aka tok/s, t/s, throughputLocal Inference
The throughput metric for local inference, measuring how many tokens the model decodes per second. It is split into prompt-processing speed (how fast it ingests your input) and generation speed (how fast it writes the answer). On Apple Silicon a 7B Q4 model commonly runs tens of tokens/sec; it is the single number that tells you whether a model is usable on your hardware.
Unified Memory
aka UMA, unified memory architectureLocal Inference
A single physical memory pool shared by the CPU, GPU, and Neural Engine on Apple Silicon, eliminating the copy-across-the-PCIe-bus penalty of discrete GPUs. Because the GPU can address the full pool, a 64GB or 128GB Mac can load models that would need an expensive multi-GPU rig on the PC side. It is the architectural reason Macs are punchy local-inference machines.
Vision-Language Model
aka VLM, multimodal modelModels
A multimodal model that takes images alongside text and reasons over both, enabling captioning, visual Q&A, document understanding, and image tagging. Architecturally it pairs a vision encoder with a language model so pixels and tokens share one reasoning space. Run locally, it replaces cloud vision APIs for tagging and OCR work with zero per-call cost.
VRAM
aka GPU memory, graphics memoryLocal Inference
Video RAM, the memory directly accessible to the GPU, which on a discrete-GPU machine is the hard ceiling on what model and context you can hold. A model that does not fit in VRAM must spill to system RAM or disk and slows dramatically. On Apple Silicon there is no separate VRAM pool; the GPU draws from unified memory instead, which is why a high-RAM Mac punches above its weight for local inference.
Zero Open Ports
aka no inbound ports, outbound-onlySelf-hosting & Ops
A network posture in which the host accepts no inbound connections at all, with no listening port exposed to the internet, because public traffic arrives only through an outbound-initiated tunnel. The firewall can deny all inbound and the service is still reachable via the edge. It collapses the attack surface to nearly nothing while keeping the site live.
Zero-Marginal Inference
aka $0-marginal inference, zero marginal costSovereignty
The economic property of a self-hosted stack where, after the hardware is bought, each additional inference costs only electricity rather than a per-token fee. There is no metered API bill, so high-volume workloads like batch tagging or embedding a whole corpus become effectively free. It changes which automations are worth building, because volume no longer maps to spend.