
Local Embeddings & Semantic Search

Embeddings turn text into vectors so meaning can be compared by distance, enabling semantic search and retrieval without keyword matching. A small sentence-transformer running locally delivers this at zero marginal cost.

What an embedding is

An embedding is a fixed-length vector of numbers that represents the meaning of a piece of text. Two texts that mean similar things land close together in the vector space; two that mean different things land far apart. This is what lets a system find content by concept rather than by exact words — a query for "how do I keep customers from leaving" can match a document about "churn reduction" even with no shared keywords.

The vector has a fixed dimensionality set by the model. A common, lightweight choice produces 384-dimensional vectors. Every item you want to search — a document, a caption, a record, a chat turn — is embedded once into such a vector and stored; queries are embedded the same way at search time.

Semantic search via cosine similarity

Once text is in vector form, "search" becomes a geometry problem. You embed the query, then rank stored vectors by cosine similarity — the cosine of the angle between the query vector and each candidate. A score near 1.0 means near-identical meaning; near 0 means unrelated. Sorting by that score returns the semantically closest items.
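The ranking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference stack's implementation: the function names, the toy three-dimensional vectors, and the item ids are all made up for the example, and a real corpus would use the model's 384-dimensional output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def rank(query_vec, candidates, top_k=3):
    """Return (score, item_id) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), item_id)
              for item_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
corpus = {
    "churn-reduction": [0.9, 0.1, 0.0],
    "office-parking":  [0.0, 0.2, 0.9],
    "retention-tips":  [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
results = rank(query, corpus, top_k=2)
for score, item_id in results:
    print(f"{score:.3f}  {item_id}")
```

If the stored vectors are normalized to unit length ahead of time, the two square roots drop out and the score reduces to a plain dot product, which is a common optimization.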

This is the engine under retrieval-augmented generation (RAG), deduplication, recommendation, and "find me things like this." Crucially, it does not require a language model at query time — only the (cheap) embedding model and a similarity computation — which is why it is one of the most cost-effective AI capabilities to run locally.

Running it locally

A practical sovereign embedding service is a small sentence-transformers model such as all-MiniLM-L6-v2, which emits 384-dimensional vectors and runs fast on CPU or Apple Silicon. In the reference stack it is served as an HTTP sidecar on a dedicated local port (e.g. :9447) exposing embed (single) and embedBatch (bulk) operations, with measured throughput of roughly 80 items/second.

A design choice worth noting: the embedding service is run on demand, not as an always-on daemon. It is booted when a job needs it and left off otherwise. Client code wraps it behind functions like embed / embedBatch / health and honors an environment variable for the endpoint URL, so the same callers can point at a different host without code changes. Migrating an existing pipeline from a remote embedding API to the local sidecar is a one-line base-URL swap, after which embeddings carry no per-call fee.
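A thin client of this kind might look like the sketch below. Everything specific here is an assumption made for illustration: the environment variable name EMBED_URL, the request paths (/embed, /embedBatch, /health), and the JSON field names (text, texts, vector, vectors) are hypothetical and would need to match whatever the actual sidecar exposes. Only the default port comes from the example above.

```python
import json
import os
import urllib.request

DEFAULT_URL = "http://127.0.0.1:9447"  # port from the example in the text

def base_url():
    """Resolve the endpoint; EMBED_URL (hypothetical name) overrides the default."""
    return os.environ.get("EMBED_URL", DEFAULT_URL).rstrip("/")

def _post(path, payload, timeout=30):
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        base_url() + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def embed(text):
    """Embed a single string (assumed path and field names)."""
    return _post("/embed", {"text": text})["vector"]

def embed_batch(texts):
    """Embed many strings in one request (assumed path and field names)."""
    return _post("/embedBatch", {"texts": texts})["vectors"]

def health(timeout=2):
    """Return True if the sidecar answers; False if it is off or unreachable."""
    try:
        with urllib.request.urlopen(base_url() + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection refused, and timeouts
        return False
```

Because the service is booted on demand, a job would typically call health() first and start the sidecar if it returns False. Pointing the same callers at another host is then a matter of setting the environment variable before the job runs.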

Storage and scale

For small-to-moderate corpora, vectors live perfectly well inside an ordinary SQLite database alongside the source rows — no dedicated vector database required. Similarity is computed in application code (or a SQL extension) over the stored vectors. This keeps the entire search stack — data, vectors, and query logic — in one file on one machine, fully sovereign.
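One way to realize this with only the Python standard library is sketched below. The table name, column names, and the choice to store each vector as a float32 BLOB are assumptions for the example, not details of the reference stack, and the toy three-dimensional vectors stand in for real 384-dimensional embeddings.

```python
import math
import sqlite3
from array import array

def to_blob(vec):
    """Pack a list of floats as raw float32 bytes for a BLOB column."""
    return array("f", vec).tobytes()

def from_blob(blob):
    """Unpack float32 bytes back into an array of floats."""
    out = array("f")
    out.frombytes(blob)
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

conn = sqlite3.connect(":memory:")  # a file path in real use
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, embedding BLOB)"
)

# Source rows and their vectors live side by side in the same table.
rows = [
    ("notes on churn reduction", [0.9, 0.1, 0.0]),
    ("parking policy",           [0.0, 0.2, 0.9]),
]
conn.executemany(
    "INSERT INTO docs (body, embedding) VALUES (?, ?)",
    [(body, to_blob(vec)) for body, vec in rows],
)

def search(conn, query_vec, top_k=5):
    """Brute-force scan: score every stored vector in application code."""
    scored = [
        (cosine(query_vec, from_blob(blob)), doc_id, body)
        for doc_id, body, blob in conn.execute(
            "SELECT id, body, embedding FROM docs"
        )
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]

hits = search(conn, [1.0, 0.2, 0.0], top_k=1)
```

Storing raw float32 bytes keeps each 384-dimensional vector at 1,536 bytes; storing the vector as JSON text would also work but is larger and slower to parse.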

The ceiling of the simple approach is a brute-force scan: comparing the query against every stored vector is linear in corpus size, which is fine for thousands to low hundreds of thousands of items. Beyond that, an approximate-nearest-neighbor index is the upgrade path. But for the document, caption, and record corpora typical of an owner-operated stack, plain cosine over SQLite-stored 384-dim vectors is both sufficient and effectively free to run.
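A back-of-the-envelope calculation shows why the linear scan holds up at these sizes. The sketch below assumes float32 storage (4 bytes per dimension), which the text does not specify, and counts only the dot-product multiply-adds per query, ignoring normalization and I/O.

```python
DIMS = 384            # dimensionality of the model described above
BYTES_PER_FLOAT = 4   # assumption: vectors are stored as float32

def scan_cost(n_items, dims=DIMS):
    """Raw vector footprint and multiply-adds for one brute-force query."""
    return {
        "bytes": n_items * dims * BYTES_PER_FLOAT,
        "multiply_adds": n_items * dims,
    }

for n in (10_000, 100_000, 1_000_000):
    cost = scan_cost(n)
    print(f"{n:>9,} items: {cost['bytes'] / 1e6:8.1f} MB, "
          f"{cost['multiply_adds'] / 1e6:7.1f} M multiply-adds per query")
```

Under these assumptions 100,000 items occupy about 154 MB and need about 38 million multiply-adds per query. Both figures grow in direct proportion to corpus size, which is the point at which an approximate-nearest-neighbor index starts to pay for its added complexity.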

Related

Local Inference on Apple Silicon (MLX)

MLX is Apple's array framework for machine learning on Apple Silicon, exploiting unified memory so a single M-series Mac can hold and serve large language and vision models with no discrete GPU. It is the runtime backbone of a Mac-based sovereign stack.

Sovereign AI

Sovereign AI is the practice of running inference, embeddings, and AI agents on hardware you own and control, with no per-token cloud dependency in the default path. It treats the model as a fixed asset rather than a metered utility.

The Economics of $0-Marginal Inference

Once you own the hardware, every additional inference call costs only electricity, collapsing the per-token price toward zero. This inverts the cloud's cost curve, where building more always costs more.

Vision Models Run Locally (Qwen2.5-VL)

Qwen2.5-VL is an open vision-language model that reads images and answers questions about them. Run locally via MLX or LM Studio, it provides private, zero-marginal-cost image tagging, captioning, and visual analysis.
