Local Embeddings & Semantic Search
Embeddings turn text into vectors so meaning can be compared by distance, enabling semantic search and retrieval without keyword matching. A small sentence-transformer running locally delivers this at zero marginal cost.
What an embedding is
An embedding is a fixed-length vector of numbers that represents the meaning of a piece of text. Two texts that mean similar things land close together in the vector space; two that mean different things land far apart. This is what lets a system find content by concept rather than by exact words — a query for "how do I keep customers from leaving" can match a document about "churn reduction" even with no shared keywords.
The vector has a fixed dimensionality set by the model. A common lightweight choice, all-MiniLM-L6-v2, produces 384-dimensional vectors. Every item you want to search (a document, a caption, a record, a chat turn) is embedded once into such a vector and stored; queries are embedded with the same model at search time, because vectors produced by different models are not comparable.
Semantic search via cosine similarity
Once text is in vector form, "search" becomes a geometry problem. You embed the query, then rank stored vectors by cosine similarity: the cosine of the angle between the query vector and each candidate, computed as their dot product divided by the product of their lengths. Scores range from -1 to 1. A score near 1.0 means near-identical meaning; near 0 means unrelated. Sorting by that score in descending order returns the semantically closest items.
This is the engine under retrieval-augmented generation (RAG), deduplication, recommendation, and "find me things like this." Crucially, the search itself does not require a generative language model at query time, only the (cheap) embedding model and a similarity computation, which is why it is one of the most cost-effective AI capabilities to run locally.
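The ranking step described in this section can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names `cosine_similarity` and `rank` are hypothetical, and the toy 3-dimensional vectors stand in for real 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def rank(query_vec, candidates):
    """Return (score, item_id) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), item_id)
              for item_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hand-made toy vectors; a real system would obtain these from the embedding model.
candidates = {
    "churn reduction": [0.9, 0.1, 0.0],
    "office parking": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for "how do I keep customers from leaving"
ranking = rank(query, candidates)
```

If the stored vectors are normalized to unit length when they are created, the two square roots become unnecessary and cosine similarity reduces to a plain dot product.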
Running it locally
A practical sovereign embedding service can be built on a small sentence-transformers model such as all-MiniLM-L6-v2, which emits 384-dimensional vectors and runs fast on CPU or Apple Silicon. In the reference stack it is served as an HTTP sidecar on a dedicated local port (e.g. :9447) exposing embed (single) and embedBatch (bulk) operations, with measured throughput of roughly 80 items/second.
A design choice worth noting: the embedding service is run on demand, not as an always-on daemon. It is booted when a job needs it and left off otherwise. Client code wraps it behind functions like embed / embedBatch / health and honors an environment variable for the endpoint URL, so the same callers can point at a different host without code changes. When the client wrapper already hides the request format, migrating an existing pipeline from a remote embedding API to the local sidecar is a one-line base-URL swap, after which embeddings cost nothing per call. Vectors from the old and new models cannot be mixed, so an existing corpus has to be re-embedded with the local model as part of the switch.
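A thin client wrapper of the kind described above might look like the following sketch. Everything specific in it is an assumption made for illustration: the environment variable name `EMBED_URL`, the routes `/embed`, `/embedBatch` and `/health`, and the JSON request and response shapes are hypothetical and would need to match whatever the actual sidecar exposes. Only the default port is taken from the text.

```python
import json
import os
import urllib.request

# Hypothetical defaults: the env var name, routes and JSON shapes are assumptions.
DEFAULT_URL = "http://127.0.0.1:9447"

def base_url():
    """Resolve the endpoint at call time so an env change needs no code change."""
    return os.environ.get("EMBED_URL", DEFAULT_URL).rstrip("/")

def build_request(route, payload=None):
    """Build a GET (no payload) or JSON POST (with payload) request."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url() + route, data=data, headers=headers)

def _call(route, payload=None, timeout=30):
    with urllib.request.urlopen(build_request(route, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def embed(text):
    """Embed one text; assumes a response of the form {"vector": [...]}."""
    return _call("/embed", {"text": text})["vector"]

def embed_batch(texts):
    """Embed many texts; assumes a response of the form {"vectors": [[...], ...]}."""
    return _call("/embedBatch", {"texts": texts})["vectors"]

def health():
    """True if the service answers; False if it is off or unreachable."""
    try:
        _call("/health", timeout=2)
        return True
    except (OSError, ValueError):  # connection errors, HTTP errors, bad JSON
        return False
```

Because the service is started on demand, a job would typically check `health()` first and boot the sidecar only when that returns False. Pointing the same callers at another host is then a matter of setting the environment variable before the job runs.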
Storage and scale
For small-to-moderate corpora, vectors live perfectly well inside an ordinary SQLite database alongside the source rows — no dedicated vector database required. Similarity is computed in application code (or a SQL extension) over the stored vectors. This keeps the entire search stack — data, vectors, and query logic — in one file on one machine, fully sovereign.
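One way to keep vectors next to their source rows is to pack each one into a BLOB column as little-endian float32 values and compute similarity in application code. The sketch below assumes that layout; the table name `docs`, its columns, and the helper names are hypothetical, and the 4-dimensional toy vectors stand in for 384-dimensional ones.

```python
import math
import sqlite3
import struct

def pack(vec):
    """Serialize a vector as little-endian float32 bytes (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack(blob):
    """Inverse of pack(): recover the list of floats from a BLOB."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

conn = sqlite3.connect(":memory:")  # a file path would persist the data instead
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL, embedding BLOB NOT NULL)"
)

# Hand-made toy vectors; real ones would come from the embedding service.
rows = [
    ("churn reduction playbook", [0.5, 0.5, 0.0, 0.0]),
    ("office parking rules", [0.0, 0.0, 0.5, 0.5]),
    ("customer retention notes", [0.5, 0.25, 0.0, 0.0]),
]
conn.executemany(
    "INSERT INTO docs (body, embedding) VALUES (?, ?)",
    [(body, pack(vec)) for body, vec in rows],
)

def search(conn, query_vec, k=3):
    """Score every stored vector against the query and return the top k."""
    scored = [
        (cosine(query_vec, unpack(blob)), body)
        for body, blob in conn.execute("SELECT body, embedding FROM docs")
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

results = search(conn, [1.0, 1.0, 0.0, 0.0], k=2)
```

Storing float32 halves the size of each vector relative to Python's native 64-bit floats, at a precision loss that is usually negligible for ranking.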
The simple approach is limited by its brute-force scan: comparing the query against every stored vector is linear in corpus size, which is fine for thousands to low hundreds of thousands of items. Beyond that, an approximate-nearest-neighbor index is the upgrade path. But for the document, caption, and record corpora typical of an owner-operated stack, plain cosine over SQLite-stored 384-dim vectors is both sufficient and effectively free to run.
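A rough back-of-envelope calculation shows why the linear scan holds up at these sizes. The sketch below assumes vectors are stored as float32 (4 bytes per dimension) and pre-normalized, so each comparison is a single 384-term dot product; the function name `scan_cost` is hypothetical, and the figures are raw counts rather than measured timings.

```python
DIM = 384              # dimensionality of the vectors discussed above
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def scan_cost(num_vectors, dim=DIM):
    """Return (bytes of vector data read, multiply-adds) for one query scan."""
    bytes_total = num_vectors * dim * BYTES_PER_FLOAT32
    multiply_adds = num_vectors * dim
    return bytes_total, multiply_adds

for n in (1_000, 100_000, 10_000_000):
    size, ops = scan_cost(n)
    print(f"{n:>12,} vectors: {size / 1e6:>10,.1f} MB scanned, "
          f"{ops / 1e6:>8,.1f} M multiply-adds per query")
```

At 100,000 items a query touches about 154 MB of vector data and performs about 38 million multiply-adds, which is a modest amount of work for current hardware. At 10 million items the same scan touches about 15 GB per query, which is where an approximate index starts to pay for its added complexity.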