Trace Your Family Tree With SQLite and Cosine Similarity

Before 1870, genealogy stops being a search problem and becomes a record-linkage problem. Working backward, the federal census stops naming my ancestors at 1860: before emancipation, enslaved people were tallied as hash marks on slave schedules, counted by age and sex, never by name. Every consumer ancestry product I tried hit that wall and handed me a "we couldn't find more records" page, which is a polite way of saying the engineering got hard and the subscription gave up.

So I stopped renting the search and built the database. Code Black is a plain SQLite file on a machine I own — 229 persons and their relationships, pulled from slave schedules, Freedmen's Bureau registers, censuses, and estate ledgers that were never designed to join cleanly. Getting them to join is the whole job, and the job is record linkage: a database join plus a similarity score you can run yourself. No magic — just a schema, a normalized name column, and cosine similarity doing the work a string match can't.

Why The SaaS Quits At 1870 — And Why I Made It A Database

The big ancestry platforms are tuned for the easy 90%: well-indexed, name-rich, post-Reconstruction records where a surname match is enough. That covers most American families. It does not cover mine. The records I need — 1850 and 1860 slave schedules, Freedmen's Bureau labor contracts, enslaver estate inventories — are sparse, handwritten, inconsistently spelled, and split across archives that don't share keys.

A subscription product can't sit with that ambiguity. It needs a clean hit to show you a leaf and keep you paying, so when the records get noisy it shows you nothing — "six candidates at 40% confidence each" is bad UX for a consumer funnel. But that list of fuzzy candidates is the real research, where a human holding the full network of who-lived-near-whom makes the call. A database doesn't have a funnel; it has rows. The platform's failure mode is silence. Mine is a ranked shortlist I get to keep, score, and sharpen as new records come in.

The Schema: Persons, Relationships, Sources

The structure is deliberately boring, because boring survives. Three core tables.

persons — one row per human: name, birth/death year ranges, places, a natural_key for identity, and a norm_name (normalized name) column that is the linchpin of the system. I also flag whether a person appears as enslaved or as an enslaver, because that context changes how I read every other field.
relationships — the edges. A row says person A is parent/child/spouse/sibling of B, with a confidence value and a pointer to the evidence. Kinship as its own table — not columns on persons — is what makes a bloodline a graph instead of a flat list.
sources — provenance. Every fact ties back to a record: which register, which archive ID, which page. A claim with no source isn't a fact; it's a guess waiting to be falsified.

That's it. A bloodline is rows, and the norm_name column is where the hard part lives.
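A minimal sketch of those three tables, using Python's built-in sqlite3 module. Only persons, relationships, sources, natural_key, and norm_name come from the description above; every other column name, the 'other' role value, and the open_db helper are assumptions made for illustration, not the actual Code Black schema.

```python
import sqlite3

# Hypothetical column names; only the table names, natural_key and
# norm_name are taken from the article.
SCHEMA = """
CREATE TABLE sources (
    id          INTEGER PRIMARY KEY,
    register    TEXT NOT NULL,   -- which schedule, register, or ledger
    archive_id  TEXT,
    page        TEXT
);
CREATE TABLE persons (
    id             INTEGER PRIMARY KEY,
    name           TEXT,         -- as written in the record; may be NULL
    norm_name      TEXT,         -- normalized form used for matching
    natural_key    TEXT UNIQUE,  -- stable identity key for upserts
    birth_year_min INTEGER,
    birth_year_max INTEGER,
    death_year_min INTEGER,
    death_year_max INTEGER,
    place          TEXT,
    role           TEXT CHECK (role IN ('enslaved', 'enslaver', 'other'))
);
CREATE TABLE relationships (
    id          INTEGER PRIMARY KEY,
    person_a    INTEGER NOT NULL REFERENCES persons(id),
    person_b    INTEGER NOT NULL REFERENCES persons(id),
    kind        TEXT NOT NULL
                CHECK (kind IN ('parent', 'child', 'spouse', 'sibling')),
    confidence  TEXT NOT NULL
                CHECK (confidence IN ('high', 'medium', 'low')),
    source_id   INTEGER NOT NULL REFERENCES sources(id)  -- no source, no row
);
CREATE INDEX idx_persons_norm_name ON persons(norm_name);
"""

def open_db(path=":memory:"):
    """Create the tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn

conn = open_db()
```

Making source_id NOT NULL is one way to push the "a claim with no source is a guess" rule into the database itself, so an unsourced relationship is rejected at insert time rather than caught in review.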

The Hard Problem: One Ancestor, Six Spellings

Here is the actual obstacle. The same woman shows up as Mintey, Minty, Minta, Araminta, Mintie, and once as just an unnamed age-and-sex tally on a schedule. A clerk in 1850, a Bureau agent in 1866, and a census taker in 1870 each spelled the name however it landed in their ear. Surnames are worse — enslaved people were recorded under an enslaver's name, then often took a different one at freedom, so the surname thread can snap entirely at 1865.

A literal string match, WHERE name = 'Mintey', finds exactly one of those spellings and silently drops the rest. That's the failure that ends a family line. So linkage can't be equality. It has to be similarity across three axes at once:

Name — fuzzy, because spelling is noise. Normalize aggressively (lowercase, strip punctuation, collapse nicknames) into norm_name, then compare.
Place — a county, a parish, an estate. Two "Minty" records 600 miles apart are probably two women; two in the same Louisiana parish are probably one.
Date — a birth-year window. Records give ranges, not points, so I compare overlapping ranges, not exact years.

A match is a weighted agreement across all three. Name alone lies; name plus place plus a plausible age is a candidate worth ranking.
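The three axes can be sketched as one weighted score. This is an illustration under stated assumptions: the nickname table, the weights, the exact-match place rule, and the record layout are all invented for the example, and difflib's SequenceMatcher stands in for whatever fuzzy comparison a real pipeline would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be built from the records.
NICKNAMES = {"minty": "araminta", "mintey": "araminta",
             "mintie": "araminta", "minta": "araminta"}

# Assumed weights, chosen only for illustration.
WEIGHTS = {"name": 0.5, "place": 0.3, "date": 0.2}

def normalize_name(raw):
    """Lowercase, strip punctuation, collapse known nicknames."""
    cleaned = re.sub(r"[^a-z\s]", "", raw.lower())
    return " ".join(NICKNAMES.get(tok, tok) for tok in cleaned.split())

def name_score(a, b):
    """Fuzzy similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()

def place_score(a, b):
    """1.0 if the place strings match ignoring case and padding, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def year_overlap_score(range_a, range_b):
    """Years shared by two inclusive (min, max) ranges, divided by the
    width of the narrower range. 0.0 when the ranges do not overlap."""
    lo = max(range_a[0], range_b[0])
    hi = min(range_a[1], range_b[1])
    if hi < lo:
        return 0.0
    narrower = min(range_a[1] - range_a[0], range_b[1] - range_b[0]) + 1
    return (hi - lo + 1) / narrower

def match_score(rec_a, rec_b):
    """Weighted agreement across name, place, and birth-year window."""
    return (WEIGHTS["name"] * name_score(rec_a["name"], rec_b["name"])
            + WEIGHTS["place"] * place_score(rec_a["place"], rec_b["place"])
            + WEIGHTS["date"] * year_overlap_score(rec_a["birth"], rec_b["birth"]))

# Invented records with placeholder place names.
target = {"name": "Mintey", "place": "Example Parish", "birth": (1828, 1832)}
nearby = {"name": "Minty.", "place": "example parish", "birth": (1830, 1834)}
distant = {"name": "Minty", "place": "Other County", "birth": (1850, 1855)}

print(match_score(target, nearby) > match_score(target, distant))  # True
```

Both candidates tie on name once the spellings collapse to the same normalized form, so place and date are what separate them, which is the point of scoring all three axes together.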

Embeddings And Cosine Similarity For Record Linkage

Fuzzy string matching (edit distance, phonetic codes like Soundex) handles spelling drift, but it's blind to context. To rank candidates properly I turn each record into a vector: the same embedding-plus-cosine trick I use everywhere in my stack, pointed at genealogy.

The pipeline is small and local. I run a 384-dimension sentence-transformer embedding model on-device, zero per-call cost, nothing leaving the house. For each record I build a short text blob (name, places, years, role) and embed it into a vector stored beside the data. To match a target person I embed their blob the same way and compute cosine similarity against the pool: the cosine of the angle between two vectors, 1.0 when they point the same way, near 0 when they are unrelated, and negative (down to -1) when they point in opposite directions. Rank by that score and the variant spellings of Mintey cluster at the top while a coincidental same-name stranger in another state sinks.
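Here is one way the store-and-rank step could look, with the embedding model left out. The embeddings table, the float32 BLOB layout, and the helper names are assumptions, and the three-number vectors are toy stand-ins for the 384-dimension vectors a real model would produce. Cosine similarity is the dot product of two vectors divided by the product of their lengths.

```python
import math
import sqlite3
from array import array

def pack(vec):
    """Serialize a list of floats to bytes (float32) for a BLOB column."""
    return array("f", vec).tobytes()

def unpack(blob):
    """Inverse of pack(): bytes back to a list of floats."""
    out = array("f")
    out.frombytes(blob)
    return list(out)

def cosine(a, b):
    """Dot product over the product of vector lengths; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(conn, target_vec, limit=10):
    """Score every stored vector against the target, best match first."""
    rows = conn.execute("SELECT record_id, vector FROM embeddings").fetchall()
    scored = [(cosine(target_vec, unpack(blob)), rid) for rid, blob in rows]
    scored.sort(reverse=True)
    return scored[:limit]

# Hypothetical side table holding one vector per record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (record_id INTEGER PRIMARY KEY, vector BLOB NOT NULL)"
)

# Toy vectors: records 1 and 2 point roughly the same way, record 3 does not.
toy = {1: [0.9, 0.1, 0.0], 2: [0.8, 0.2, 0.1], 3: [0.0, 0.1, 0.9]}
conn.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [(rid, pack(vec)) for rid, vec in toy.items()],
)

top = rank(conn, [1.0, 0.0, 0.0])
print([rid for _, rid in top])  # [1, 2, 3]
```

A full scan in Python is fine for a pool of a few hundred records; nothing here needs a vector index.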

The payoff beyond spelling is the FAN network — Friends, Associates, and Neighbors. You identify a person not in isolation but by the cluster who recur around them: the same enslaver, the same witnesses, the same neighbors. A string match can't see that an estate inventory and a labor contract describe the same household; embeddings can, because the surrounding names and places pull the vectors close even when the target's own name is spelled differently or missing. Cosine similarity surfaces the tie the string match walks right past — which is how a hash mark on a schedule gets connected to a named woman in a Bureau register.
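To make the FAN effect concrete without an embedding model, the sketch below swaps in plain bag-of-words token counts as the vector. That is a deliberately cruder technique than sentence embeddings (it cannot see that two spellings are related), but it shows the same mechanism: shared surrounding names and places pull two records together even when the target's own name is missing. All records, names, and places are invented placeholders.

```python
import math
import re
from collections import Counter

def build_blob(record):
    """Join name, place, years, role, and FAN names into one text blob."""
    parts = [record.get("name") or "", record.get("place", ""),
             record.get("years", ""), record.get("role", "")]
    parts.extend(record.get("fan", []))
    return " ".join(p for p in parts if p)

def bag_of_words(text):
    """Token counts: a stand-in vector, NOT a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# An unnamed tally on a schedule, identified only by its surroundings.
unnamed = {"name": None, "place": "Example Parish", "years": "1850",
           "role": "enslaved", "fan": ["J Doe estate", "R Roe", "S Poe"]}
# A named record in the same household context.
named = {"name": "Minty", "place": "Example Parish", "years": "1866",
         "fan": ["J Doe", "R Roe", "S Poe"]}
# Same name, different place and different people around her.
stranger = {"name": "Minty", "place": "Other County", "years": "1866",
            "fan": ["A Smith", "B Jones"]}

target_vec = bag_of_words(build_blob(unnamed))
linked = cosine(target_vec, bag_of_words(build_blob(named)))
unlinked = cosine(target_vec, bag_of_words(build_blob(stranger)))
print(linked > unlinked)  # True
```

The unnamed record has no name to string-match against at all, yet it scores well against the named record because the household around both is the same. The same-name stranger shares nothing and scores zero.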

Keeping It Honest: Confidence, Flags, No Fabrication

A similarity score is a hypothesis, not a verdict, and the fastest way to ruin a family tree is to let a high cosine number auto-promote into a hard fact. So the system stays skeptical of itself.

Every relationship row carries a confidence value (high, medium, or low) based on how many of the three axes agree and how strong the source is. A low-confidence link goes into the tree marked low, never laundered into certainty. When the records don't support a connection, the answer is no connection, not a plausible bridge invented to close a gap. I've caught my own bugs this way, including a column scramble that mis-wired over a hundred relationships and duplicate persons from an upsert matching the wrong key, because every claim is auditable back to a source, so contradictions surface instead of hiding.
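One possible shape for that rule, as a sketch. The thresholds, the 'strong'/'weak' source vocabulary, and both function names are assumptions; the article only says confidence depends on how many axes agree and how strong the source is. The point to notice is that the cosine score is recorded but never raises the label.

```python
def confidence_label(axes_agreeing, source_strength):
    """Map axis agreement (0-3) and source strength to a label.

    Hypothetical thresholds. Returns None when no axis agrees,
    meaning no link should be recorded at all.
    """
    if axes_agreeing not in (0, 1, 2, 3):
        raise ValueError("axes_agreeing must be 0, 1, 2, or 3")
    if axes_agreeing == 3 and source_strength == "strong":
        return "high"
    if axes_agreeing >= 2:
        return "medium"
    if axes_agreeing == 1:
        return "low"
    return None

def propose_link(cosine_score, axes_agreeing, source_strength, source_id):
    """Build a candidate relationship row, or None if it cannot be sourced.

    The similarity score is kept for audit but does not influence the
    confidence label, so a high cosine cannot promote itself to fact.
    """
    if source_id is None:
        return None  # unsourced claims never enter the tree
    label = confidence_label(axes_agreeing, source_strength)
    if label is None:
        return None  # records do not support a connection
    return {
        "confidence": label,
        "source_id": source_id,
        "cosine": cosine_score,
        "status": "hypothesis",  # a human review step would change this
    }

print(propose_link(0.99, 1, "weak", source_id=7)["confidence"])  # low
```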

The rule is absolute: the database may rank, suggest, and cluster, but it does not get to fabricate a bloodline. If the evidence isn't there, the row says so. A family tree that lies to you is worse than one with honest blanks.

Why Owning The Data Matters

These are my ancestors. The idea that the canonical record of my family — names clawed back from hash marks, parish links, enslaver ledgers cross-referenced over weeks — would live inside a platform that can paywall it, kill the export, or change terms on me is unacceptable. That's not an archive; that's a hostage.

A SQLite file is the opposite: one portable file, version-controlled, backed up, readable by anything, owned outright. The embeddings sit next to it, the whole engine runs on my desk for zero marginal cost, no platform sees the corpus, and if a tool dies tomorrow the data outlives it untouched. Record linkage is engineering, and engineering you can run yourself is engineering nobody can take back. For a history someone already tried once to erase, ownership isn't a feature. It's the entire point.

FAQ

Do I really need embeddings, or is fuzzy string matching enough?

Fuzzy string matching handles spelling drift on a single name, and you should start there. But it can't see the FAN network — the cluster of recurring enslavers, witnesses, and neighbors that identifies a person across documents. Embeddings plus cosine similarity capture that context, which is where the hardest pre-1870 links live.

Why SQLite instead of a real graph database?

A bloodline of a few hundred persons fits comfortably in SQLite, and a relationships table models the graph fine at this scale. SQLite is one portable file you fully own, with no server to run and no vendor to depend on. Reach for a dedicated graph store only when you're well past thousands of nodes, which a family tree rarely is.
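As an illustration of the graph living in a plain table, SQLite's recursive common table expressions can walk a line of descent with no graph engine. The trimmed-down tables, the sample rows, and the convention that person_a is the parent of person_b are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE relationships (
    person_a   INTEGER REFERENCES persons(id),
    person_b   INTEGER REFERENCES persons(id),
    kind       TEXT,
    confidence TEXT
);
-- Placeholder people, not real records.
INSERT INTO persons VALUES
    (1, 'Child'), (2, 'Mother'), (3, 'Grandmother'), (4, 'Unrelated');
-- Convention assumed here: person_a is the parent of person_b.
INSERT INTO relationships VALUES
    (2, 1, 'parent', 'high'),
    (3, 2, 'parent', 'medium');
""")

# Walk parent edges upward from one person. UNION drops repeated rows and
# the depth cap keeps a mis-wired cycle from recursing without end.
ANCESTORS = """
WITH RECURSIVE ancestors(id, depth) AS (
    SELECT person_a, 1
    FROM relationships
    WHERE kind = 'parent' AND person_b = ?
    UNION
    SELECT r.person_a, a.depth + 1
    FROM relationships r
    JOIN ancestors a ON r.person_b = a.id
    WHERE r.kind = 'parent' AND a.depth < 20
)
SELECT p.name, a.depth
FROM ancestors a
JOIN persons p ON p.id = a.id
ORDER BY a.depth
"""

rows = conn.execute(ANCESTORS, (1,)).fetchall()
print(rows)  # [('Mother', 1), ('Grandmother', 2)]
```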

How do you avoid inventing connections that aren't real?

Every relationship carries a confidence value and a pointer to its source record, so nothing enters the tree unaudited. A high similarity score is treated as a hypothesis to verify, never an automatic fact, and low-confidence links stay flagged as low-confidence. When the records don't support a link, the honest answer is no link.

Can someone use this for their own family, not just Black American genealogy?

Yes. Record linkage — a join plus a similarity score across name, place, and date — is general. Any research where the same person appears spelled differently across messy, un-joined records benefits from the same schema and the same embedding-plus-cosine approach. The pre-1870 wall is the hardest case, which is exactly why it's the best test of the method.