The Economics of $0-Marginal Inference

Once you own the hardware, every additional inference call costs only electricity, collapsing the per-token price toward zero. This inverts the cloud's cost curve, where building more always costs more.

The core idea

On a metered API, the price of inference is per token: every call has a marginal cost, and the bill grows with usage forever. On owned hardware, the marginal cost of one more call is just the electricity to compute it — fractions of a cent — so the effective per-token price collapses toward zero once the machine is paid for.

This is the whole economic argument for sovereign AI in one sentence: you convert a variable cost that scales with success into a fixed cost you pay once. After the break-even point, additional usage is essentially free.

Capex versus opex

The cloud is pure opex (operating expense): nothing up front, a recurring meter forever. Sovereign AI is mostly capex (capital expense): a one-time hardware purchase, then negligible running cost. The decision is the classic rent-versus-own trade.

For a saturated Apple Silicon inference box, the recurring side is dominated by power draw on the order of a hundred watts under load — meaning a full day of agent traffic costs dimes, not dollars. There is no Kubernetes bill, no GPU rental, no per-seat SaaS, and no subscription waiting to be renegotiated. The opex line that defined the cloud setup essentially disappears, replaced by a sunk hardware cost that does not grow.
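As a rough check on the "dimes, not dollars" claim, the arithmetic can be sketched as follows. The 100 W sustained draw follows the order of magnitude given above; the $0.15 per kWh tariff and the function name are assumptions chosen for illustration, so substitute your own rate.

```python
def daily_energy_cost(watts: float, price_per_kwh: float, hours: float = 24.0) -> float:
    """Electricity cost of running a machine at a constant draw for `hours`."""
    kwh = watts / 1000.0 * hours  # convert watts to kilowatts, then to kWh
    return kwh * price_per_kwh


# Assumed inputs: 100 W sustained draw, $0.15 per kWh (illustrative tariff).
cost = daily_energy_cost(watts=100, price_per_kwh=0.15)
print(f"${cost:.2f} per day")  # prints "$0.36 per day"
```

At these assumed figures a month of continuous load comes to roughly eleven dollars, which is the scale of the recurring line being compared against a metered bill.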

The inverted incentive

The most important consequence is behavioral, not financial. On a meter, the incentive structure is upside down: building more means paying more, forever. Every new agent you deploy enlarges the bill; every feature that calls the model enlarges it again. Success is taxed, and the tax compounds as the system gets richer.

Sovereign infrastructure inverts that. Once the hardware is paid for, building more and running more cost almost nothing at the margin. A developer stops rationing inference and starts using it liberally (more agent loops, more summarization, more classification, more interactive surfaces) precisely because each additional call no longer shows up on a bill. The architecture rewards the behavior you actually want.

Break-even and the honest split

Break-even is a workload calculation: divide the hardware cost by your avoided monthly API spend, net of the electricity the machine now draws. When a sovereign stack absorbs the bulk of daily token volume, the machine can pay for itself within a handful of months, after which each month's avoided API spend is a saving against the old bill.
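The break-even division can be written out as a small sketch. The function name, the optional running-cost parameter, and the example figures ($4,000 of hardware, $800 per month of avoided API spend) are hypothetical placeholders, not measurements.

```python
import math


def break_even_months(
    hardware_cost: float,
    avoided_monthly_api_spend: float,
    monthly_running_cost: float = 0.0,
) -> float:
    """Months until the one-time hardware cost is recovered.

    Returns math.inf when the machine never pays for itself, i.e. when
    running it costs at least as much as the API spend it replaces.
    """
    net_monthly_saving = avoided_monthly_api_spend - monthly_running_cost
    if net_monthly_saving <= 0:
        return math.inf
    return hardware_cost / net_monthly_saving


# Hypothetical inputs: $4,000 machine, $800/month of API spend avoided.
months = break_even_months(hardware_cost=4000, avoided_monthly_api_spend=800)
print(f"{months:.1f} months")  # prints "5.0 months"
```

Passing a non-zero `monthly_running_cost` (electricity, for example) lengthens the payback slightly. With a draw on the order of a hundred watts the effect is small next to any API bill large enough to justify the purchase.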

The honest framing keeps the cloud in the picture. The durable pattern is local-first, not local-only: serve the high-volume daily mesh (agent decisions, summarization, classification, tagging, chat surfaces) on owned hardware, and route the rare hard-reasoning or long-form task to a frontier API by exception. A common outcome is roughly 95/5 by token volume — local for nearly everything, cloud for the 5% that genuinely needs the smartest model — which captures almost all the savings while preserving frontier quality where it actually matters. Sovereign economics is not about never paying for the cloud; it is about no longer paying for it by default.
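To see how a 95/5 split translates into money, a blended monthly cost can be sketched as below. Every number here is an assumption for illustration: 500 million tokens per month, a flat $10 per million tokens on the metered API, and $11 per month of local electricity. The sketch also assumes the cloud remainder is billed at the same rate as the old all-cloud bill; if the exception traffic goes to a pricier model, the saving shrinks accordingly.

```python
def blended_monthly_cost(
    total_tokens: float,
    cloud_share: float,
    cloud_price_per_million: float,
    local_monthly_running_cost: float,
) -> float:
    """Monthly cost when `cloud_share` of tokens go to a metered API
    and the remainder is served on owned hardware."""
    if not 0.0 <= cloud_share <= 1.0:
        raise ValueError("cloud_share must be between 0 and 1")
    cloud_tokens = total_tokens * cloud_share
    cloud_cost = cloud_tokens / 1_000_000 * cloud_price_per_million
    return cloud_cost + local_monthly_running_cost


# Hypothetical workload and prices (see the assumptions above).
TOKENS = 500_000_000
PRICE = 10.0

all_cloud = TOKENS / 1_000_000 * PRICE
blended = blended_monthly_cost(TOKENS, cloud_share=0.05,
                               cloud_price_per_million=PRICE,
                               local_monthly_running_cost=11.0)
saving = 1 - blended / all_cloud

print(f"all-cloud: ${all_cloud:.2f}")  # prints "all-cloud: $5000.00"
print(f"blended:   ${blended:.2f}")    # prints "blended:   $261.00"
print(f"saving:    {saving:.1%}")      # prints "saving:    94.8%"
```

The hardware purchase itself is deliberately left out of this monthly figure, since it is a one-time cost accounted for in the break-even calculation.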