Private inference: the speed that saves money (and the numbers that prove it)
Price per token is half the cost. The other half is your team waiting. We calculate the exact point where a dedicated slice beats any public API.
The AI pricing conversation stays anchored to one axis: €/M tokens. It's a decent axis for comparing generic text providers — a terrible one once AI enters your real workflow. There the relevant cost isn't tokens, it's your team waiting for them to arrive.
When a dedicated slice makes sense
Public OSS token providers (Groq, Together, Fireworks) sit between 0.60 €/M and 0.90 €/M output for a 70B model. We sit at 1.60 €/M. On paper we're roughly 2× more expensive. Add speed to the equation and the numbers flip.
Llama 3.3 70B on RTX 6000 Ada: ~35 tok/s, 8 hours to generate 1M tokens.
Llama 3.3 70B on 1/4 B200 MIG: ~115 tok/s, 2.4 hours for the same million.
Llama 3.3 70B on a public shared pod: variable, with queues, rate-limits, and ES→US→ES round-trip latency.
Llama 3.3 70B on your dedicated slice: fixed, with zero queue, zero rate-limit, and Madrid→your-VPN latency.
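A minimal sketch of that arithmetic in Python. The throughputs are the figures above; pricing blocked time at the €60/h fully-loaded rate used later in this article is our assumption, and an upper bound, since it treats one developer as fully blocked on the output:

```python
# Back-of-the-envelope: wall-clock time to generate 1M tokens at a given
# sustained throughput, and what that time costs if a developer is
# blocked waiting for it.

HOURLY_RATE_EUR = 60  # fully-loaded senior IT rate (INE ETCL 2025 estimate)

def hours_per_million(tokens_per_second: float) -> float:
    """Wall-clock hours to generate 1M tokens at a sustained throughput."""
    return 1_000_000 / tokens_per_second / 3600

for label, tps in [("RTX 6000 Ada", 35), ("1/4 B200 MIG", 115)]:
    h = hours_per_million(tps)
    print(f"{label:>12}: {tps:>3} tok/s -> {h:.1f} h per 1M tokens "
          f"(~{h * HOURLY_RATE_EUR:.0f} EUR of blocked time)")

# RTX 6000 Ada:  35 tok/s -> 7.9 h per 1M tokens (~476 EUR of blocked time)
# 1/4 B200 MIG: 115 tok/s -> 2.4 h per 1M tokens (~145 EUR of blocked time)
```

Even as an upper bound, the ~€330 gap in blocked time per million tokens dwarfs the ~€1/M gap in list prices. That is the sense in which the numbers flip.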
The real math isn't seconds — it's hours
The obvious cost of a shared coding assistant is the seconds lost per autocomplete. Real, but not what kills quarters. What kills quarters are the hours when the service simply doesn't respond. Two recurring patterns:
Global outages. The public status pages of Claude, OpenAI and GitHub Copilot document dozens of incident hours per service per year. For an organization relying on several of them simultaneously, aggregate exposure easily crosses 50 hours/year. A team of just 10 developers at €60/h fully-loaded (an estimate based on Spain's INE ETCL 2025 for senior IT roles) loses ~€30,000/year in direct unbillable time through that channel alone; the sketch after this list runs the numbers, and it comes to more than double the annual cost of a reserved 1/4 B200 slice. Claude, OpenAI and GitHub Copilot are excellent services, and we use them too, but they are best-effort: no contractual SLA with real penalties for any single enterprise.
Concurrency rate-limits. Monday 9:30, your team kicks off the sprint, and for the first 20 minutes the plugin returns 429 (Too Many Requests). Not an outage: a shared rate-limit, because a bigger tenant is drawing capacity. That lost time never hits a public status dashboard, and your developers just assume "AI is slow today".
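The exposure math from the first pattern, made explicit as a sketch (all figures are the ones quoted above; swap in your own):

```python
# Annual cost of aggregate SaaS-AI downtime for one team, next to the
# annual cost of a reserved slice. Inputs are this article's estimates.

OUTAGE_HOURS_PER_YEAR = 50     # aggregate exposure across several services
TEAM_SIZE             = 10     # developers
HOURLY_RATE_EUR       = 60     # fully-loaded (INE ETCL 2025 estimate)
SLICE_EUR_PER_MONTH   = 1_190  # reserved 1/4 B200 slice

unbillable_year = OUTAGE_HOURS_PER_YEAR * TEAM_SIZE * HOURLY_RATE_EUR
slice_year      = SLICE_EUR_PER_MONTH * 12

print(f"Direct unbillable time: {unbillable_year:,} EUR/year")         # 30,000
print(f"Reserved slice:         {slice_year:,} EUR/year")              # 14,280
print(f"Ratio:                  {unbillable_year / slice_year:.1f}x")  # 2.1x
```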
A dedicated slice doesn't eliminate maintenance windows or failures; we have those too. What it eliminates is the coupling with the saturation of a global service shared among thousands of customers. Your slice is yours, with a contractual SLA, no noisy neighbors, and a local Madrid phone number that gets picked up.
Which model fits — and we run both
An important clarification: we offer both products. Token Factory (per-token, Madrid-hosted) for flexible usage, and the dedicated slice (reserved GPU Compute) for fixed capacity. The choice isn't "us vs. someone else"; it's which of our products fits your usage pattern.
1/4 B200 reserved slice: ~€1,190/month. Practical capacity: 200-300M tokens/month before saturation. Token Factory runs on the same physical infrastructure, in the same jurisdiction.
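A quick amortization sketch with the figures just quoted, purely on €/M. As the paragraph after the table says, pure €/token rarely settles it: at capacity the slice still lists above the Token Factory rate, and its case rests on determinism, jurisdiction, and the blocked-time math above:

```python
# Effective cost per million tokens on a reserved slice at the stated
# practical capacity, next to the Token Factory list price.

SLICE_EUR_PER_MONTH     = 1_190
CAPACITY_M_TOKENS       = (200, 300)  # practical monthly range before saturation
TOKEN_FACTORY_EUR_PER_M = 1.60        # output price quoted earlier

for m in CAPACITY_M_TOKENS:
    print(f"Slice at {m}M tok/month: {SLICE_EUR_PER_MONTH / m:.2f} EUR/M effective")
print(f"Token Factory list price: {TOKEN_FACTORY_EUR_PER_M:.2f} EUR/M")

# Slice at 200M tok/month: 5.95 EUR/M effective
# Slice at 300M tok/month: 3.97 EUR/M effective
# Token Factory list price: 1.60 EUR/M
```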
Low or bursty usage (<50M tokens/month) → Token Factory: doesn't justify reserving capacity.
Variable usage with peaks → Token Factory: elasticity without committing budget.
Constant usage + sensitive data + NIS2/GDPR → dedicated slice: hardware isolation, single jurisdiction.
Deterministic latency or strict contractual SLA → dedicated slice: zero queue, zero rate-limit.
200M+ tokens/month sustained → dedicated slice: amortization and fixed capacity.
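The same decision table, restated as a first-cut rule of thumb. The thresholds come from the table above; the function shape and names are ours, purely illustrative:

```python
# Hypothetical helper encoding the decision table above. Thresholds come
# from the table; everything else is illustrative.

def recommend(m_tokens_per_month: float, bursty: bool,
              sensitive_data: bool, needs_strict_sla: bool) -> str:
    if sensitive_data or needs_strict_sla or m_tokens_per_month >= 200:
        return "dedicated slice"
    if bursty or m_tokens_per_month < 50:
        return "Token Factory"
    return "either: run the amortization numbers"

print(recommend(30, bursty=True, sensitive_data=False, needs_strict_sla=False))
# -> Token Factory
print(recommend(250, bursty=False, sensitive_data=True, needs_strict_sla=True))
# -> dedicated slice
```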
Pure €/token rarely settles the decision. At low volume, Token Factory wins on simplicity. At constant, critical volume, the slice wins on determinism and jurisdiction. And since both run out of the same Tier III datacenter in Madrid, sovereignty is resolved either way.
“The sweet spot is teams of 4-15 developers with constant flow, sensitive code, and a CFO who asks questions. A B200 slice saves them money in the first month.”
What doesn't fit in the spreadsheet
The measurable savings (hours × rate) are half the case. The other half never shows up in the books because it's an avoided cost: the concentration lost every time you wait. A developer who flicks to Reddit for 10 seconds while waiting on an autocomplete has left the task's context. Getting back in costs anywhere from 23 seconds to 23 minutes according to the interruption literature (Mark 2008; Czerwinski 2004). That cost never hits an invoice, but it's the one that kills real productivity.
The sovereignty part
The numbers above are the hard case. But there's a second factor that flips the equation in banking, the public sector, defense, and health: if your public API sits under the CLOUD Act, the dedicated slice stops competing on €/token and starts competing on whether it exists as an option at all. We operate a single Tier III datacenter in Madrid, with ISO 27001 and ENS Medium already certified, VM-level isolation, and hardware MIG slices. When the regulator asks where the data is processed, the answer is one city, one jurisdiction.