GPU Solutions


Your private AI environment, running 10× faster. Truly sovereign.

Coding assistant and inference endpoints with the latest open-source models (GLM 5.1, Qwen 3.6, Llama 3.3, DeepSeek V3.5) on dedicated NVIDIA B200 GPUs in Madrid. Your code and prompts never leave the perimeter.

10×

Faster than a MacBook M4 Max on the same model

3.2×

Faster than an RTX 6000 Ada workstation

95 ms

Time to first token (2k prompt)

3-5

Concurrent developers per slice

How the slice works

Your slice is yours. By hardware. All the time.

We use NVIDIA Multi-Instance GPU (MIG): the B200 is physically partitioned into isolated instances. Each slice has its own compute, HBM3e memory, cache, and bandwidth. You don't compete with anyone for cycles. Your 1/4 is always your 1/4, even when the rest of the GPU is maxed out.

  • Hardware isolation (not time-slicing, not virtualization): SMs, memory, and cache are physically separated between slices.
  • Guaranteed bandwidth: your share of HBM3e doesn't slow down if other customers saturate their slice.
  • Reserved 24/7 with a monthly contract, or on-demand by the hour when you hit traffic peaks.
NVIDIA B200 · MIG
192 GB HBM3e, partitioned into four 1/4 slices of 48 GB each
Dedicated compute · memory · cache · bandwidth ≈ 2 TB/s per slice

Each slice = isolated SMs + HBM3e + L2 cache + NVDEC/NVENC · no noisy neighbor
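
If you want to check the partitioning from inside your own environment, NVML exposes the MIG topology directly. A minimal sketch using the nvidia-ml-py bindings; it assumes your pod or VM can see the parent GPU, which depends on how the slice is exposed to you:

```python
# Sketch: inspect MIG partitioning via NVML (pip install nvidia-ml-py).
# Assumption: the environment exposes the parent B200; some deployments only
# expose the MIG instance itself, in which case index 0 *is* your slice.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
print("MIG enabled:", current == pynvml.NVML_DEVICE_MIG_ENABLE)

# Walk the MIG devices carved out of this GPU and show their dedicated memory.
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # slot not in use
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG slice {i}: {mem.total / 1e9:.0f} GB dedicated")

pynvml.nvmlShutdown()
```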

Real speed

Same models — only the place they run changes.

Tokens per second, single-user inference, on Llama 3.3 70B, Qwen 3.6 Coder 32B and GLM 5.1 235B. The gap isn't subtle, and it decides whether a coding assistant feels instant or frustrating.

Sources: NVIDIA MLPerf Inference v4.1 · Blackwell whitepaper · vLLM · Apple MLX · LocalLLaMA. Conservative numbers.

Metric | MacBook Pro M4 Max (128 GB unified · MLX · Q4) | RTX 6000 Ada (48 GB · AWQ-4bit · workstation) | 1/4 B200 · GPU Solutions (MIG · 48 GB HBM3e · native FP8)
Available memory | ≈ 96 GB usable | 48 GB GDDR6 | 48 GB HBM3e
Memory bandwidth | 546 GB/s | 960 GB/s | ≈ 2 TB/s
Peak compute | 34 TFLOPS FP16 | 365 TFLOPS FP8 | 1.1 PFLOPS FP8
Llama 3.3 70B | 12 tok/s | 36 tok/s | 115 tok/s
Qwen 3.6 Coder 32B | 48 tok/s | 88 tok/s | 320 tok/s
GLM 5.1 235B · MoE | 22 tok/s | 62 tok/s | 205 tok/s
TTFT · 2k prompt | 820 ms | 450 ms | 95 ms
Concurrent devs | 1 | 1-2 | 3-5
Context | Senior engineer laptop | Workstation ~€8,500 | From €750/month · no CapEx

LLM inference is memory-bandwidth bound, not FLOPS-bound. HBM3e delivers ~2× the bandwidth of RTX 6000 Ada's GDDR6 and ~4× the M4 Max's unified memory — that's why a B200 slice beats both on the same models. Large models (72B+, MoE) don't fit on workstations without quality loss. On B200 they fit at native FP8 precision.
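
A rough footprint check makes the capacity point concrete. The sketch below counts only weight bytes (no KV cache or activations), so real requirements are higher; figures are approximate:

```python
# Rough weight-only memory footprint per model and precision (GB).
# Ignores KV cache and activations, so treat these as lower bounds.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "Q4": 0.5}
MODELS_B = {"Qwen 3.6 Coder 32B": 32, "Llama 3.3 70B": 70}
DEVICES_GB = {"RTX 6000 Ada": 48, "1/4 B200": 48, "1/2 B200": 96, "Full B200": 192}

for name, params_b in MODELS_B.items():
    for prec, bpp in BYTES_PER_PARAM.items():
        weights_gb = params_b * bpp
        fits = [d for d, gb in DEVICES_GB.items() if weights_gb <= gb]
        print(f"{name} @ {prec}: ~{weights_gb:.0f} GB -> fits: {fits or 'none'}")
```

At FP8 a 70B model needs ~70 GB of weights alone, which is why it lands on a 1/2 or full B200 rather than a 48 GB workstation card, where only a 4-bit quantization fits.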

Why a dedicated slice

Your AI, inside your perimeter. No exceptions.

With a public API, your prompts train the next model and your data crosses three continents before returning. With a dedicated slice in Madrid, nothing leaves. Same model, isolated environment, compliance by design — and on top, 10× faster.

What happens in your slice, stays in your slice

Privacy, compliance and sovereignty built in. Not add-ons.

01

Data in Spain, 100%

Prompts, embeddings and responses never leave Madrid. Zero CLOUD Act exposure, zero US sub-processors, zero international transfers for Legal to sign.

02

Private model and context

Your B200 slice is yours with MIG hardware isolation. Your inputs don't train the next model, and your throughput doesn't depend on the tenant next door. Nobody else touches your weights.

03

ISO 27001 + ENS Medium included

Your auditor gets the certificates directly. Your CISO closes due diligence without expanding the SoA. No extra audits, no ambiguous DPAs.

04

Dedicated endpoint, not shared

Private HTTPS with mTLS + VPN, only reachable from your IPs. No enforced rate limits, no inference queues. The latency is yours, 24/7 (see the client sketch after this list).

05

InfiniBand co-location

Your pod, your storage and your tokens live in the same rack, wired over InfiniBand. Fewer hops, lower latency, zero cross-region egress. Your multi-step agent doesn't choke on the network.
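
As an illustration of what the dedicated endpoint in point 04 looks like from your side, here is a minimal Python call over mTLS. The hostname, certificate paths, and request schema are placeholders, not the actual GPU Solutions interface:

```python
# Minimal mTLS call to a private inference endpoint (hypothetical URL and schema).
# The client certificate proves your identity; the CA bundle pins the server.
import requests

resp = requests.post(
    "https://inference.example-slice.internal/v1/chat/completions",  # placeholder
    cert=("client.crt", "client.key"),   # your client certificate + private key
    verify="private-ca.pem",             # private CA that signed the endpoint cert
    json={
        "model": "llama-3.3-70b",
        "messages": [{"role": "user", "content": "Summarise this function..."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```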

The analogy

Madrid → New York is the same 5,750 km. By ship or by plane.

By ship

5,750 km

10 days

By plane

5,750 km

7 hours

Nobody pays for kilometers. You pay to arrive on time.

Same thing in AI

One million Llama 3.3 70B tokens. Depending on where it runs.

MacBook M4 Max · 12 t/s

1M tokens

23 hours

RTX 6000 Ada · 36 t/s

1M tokens

8 hours

1/4 B200 at GPU Solutions · 115 t/s

1M tokens

2.4 hours

Same work done. One tenth the time your team spends waiting.
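
The arithmetic behind those figures is just tokens divided by sustained throughput:

```python
# Hours to generate 1M tokens at each sustained decode speed from the cards above.
for label, tok_per_s in {"MacBook M4 Max": 12, "RTX 6000 Ada": 36, "1/4 B200": 115}.items():
    hours = 1_000_000 / tok_per_s / 3600
    print(f"{label}: {hours:.1f} h")   # ~23.1 h, ~7.7 h, ~2.4 h
```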

And time pays for itself too

Operational savings are a side effect. They still cover the slice 5× over.

01

Team

10 devs

× €80/h

02

Idle time

30 min/day

× 220 workdays

03

Annual cost lost

€88,000

1,100 h/year idle

04

Annual 1/4 slice

€14,280

1/4 slice reserved

Return on time

+ €73,720/year

≈ 5× the slice cost
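
The same numbers, worked through:

```python
# Back-of-the-envelope ROI: idle developer time vs. the annual cost of a 1/4 slice.
devs, rate_eur_h = 10, 80
idle_h_per_day, workdays = 0.5, 220

idle_hours = devs * idle_h_per_day * workdays   # 1,100 h/year
idle_cost = idle_hours * rate_eur_h             # €88,000/year lost to waiting
slice_cost = 1_190 * 12                         # €14,280/year, reserved 1/4 slice
print(f"Idle cost €{idle_cost:,.0f} · slice €{slice_cost:,.0f} · net €{idle_cost - slice_cost:,.0f}")
```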

The real reason to switch is sovereignty and compliance. Recovered time is the bonus that wins over Finance.

Your data, your model, your latency. And your team stops waiting, too.

Mix them

Three modes. You build the combo.

Reserve a slice for your own model. Add hourly bursts when traffic spikes. And call Token Factory endpoints for a big model when you don't want to manage the GPU. All in the same cluster, all sovereign, each line billed separately, no surprises.

01 / Reserved · €/month

€/month · dedicated GPU

Fixed monthly fee for a 24/7 MIG slice. The GPU is yours: start and stop whenever without losing the assignment. Ideal for dev teams and stable production.

Best for stable production

02 / On-demand · €/hour

€/hour · pay as you use

Spin up a slice or a full GPU and pay hourly until you shut it down. No commitment, no reservation needed. Available immediately via dashboard or API.

Best for spikes and POCs

03 / Endpoints · €/1M tokens

€/1M tokens · Token Factory

Pay only for the tokens the model generates. No GPU management. Call the private HTTPS endpoint from your app. Perfect for variable-scale production inference.

Best for product inference
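
If the endpoint speaks an OpenAI-compatible API (an assumption here, not a confirmed detail of Token Factory), pointing an existing client at the private base URL is usually the only change in app code:

```python
# Sketch: swapping a public API for a private endpoint in application code.
# Assumes an OpenAI-compatible surface; base URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://tokens.example-madrid.internal/v1",  # placeholder private endpoint
    api_key="YOUR_PROJECT_KEY",
)
out = client.chat.completions.create(
    model="qwen-3.6-coder-32b",
    messages=[{"role": "user", "content": "Write a unit test for parse_invoice()."}],
)
print(out.choices[0].message.content)
```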

GPU Compute with MIG

From 1/4 to full cluster. Always dedicated.

Three MIG slice sizes (1/4, 1/2, full GPU), plus the HGX 8× cluster for training and enterprise workloads. Same API, same per-slice latency, scale from prototype to production without migration.

01 / Slice

1/4

B200

Memory · 48 GB HBM3e
Bandwidth · ≈ 2 TB/s

Coding assistant for 3-5 devs · light fine-tuning · models up to 70B with large context. The entry point.

Reserved

€1,190/month

On-demand

€1.95/hour

Get started

02 / Half

1/2

B200

Most popular
Memory · 96 GB HBM3e
Bandwidth · ≈ 4 TB/s

Real production for 8-12 devs · 70B inference at native FP8 precision · training of small-to-mid models.

Reserved

€2,290/month

On-demand

€3.95/hour

Talk to sales

03 / Full B200

1 ×

B200

Memory · 192 GB HBM3e
Bandwidth · 8 TB/s

72B models at FP8 full precision · high-throughput inference for teams of 15+ devs · distributed training.

Reserved

€5,990/month

On-demand

€7.90/hour

Talk to sales

04 / HGX Cluster

8 × B200

8× B200 with intra-node NVLink 5 and inter-node InfiniBand NDR · foundation model training · inference at scale · dedicated enterprise compliance.

Memory · 1.5 TB HBM3e
Bandwidth · 64 TB/s aggregate
Talk to sales

Token Factory

The latest open-source models. Served fast.

We charge a bit more per million tokens. In return, your prompts and context never leave Madrid, and tokens are generated in the same cluster where your pod lives, wired over InfiniBand. More sovereignty and, because the endpoint sits right next to your pod, more speed.

Model | Params | Context | Input / 1M | Output / 1M | Speed (1/4 B200)
GLM 5.1 (new) | 235B · MoE | 200k | €0.90 | €2.40 | 180 t/s
Qwen 3.6 | 72B | 256k | €0.70 | €1.80 | 140 t/s
Qwen 3.6 Coder (coding) | 32B | 256k | €0.40 | €1.10 | 320 t/s
Qwen 3.6 (fast) | 14B | 128k | €0.20 | €0.55 | 540 t/s
Llama 3.3 | 70B | 128k | €0.60 | €1.60 | 115 t/s
DeepSeek V3.5 (fast) | 236B · MoE | 128k | €0.45 | €1.20 | 220 t/s
Mistral Large 3 | 123B | 128k | €0.85 | €2.20 | 95 t/s

Prices in euros per million tokens, pay-as-you-go, public list for retail volume. Speed in tokens/second single-user on a 1/4 B200 slice; 1/2 and full scale proportionally. High volume or your own fine-tuned model on a dedicated slice? We deploy on a private endpoint at a negotiated rate — ask us.
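
To estimate a monthly bill from the list prices above, a quick sketch with an assumed input/output split:

```python
# Monthly Token Factory cost estimate from the list prices above (€ per 1M tokens).
PRICES = {  # model: (input €/1M, output €/1M)
    "Qwen 3.6 Coder 32B": (0.40, 1.10),
    "Llama 3.3 70B": (0.60, 1.60),
}

def monthly_cost(model, input_m_tokens, output_m_tokens):
    p_in, p_out = PRICES[model]
    return input_m_tokens * p_in + output_m_tokens * p_out

# Example: 300M input + 60M output tokens of coding assistance per month.
print(f"€{monthly_cost('Qwen 3.6 Coder 32B', 300, 60):,.2f}")  # €186.00
```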

Where your code lives

Three places. One gives you all of them.

There's no always-right option. There's one that combines speed, privacy, and capacity — and two that force a trade-off.

01 / On your laptop

Local, on your machine

Maximum physical privacy — nothing leaves the device — but bounded by RAM and bandwidth. Big models don't fit or run slowly. Your laptop is unusable during inference.

Speed: 15
Privacy: 70
Model capacity: 20

Gains privacy · loses speed and capacity

02 / Public API

Third-party API

Fast, with powerful models, but every prompt travels to someone else's servers, with variable retention policies and jurisdiction that shifts per provider. Internal compliance will cost you hours.

Speed: 80
Privacy: 15
Model capacity: 85

Gains speed · loses privacy

03 / Your slice at GPU Solutions · Balanced

Dedicated cluster in Madrid

B200 cluster speed with HBM3e, latest-gen models at native precision, VM-level isolation. Prompts and code are processed here. 100% Spanish data residency, ISO 27001 and ENS Medium certified.

Speed: 95
Privacy: 100
Model capacity: 100

Speed · privacy · capacity

All plans include

ISO 27001 + ENS Medium
100% data in Spain
VM-level isolation
Encrypted storage
Support in Spanish and English
No vendor lock-in

Tailored proposal

Every use case is different. Tell us what you want to do and we'll send you a concrete proposal in under 24 hours.

Request proposal