FinOps after the GPU bill: what we learned
Three patterns that cut a USD 4M monthly inference bill by 53%.
The bill that surprised us
When the customer first showed me the GPU bill, I assumed there was a typo. There was no typo. Inference workloads had quietly grown to consume 41% of the customer’s entire cloud spend, and the trend line said it would be 60% by year-end.
Pattern 1 — Aggressive routing
Most queries do not need the largest model. We rebuilt the model gateway to route by complexity score: trivial queries went to a 7B model, medium ones to a 70B, and only the genuinely hard ones reached the flagship. Result: a 38% reduction in inference cost, with no quality regression on the eval set.
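A minimal sketch of what such a gateway can look like, in Python. The tier names, thresholds, and scoring heuristic below are illustrative assumptions, not the customer's actual implementation; a production router would more likely score queries with a small trained classifier.

```python
# Illustrative sketch of a complexity-scored gateway. Tier names,
# thresholds, and the heuristic are assumptions for illustration.

TIERS = [
    (0.3, "small-7b"),    # trivial: lookups, rephrasing, short extraction
    (0.7, "medium-70b"),  # moderate reasoning, short synthesis
    (1.0, "flagship"),    # genuinely hard: long-context, multi-step work
]

def complexity_score(query: str) -> float:
    """Cheap heuristic score in [0, 1] based on length and hard markers."""
    score = min(len(query) / 2000, 1.0) * 0.5
    hard_markers = ("explain why", "compare", "step by step", "prove")
    if any(m in query.lower() for m in hard_markers):
        score += 0.5
    return min(score, 1.0)

def route(query: str) -> str:
    """Pick the cheapest tier whose threshold covers the score."""
    s = complexity_score(query)
    for threshold, model in TIERS:
        if s <= threshold:
            return model
    return TIERS[-1][1]  # unreachable given the 1.0 ceiling, kept as a guard

print(route("What is the capital of France?"))                    # small-7b
print(route("Compare the two designs and explain why one wins"))  # medium-70b
```

The design point that matters is that the router always picks the cheapest tier whose threshold covers the score, so cost control lives in routing policy rather than per-team discipline.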
Pattern 2 — Caching the cacheable
A surprising fraction of production inference is users asking the same questions over and over. An aggressive response cache cut another 12%. The cache is namespace-scoped per tenant, so there is no cross-tenant leakage.
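A sketch of the isolation property, under stated assumptions: an in-process dict and a fixed TTL stand in for what would more plausibly be Redis in production, caching only deterministic (temperature-0) completions. The key layout is the point: the tenant id is hashed into the key, so identical prompts from different tenants can never share an entry.

```python
import hashlib
import time

# Sketch of a tenant-namespaced response cache. Key layout, TTL, and
# the normalisation step are assumptions, not the customer's design.

TTL_SECONDS = 3600
_cache: dict[str, tuple[float, str]] = {}

def _key(tenant_id: str, prompt: str) -> str:
    # The tenant id is part of the hash input, so identical prompts
    # from different tenants cannot collide: no cross-tenant leakage.
    normalised = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{tenant_id}\x00{normalised}".encode()).hexdigest()

def get(tenant_id: str, prompt: str) -> str | None:
    entry = _cache.get(_key(tenant_id, prompt))
    if entry is None:
        return None
    expires_at, response = entry
    return response if time.monotonic() <= expires_at else None

def put(tenant_id: str, prompt: str, response: str) -> None:
    _cache[_key(tenant_id, prompt)] = (time.monotonic() + TTL_SECONDS, response)
```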
Pattern 3 — Off-peak batch
Anything async — summarisation, classification, enrichment — moved to overnight batch with reserved capacity at a third of the on-demand price. Another 18% saving.
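A sketch of the deferral logic, with the caveat that the window bounds, job shape, and the `run_on_reserved` callback are hypothetical, standing in for whatever reserved-capacity endpoint the platform actually exposes.

```python
import datetime
import queue

# Sketch: async workloads are queued instead of hitting on-demand
# inference, then drained inside a nightly reserved-capacity window.
# The 01:00-05:00 window and dict-shaped jobs are assumptions.

BATCH_WINDOW = (datetime.time(1, 0), datetime.time(5, 0))
batch_queue: "queue.Queue[dict]" = queue.Queue()

def submit(job: dict) -> None:
    """Summarisation, classification, enrichment: anything async goes here."""
    batch_queue.put(job)

def in_window(now: datetime.time) -> bool:
    start, end = BATCH_WINDOW
    return start <= now < end

def drain(run_on_reserved) -> None:
    """Called by a scheduler; drains the queue only inside the window."""
    while in_window(datetime.datetime.now().time()) and not batch_queue.empty():
        run_on_reserved(batch_queue.get())
```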
The compound effect
Together: a 53% reduction on a USD 4M monthly bill. Note that the three figures compound rather than sum, since each pattern cuts only the spend left over by the previous one; that is why the total lands near 53% rather than 68%. The savings paid for the FinOps programme in four weeks, and the customer redirected the avoided spend into doubling the size of the agent platform team.
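A back-of-envelope check, assuming each per-pattern figure applies to the spend remaining after the previous one (an assumption about how the numbers were measured): the stated cuts compound to roughly 55%, and the gap to the reported 53% is plausibly rounding in the individual figures.

```python
# Compounding check: each cut applies to what the previous one left.
bill = 4_000_000
for name, cut in [("routing", 0.38), ("caching", 0.12), ("batch", 0.18)]:
    bill *= 1 - cut
    print(f"after {name}: USD {bill:,.0f}")
# after routing: USD 2,480,000
# after caching: USD 2,182,400
# after batch:   USD 1,789,568  (roughly a 55% reduction)
```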