FinOps after the GPU bill: what we learned
Three patterns that cut a USD 4M monthly inference bill by 53%.
The bill that surprised us
When the customer first showed me the GPU bill, I assumed there was a typo. There was no typo. Inference workloads had quietly grown to consume 41% of the customer’s entire cloud spend, and the trend line said it would be 60% by year-end.
Pattern 1 — Aggressive routing
Most queries do not need the largest model. We rebuilt the model gateway to route by complexity score: trivial queries went to a 7B model, medium ones to a 70B, and only the genuinely hard ones reached the flagship. Result: a 38% reduction in inference cost, with no quality regression on the eval set.
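A minimal sketch of what such a gateway can look like, in Python. The tier names, thresholds, and scoring heuristic below are illustrative assumptions, not the customer's actual implementation; a production router would more likely score queries with a small trained classifier.

```python
# Illustrative sketch of a complexity-scored gateway. Tier names,
# thresholds, and the heuristic are assumptions for illustration.

TIERS = [
    (0.3, "small-7b"),    # trivial: lookups, rephrasing, short extraction
    (0.7, "medium-70b"),  # moderate reasoning, short synthesis
    (1.0, "flagship"),    # genuinely hard: long-context, multi-step work
]

def complexity_score(query: str) -> float:
    """Cheap heuristic score in [0, 1] based on length and hard markers."""
    score = min(len(query) / 2000, 1.0) * 0.5
    hard_markers = ("explain why", "compare", "step by step", "prove")
    if any(m in query.lower() for m in hard_markers):
        score += 0.5
    return min(score, 1.0)

def route(query: str) -> str:
    """Pick the cheapest tier whose threshold covers the score."""
    s = complexity_score(query)
    for threshold, model in TIERS:
        if s <= threshold:
            return model
    return TIERS[-1][1]  # unreachable given the 1.0 ceiling, kept as a guard

print(route("What is the capital of France?"))                    # small-7b
print(route("Compare the two designs and explain why one wins"))  # medium-70b
```

The design point that matters is that the router always picks the cheapest tier whose threshold covers the score, so cost control lives in routing policy rather than per-team discipline.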
Pattern 2 — Caching the cacheable
A surprising fraction of production inference is users asking the same questions over and over. An aggressive response cache cut another 12%. The cache is namespace-scoped per tenant, so there is no cross-tenant leakage.
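A sketch of the isolation property, under stated assumptions: an in-process dict and a fixed TTL stand in for what would more plausibly be Redis in production, caching only deterministic (temperature-0) completions. The key layout is the point: the tenant id is hashed into the key, so identical prompts from different tenants can never share an entry.

```python
import hashlib
import time

# Sketch of a tenant-namespaced response cache. Key layout, TTL, and
# the normalisation step are assumptions, not the customer's design.

TTL_SECONDS = 3600
_cache: dict[str, tuple[float, str]] = {}

def _key(tenant_id: str, prompt: str) -> str:
    # The tenant id is part of the hash input, so identical prompts
    # from different tenants cannot collide: no cross-tenant leakage.
    normalised = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{tenant_id}\x00{normalised}".encode()).hexdigest()

def get(tenant_id: str, prompt: str) -> str | None:
    entry = _cache.get(_key(tenant_id, prompt))
    if entry is None:
        return None
    expires_at, response = entry
    return response if time.monotonic() <= expires_at else None

def put(tenant_id: str, prompt: str, response: str) -> None:
    _cache[_key(tenant_id, prompt)] = (time.monotonic() + TTL_SECONDS, response)
```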
Pattern 3 — Off-peak batch
Anything async — summarisation, classification, enrichment — moved to overnight batch with reserved capacity at a third of the on-demand price. Another 18% saving.
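A sketch of the deferral logic, with the caveat that the window bounds, job shape, and the `run_on_reserved` callback are hypothetical, standing in for whatever reserved-capacity endpoint the platform actually exposes.

```python
import datetime
import queue

# Sketch: async workloads are queued instead of hitting on-demand
# inference, then drained inside a nightly reserved-capacity window.
# The 01:00-05:00 window and dict-shaped jobs are assumptions.

BATCH_WINDOW = (datetime.time(1, 0), datetime.time(5, 0))
batch_queue: "queue.Queue[dict]" = queue.Queue()

def submit(job: dict) -> None:
    """Summarisation, classification, enrichment: anything async goes here."""
    batch_queue.put(job)

def in_window(now: datetime.time) -> bool:
    start, end = BATCH_WINDOW
    return start <= now < end

def drain(run_on_reserved) -> None:
    """Called by a scheduler; drains the queue only inside the window."""
    while in_window(datetime.datetime.now().time()) and not batch_queue.empty():
        run_on_reserved(batch_queue.get())
```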
The compound effect
Together: a 53% reduction on a USD 4M monthly bill. Note that the three figures compound rather than sum, since each pattern cuts only the spend left over by the previous one; that is why the total lands near 53% rather than 68%. The savings paid for the FinOps programme in four weeks, and the customer redirected the avoided spend into doubling the size of the agent platform team.
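A back-of-envelope check, assuming each per-pattern figure applies to the spend remaining after the previous one (an assumption about how the numbers were measured): the stated cuts compound to roughly 55%, and the gap to the reported 53% is plausibly rounding in the individual figures.

```python
# Compounding check: each cut applies to what the previous one left.
bill = 4_000_000
for name, cut in [("routing", 0.38), ("caching", 0.12), ("batch", 0.18)]:
    bill *= 1 - cut
    print(f"after {name}: USD {bill:,.0f}")
# after routing: USD 2,480,000
# after caching: USD 2,182,400
# after batch:   USD 1,789,568  (roughly a 55% reduction)
```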