Back to blog
Engineering · Feb 20, 2026 · 18 min read

How We Cut GPU Costs 70% — The Architecture Behind Runflow

From $47K/mo to $14K/mo. The multi-cloud orchestration system that powers our inference platform, born from running 100K+ AI jobs at BetterPic.

Miguel P.
Head of Infrastructure

When we were running BetterPic at scale, our GPU bill was the single biggest line item. $47K/month for a workload that should have cost a fraction of that. The problem wasn’t the models—it was the infrastructure.

This is the story of how we rebuilt our GPU orchestration layer from scratch, cut costs by 70%, and turned that architecture into Runflow.

The Problem: Single-Cloud Lock-in

Like most AI startups, we started on a single cloud provider. One region, one GPU type, one set of pricing. The problems:

  • No spot market access. We were paying on-demand rates 24/7 for workloads that were bursty by nature.
  • Cold start taxes. Models sat loaded in VRAM even during off-peak hours. We were paying for idle GPUs.
  • No failover. When our provider had capacity issues, our entire pipeline went down.

The Architecture

Request → Router → [ Cloud A (spot)  |  Cloud B (reserved)  |  Cloud C (on-demand)  ]
              ↓
        Score: latency × 0.3 + cost × 0.4 + availability × 0.3
              ↓
        Best provider → Execute → Return

The router evaluates every available GPU endpoint in real time, scoring each on three dimensions: latency (how quickly it can respond), cost (spot vs. reserved vs. on-demand pricing), and availability (current queue depth and error rate).
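The scoring step can be sketched as a weighted sum over normalized metrics. This is a minimal illustration, not Runflow's actual router: the provider names, metric fields, and normalization bounds are all assumptions made up for the example, only the 0.3 / 0.4 / 0.3 weights come from the formula above.

```python
from dataclasses import dataclass

# Weights from the routing formula: latency x 0.3 + cost x 0.4 + availability x 0.3.
W_LATENCY, W_COST, W_AVAILABILITY = 0.3, 0.4, 0.3

@dataclass
class Endpoint:
    name: str
    latency_ms: float   # recent p50 response time
    cost_per_hr: float  # spot / reserved / on-demand price
    queue_depth: int    # jobs currently waiting
    error_rate: float   # rolling error fraction, 0..1

def normalize(value: float, worst: float) -> float:
    """Map a raw metric to [0, 1]; lower raw values score closer to 1."""
    return max(0.0, 1.0 - value / worst)

def score(ep: Endpoint) -> float:
    latency = normalize(ep.latency_ms, worst=2000.0)
    cost = normalize(ep.cost_per_hr, worst=10.0)
    # Availability combines queue depth and error rate into one term.
    availability = normalize(ep.queue_depth, worst=50.0) * (1.0 - ep.error_rate)
    return W_LATENCY * latency + W_COST * cost + W_AVAILABILITY * availability

def pick(endpoints: list[Endpoint]) -> Endpoint:
    return max(endpoints, key=score)

endpoints = [
    Endpoint("cloud-a-spot", latency_ms=180, cost_per_hr=0.9, queue_depth=4, error_rate=0.02),
    Endpoint("cloud-b-reserved", latency_ms=120, cost_per_hr=2.4, queue_depth=1, error_rate=0.0),
    Endpoint("cloud-c-ondemand", latency_ms=150, cost_per_hr=4.1, queue_depth=0, error_rate=0.01),
]
print(pick(endpoints).name)  # the cheap spot endpoint wins despite higher latency
```

With the 0.4 weight on cost, a slightly slower spot endpoint outscores a faster but pricier reserved one, which is exactly the behavior the architecture is after.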

The Results

  • Before: $47K monthly GPU spend
  • After: $14K monthly GPU spend
  • Reduction: 70% cost savings

But the cost reduction was almost a side effect. The real wins were operational:

  • 99.97% uptime — up from 99.2% on single-cloud
  • p95 latency dropped 40% — geographic routing puts compute closer to users
  • Zero capacity-related outages — automatic failover across 3 providers

Lessons Learned

  • Spot instances are underrated for inference. Unlike training, inference jobs are short-lived and stateless. Spot interruptions are annoying but not catastrophic—just retry on another provider. We run 60% of our workload on spot.
  • Cold start optimization matters more than raw GPU speed. A model that loads in 2s on a slower GPU often beats a faster GPU with a 15s cold start. We pre-warm models based on predicted demand patterns.
  • Provider diversity is a feature, not overhead. Every provider has different pricing, availability patterns, and failure modes. Multi-cloud isn’t about redundancy—it’s about arbitrage.
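The "just retry on another provider" lesson can be sketched as a small failover loop. Everything here is hypothetical: `run_on`, the `SpotInterrupted` exception, and the provider names are stand-ins invented for the example, not Runflow's API.

```python
import random

class SpotInterrupted(Exception):
    """Raised when spot capacity is reclaimed mid-job."""

def run_on(provider: str, job: str) -> str:
    """Stand-in for an inference call; spot capacity can vanish at any time."""
    if provider.endswith("spot") and random.random() < 0.3:
        raise SpotInterrupted(provider)
    return f"{job} done on {provider}"

def run_with_failover(job: str, providers: list[str], retries_per_provider: int = 1) -> str:
    """Inference jobs are short-lived and stateless, so a failed attempt is
    simply resubmitted to the next provider in cost order."""
    last_error = None
    for provider in providers:
        for _ in range(retries_per_provider):
            try:
                return run_on(provider, job)
            except SpotInterrupted as exc:
                last_error = exc
    raise RuntimeError(f"all providers exhausted, last interruption: {last_error}")

# Cheapest first: spot, then reserved, then on-demand.
result = run_with_failover("job-42", ["cloud-a-spot", "cloud-b-reserved", "cloud-c-ondemand"])
print(result)
```

Because the job carries no state, an interruption costs only one retry's worth of latency, which is why spot can safely carry the majority of an inference workload.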

This architecture is now the core of Runflow. If you’re running inference at scale and want to see how much you could save, talk to our team.

Architecture · GPU · Infrastructure

Want custom benchmarks for your workload?

We'll run our evaluation pipeline against your production data, for free.

Talk to Founders