Portrait Generation Benchmark Q1 2026: Flux.2 vs SDXL vs Proprietary
Benchmarks | Feb 24, 2026 | 12 min read


We ran 12,000 inference jobs across 8 models on real production workloads. Here's what we found about quality, cost, and latency tradeoffs for AI headshot generation.

Ziad
Head of ML

Every quarter, we benchmark every major image generation model against real production workloads from our platform. Not synthetic tests: actual jobs from customers generating AI headshots at scale.

This quarter, we tested 8 models across 12,000 inference jobs, scoring each on quality (FID, CLIP, human eval), cost per image, and p95 latency. Here’s the full breakdown.

Why We Benchmark Differently

Most model comparisons use academic datasets: ImageNet, LAION, curated prompt sets. That’s useful for research, but it tells you nothing about how a model performs on your workload.

At Runflow, we route tens of thousands of real inference jobs per day. We see exactly how models perform on corporate headshots, e-commerce product photos, and creative portraits: the actual use cases customers care about.

Our Sentinel evaluation engine scores every output automatically across three dimensions:

  • FID Score — Measures distributional similarity to high-quality reference sets, per niche
  • CLIP Alignment — How well the output matches the input prompt and reference image
  • Human Eval — Blind A/B testing with trained evaluators (n=500 per model pair)
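For intuition on the second dimension: CLIP alignment is conventionally computed as the cosine similarity between the image embedding and the prompt embedding. A minimal sketch with NumPy; the toy vectors stand in for real CLIP encoder outputs, and this is not Sentinel's actual implementation:

```python
import numpy as np

def clip_alignment(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a prompt embedding.

    In practice both vectors come from a CLIP encoder; here they are plain
    NumPy arrays so the scoring logic stands on its own.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

# Toy vectors standing in for real CLIP embeddings
img = np.array([0.2, 0.8, 0.1])
txt = np.array([0.25, 0.75, 0.05])
score = clip_alignment(img, txt)  # close to 1.0 for well-aligned pairs
```

Scores near 1.0 indicate the output matches the prompt; unrelated pairs fall toward 0.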

The Models

We tested the following models, all running on our multi-cloud orchestration layer to normalize for infrastructure differences:

Model               Version   Type           Provider
Flux.2 [dev]        v2.0.1    Open Source    Self-hosted
Flux.2 [schnell]    v2.0.1    Open Source    Self-hosted
SDXL Lightning      4-step    Open Source    Self-hosted
SDXL Turbo          1-step    Open Source    Self-hosted
Proprietary A       -         Closed Source  API
Proprietary B       -         Closed Source  API

Results: Quality Scores

The composite quality score combines FID (40%), CLIP alignment (30%), and human evaluation (30%). All scores are normalized to a 0–100 scale.
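The weighting can be written out directly. A minimal sketch, assuming each sub-score is already normalized to 0–100 with FID inverted so higher is better (raw FID rewards lower values); the example inputs are hypothetical, not the benchmark's actual sub-scores:

```python
def composite_score(fid_norm: float, clip_norm: float, human_norm: float) -> float:
    """Weighted composite on a 0-100 scale.

    Inputs are assumed already normalized to 0-100, with FID inverted so
    higher is better. Weights follow the post: FID 40%, CLIP alignment 30%,
    human eval 30%.
    """
    return 0.40 * fid_norm + 0.30 * clip_norm + 0.30 * human_norm

# Hypothetical normalized sub-scores, not the actual benchmark inputs:
# 0.40*96 + 0.30*94 + 0.30*95 = 38.4 + 28.2 + 28.5 = 95.1
score = composite_score(fid_norm=96, clip_norm=94, human_norm=95)
```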

  • Flux.2 [dev]: 95
  • Proprietary A: 94
  • Flux.2 [schnell]: 91
  • SDXL Lightning: 91
  • Proprietary B: 88
  • SDXL Turbo: 82

The headline: Flux.2 [dev] scored 95, matching or exceeding proprietary models across all three evaluation dimensions. For the first time in our benchmarks, an open-source model leads the portrait generation category outright.

Results: Cost per Image

Cost calculations include GPU compute, orchestration overhead, and our platform fee. All self-hosted models ran on equivalent hardware (A100 80GB) through our multi-cloud orchestration layer; the proprietary models were accessed through their public APIs.
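For self-hosted models, the per-image figure decomposes into a raw compute rate plus markups for the components listed above. A minimal sketch; all rates here are hypothetical, not Runflow's actual pricing:

```python
def cost_per_image(gpu_hourly_usd: float, images_per_hour: float,
                   orchestration_overhead: float, platform_fee: float) -> float:
    """Per-image cost from GPU rate and throughput.

    orchestration_overhead and platform_fee are fractional markups on the
    raw compute cost. All rates are illustrative.
    """
    raw = gpu_hourly_usd / images_per_hour
    return raw * (1 + orchestration_overhead) * (1 + platform_fee)

# Hypothetical: $2.00/hr A100, 2,000 images/hr, 10% overhead, 15% fee
cost = cost_per_image(2.00, 2000, 0.10, 0.15)
```

Under these assumptions the per-image cost lands around a tenth of a cent, which is the right order of magnitude for the fast distilled models above.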

  • SDXL Turbo: $0.0008 per image (fastest, lowest quality)
  • Flux.2 [schnell]: $0.0013 per image (best value)
  • Flux.2 [dev]: $0.0096 per image (best quality)

Results: Latency (p95)

Latency was measured end-to-end from API request to image delivery, including model loading (cold start) and network transfer. All measurements are p95 across the full 12K job dataset.

  • SDXL Turbo: 0.8s — Single step, extremely fast
  • Flux.2 [schnell]: 1.2s — 4 steps, excellent tradeoff
  • SDXL Lightning: 1.4s — 4 steps, solid performance
  • Flux.2 [dev]: 4.8s — 20 steps, highest quality
  • Proprietary A: 6.2s — API overhead adds latency
  • Proprietary B: 8.1s — Slowest, queue-based
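For reference, p95 is the value at or below which 95% of observed latencies fall. A sketch using the nearest-rank method on synthetic samples (not our measurement pipeline):

```python
import math

def p95(samples: list[float]) -> float:
    """p95 via the nearest-rank method: sort the samples and take the
    value at the ceil(0.95 * n)-th position (1-based)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered)) - 1  # convert to 0-based index
    return ordered[rank]

# 100 synthetic latencies: 1.0s through 100.0s
latencies = [float(i) for i in range(1, 101)]
p95(latencies)  # 95.0
```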

Key Takeaways

  1. Open source has caught up. Flux.2 [dev] matches proprietary quality at a fraction of the compute cost. The moat for closed-source portrait models is effectively gone.
  2. Speed vs quality is a real tradeoff. SDXL Turbo is 6x faster than Flux.2 [dev] but scores 13 points lower. Choose based on your use case.
  3. Per-niche scoring matters. SDXL Lightning beats Flux.2 [schnell] on corporate headshots but loses on creative portraits. Aggregate scores hide important nuances.
  4. Reliability is infrastructure, not model choice. The same model can have wildly different uptime depending on your GPU provider. Runflow routes across multiple datacenters for consistent availability.
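The speed-versus-quality tradeoff in takeaway 2 can be made concrete as a simple selection rule: pick the highest-quality model whose p95 latency fits your budget. A sketch using the quality and latency numbers from the tables above; production routing would also weigh cost and per-niche scores:

```python
# Quality scores and p95 latencies from the benchmark tables above
MODELS = {
    "Flux.2 [dev]":     {"quality": 95, "p95_s": 4.8},
    "Flux.2 [schnell]": {"quality": 91, "p95_s": 1.2},
    "SDXL Lightning":   {"quality": 91, "p95_s": 1.4},
    "SDXL Turbo":       {"quality": 82, "p95_s": 0.8},
}

def pick_model(latency_budget_s: float) -> str:
    """Highest-quality model whose p95 latency fits the budget.

    Deliberately simple: ties go to the first-listed model."""
    eligible = {n: m for n, m in MODELS.items() if m["p95_s"] <= latency_budget_s}
    if not eligible:
        raise ValueError("no model fits the latency budget")
    return max(eligible, key=lambda n: eligible[n]["quality"])

pick_model(10.0)  # "Flux.2 [dev]"
pick_model(1.0)   # "SDXL Turbo"
```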

Methodology Notes

All benchmark results are reproducible. We publish our evaluation pipeline, reference datasets, and scoring rubrics in our open benchmark repository. If you find discrepancies, we want to know—open an issue or reach out directly.

Models labeled “Proprietary A” and “Proprietary B” are anonymized per our testing agreements. We’ll name them explicitly once we have permission from the providers.

What’s Next

Q2 benchmarks will expand to include video generation models (Wan2.6, Kling 2.1, Seedance) and our new virtual try-on pipeline. We’re also adding latency-under-load testing to simulate real production traffic patterns.

Want to run these benchmarks on your own workload? Talk to our team — we’ll set up a custom evaluation against your production data.

