Engineering · Feb 6, 2026 · 14 min read

Building Sentinel: Our Automated Model Evaluation System

How we score model outputs across FID, CLIP, and custom metrics, automatically, per-niche, at scale. The engine behind our benchmark tables.

Ziad
Head of ML

When you run an image generation model, how do you know if the output is good? If you are processing ten images a day, you look at them. If you are processing a hundred thousand, you need something else entirely.

Sentinel is our answer: an automated evaluation engine that scores every output generated through Runflow across multiple quality dimensions, calibrated per niche, at scale. It processes over 100,000 outputs per day and powers every benchmark table on our site, including the Q1 2026 Portrait Generation Benchmark.

This post covers why we built it, how it works, and the technical challenges of running automated quality evaluation at production scale.

Why Automated Evaluation Matters

Manual quality assurance does not scale. At BetterPic, we started with human reviewers checking every generated headshot. When volume hit 5,000 images per day, we had a team of three people doing nothing but reviewing outputs. The feedback loop was slow (hours, not seconds), inconsistent (reviewer fatigue and subjective drift), and expensive. Worse, by the time a quality issue was caught, hundreds of bad outputs had already been delivered to customers.

We needed three things: real-time scoring (every output, within seconds of generation), consistency (the same image gets the same score every time), and niche awareness (what constitutes “good” differs dramatically between a corporate headshot and a creative portrait).

Architecture Overview

Sentinel sits in the critical path of every inference job. The flow is:

Job Completion
    │
    ▼
┌─────────────────────────────────┐
│  Output Capture                 │  ← Image + metadata extraction
└──────────────┬──────────────────┘
               ▼
┌─────────────────────────────────┐
│  Multi-Metric Scoring           │  ← FID, CLIP, Human Eval proxy
└──────────────┬──────────────────┘
               ▼
┌─────────────────────────────────┐
│  Niche-Specific Weighting       │  ← Per-category calibration
└──────────────┬──────────────────┘
               ▼
┌─────────────────────────────────┐
│  Dashboard & Alerting           │  ← Real-time metrics + threshold alerts
└─────────────────────────────────┘

The entire pipeline runs asynchronously. The customer gets their image immediately; Sentinel scores it in the background (typically within 2–3 seconds of generation). If a score falls below a configurable threshold, an alert fires and the output can be flagged for human review or automatic regeneration.
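A minimal sketch of that asynchronous flow, assuming hypothetical names (`score_output`, `on_job_complete`, `QUALITY_THRESHOLD`) and a placeholder scoring step, not Sentinel's actual implementation:

```python
import asyncio

QUALITY_THRESHOLD = 0.70  # assumed value; thresholds are configurable per niche

async def score_output(image_id: str) -> float:
    """Placeholder for the multi-metric scoring step (FID, CLIP, human-eval proxy)."""
    await asyncio.sleep(0)  # stands in for async model inference
    return 0.85             # composite score in [0, 1]

async def on_job_complete(image_id: str) -> str:
    """Runs in the background; the customer already has their image."""
    score = await score_output(image_id)
    if score < QUALITY_THRESHOLD:
        return "flagged"    # alert fires; human review or automatic regeneration
    return "passed"

print(asyncio.run(on_job_complete("img_001")))  # passed
```

The key design point is that scoring is off the customer's critical path: the job completes first, and the evaluation result only feeds dashboards, alerts, and optional regeneration.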

The Three Scoring Dimensions

1. FID (Distributional Similarity)

FID (Fréchet Inception Distance) measures how statistically similar a set of generated images is to a reference set of real images. A lower FID means the generated images share similar visual statistics (texture, color distribution, structural patterns) with real-world examples.

We compute FID against niche-specific reference sets. Our corporate headshot reference set contains 10,000 professionally photographed business portraits. Our creative portrait set contains 8,000 artistic and editorial portraits. This means FID scores are meaningful within a niche: a score of 22 on corporate headshots tells you something very different from a score of 22 on creative portraits.
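For reference, FID is the Fréchet distance between two Gaussians fitted to Inception features: ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). A small sketch of that arithmetic (using NumPy and SciPy; the function name is ours, not Sentinel's):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to Inception features."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions give FID = 0; shifting the mean raises it.
mu, sigma = np.zeros(4), np.eye(4)
print(round(fid(mu, sigma, mu, sigma), 6))  # 0.0
```

In practice the means and covariances come from Inception-v3 activations over the generated set and the niche-specific reference set, which is why the reference sets above matter so much.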

2. CLIP Alignment

CLIP scores measure how well the generated image matches the input prompt and any reference images provided. We use OpenCLIP ViT-H/14 to compute cosine similarity between the text embedding of the prompt and the image embedding of the output.

This catches a different class of failures than FID. An image can look photorealistic (good FID) but completely ignore the prompt (bad CLIP). For example, generating a “woman in a blue blazer” and producing a man in a red jacket would score well on FID (it is still a well-formed portrait) but poorly on CLIP (it does not match the request).
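The scoring arithmetic itself is just cosine similarity between normalized embeddings. A toy sketch with 3-dimensional vectors standing in for real 1024-dimensional OpenCLIP ViT-H/14 embeddings (all names and values here are illustrative):

```python
import numpy as np

def clip_alignment(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalized text and image embeddings."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    return float(t @ i)

# Toy embeddings: an image "close" to the prompt scores higher than one that ignores it.
prompt_emb = np.array([1.0, 0.0, 0.0])
good_match = np.array([0.9, 0.1, 0.0])  # follows the prompt
bad_match  = np.array([0.0, 1.0, 0.0])  # orthogonal: ignores the prompt

print(clip_alignment(prompt_emb, good_match) > clip_alignment(prompt_emb, bad_match))  # True
```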

3. Human Eval Proxy

This is the most nuanced dimension. We trained a custom evaluation model on 50,000 human preference judgments, A/B comparisons where trained evaluators selected the “better” image from a pair. The model learns subjective quality signals that FID and CLIP miss: natural skin tones, flattering lighting, appropriate cropping, and the overall “professional” look that matters for headshot use cases.

We retrain this model monthly with fresh human judgments to prevent drift. The Pearson correlation between our proxy model's scores and actual human preferences is currently r = 0.89, a strong (though not perfect) linear agreement with human evaluators.
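Tracking that correlation against each month's fresh judgments is what triggers retraining. A sketch of the check (the sample values and the retrain floor are illustrative assumptions):

```python
import numpy as np

def pearson_r(proxy_scores, human_scores) -> float:
    """Pearson correlation between proxy-model scores and human judgments."""
    return float(np.corrcoef(proxy_scores, human_scores)[0, 1])

# Hypothetical monthly drift check: retrain when correlation drops below a floor.
proxy = np.array([0.81, 0.62, 0.94, 0.40, 0.73])
human = np.array([0.78, 0.60, 0.90, 0.45, 0.70])

RETRAIN_FLOOR = 0.85  # assumed policy value
print(pearson_r(proxy, human) > RETRAIN_FLOOR)  # True for this toy sample
```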

Per-Niche Calibration

A corporate headshot and a creative portrait have fundamentally different quality criteria. Corporate headshots need neutral backgrounds, professional attire, natural skin tones, and standard framing. Creative portraits can have dramatic lighting, unusual angles, artistic backgrounds, and stylized processing.

Sentinel handles this through per-niche weight vectors. Each niche defines how much weight each scoring dimension gets in the composite score:

| Niche               | FID | CLIP | Human Eval |
|---------------------|-----|------|------------|
| Corporate Headshots | 45% | 25%  | 30%        |
| Creative Portraits  | 25% | 35%  | 40%        |
| E-commerce Products | 50% | 30%  | 20%        |

Corporate headshots weight FID heavily because statistical similarity to professional photography is the strongest predictor of quality in that niche. Creative portraits weight human eval more because subjective aesthetic judgment matters most when the goal is artistic expression.
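The composite is then just a weighted sum. A minimal sketch using the published percentages (the dictionary keys and function name are our own, not Sentinel internals):

```python
# Per-niche weight vectors; values mirror the table above.
NICHE_WEIGHTS = {
    "corporate_headshots": {"fid": 0.45, "clip": 0.25, "human_eval": 0.30},
    "creative_portraits":  {"fid": 0.25, "clip": 0.35, "human_eval": 0.40},
    "ecommerce_products":  {"fid": 0.50, "clip": 0.30, "human_eval": 0.20},
}

def composite_score(niche: str, metrics: dict) -> float:
    """Weighted sum of per-dimension scores (each assumed normalized to 0-100)."""
    weights = NICHE_WEIGHTS[niche]
    return sum(weights[m] * metrics[m] for m in weights)

score = composite_score("corporate_headshots",
                        {"fid": 90.0, "clip": 80.0, "human_eval": 85.0})
print(round(score, 2))  # 0.45*90 + 0.25*80 + 0.30*85 = 86.0
```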

Powering the Benchmark Tables

Every benchmark we publish on this blog is generated by Sentinel. When we say “Flux.2 [dev] scored 95 on portrait generation,” that number comes from running thousands of real production jobs through the full Sentinel pipeline and aggregating the niche-weighted composite scores.

This is what makes our benchmarks different from academic evaluations. We are not scoring on curated test sets; we are scoring on the actual workloads customers send through the platform. If a model performs well on academic benchmarks but poorly on real corporate headshots, Sentinel catches that immediately.

Production Stats

  • Daily Evaluations: 100K+
  • Avg Scoring Time: 2.3s
  • Human Eval Correlation: r = 0.89
  • Active Niches: 7

Challenges and Lessons

  • FID is noisy at small sample sizes. You need at least 500–1000 images for FID to be statistically reliable. For per-model, per-niche comparisons, this means waiting until enough data accumulates before publishing scores. We batch on 24-hour windows.
  • CLIP has known biases. CLIP tends to favor images with high contrast and saturated colors, which can penalize more natural, muted portrait styles. We counteract this with niche-specific CLIP calibration curves.
  • Human eval proxy drift is real. Without monthly retraining, our proxy model’s correlation with actual human preferences degrades by roughly 2% per month. Aesthetic preferences shift, and the model must keep up.
  • Latency budget is tight. Running three evaluation models per image within a 3-second budget requires careful GPU resource allocation. We run Sentinel scoring on dedicated inference GPUs separate from production generation to avoid resource contention.
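The first lesson, holding back FID until enough samples accumulate, can be sketched as a simple per-(model, niche) accumulator. Everything here is an assumed illustration; only the 500-sample lower bound comes from the text above:

```python
MIN_SAMPLES = 500  # lower bound for statistically reliable FID

class FidBatcher:
    """Counts outputs per (model, niche) pair within a scoring window."""

    def __init__(self):
        self.pending: dict[tuple[str, str], int] = {}

    def record(self, model: str, niche: str) -> None:
        key = (model, niche)
        self.pending[key] = self.pending.get(key, 0) + 1

    def ready(self, model: str, niche: str) -> bool:
        """Only publish a score once the window holds at least MIN_SAMPLES images."""
        return self.pending.get((model, niche), 0) >= MIN_SAMPLES

batcher = FidBatcher()
for _ in range(499):
    batcher.record("flux2-dev", "corporate_headshots")
print(batcher.ready("flux2-dev", "corporate_headshots"))  # False
batcher.record("flux2-dev", "corporate_headshots")
print(batcher.ready("flux2-dev", "corporate_headshots"))  # True
```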

What’s Next for Sentinel

We are working on three major additions. First, video quality scoring for the upcoming video generation benchmarks, which requires temporal consistency metrics beyond what FID and CLIP provide. Second, customer-specific calibration, allowing enterprise customers to provide their own reference sets and quality preferences. Third, real-time model routing based on Sentinel scores, automatically selecting the best model for each job based on live quality data rather than static benchmarks.

If you want to see Sentinel scores for your specific workload, reach out to our team and we will run a custom evaluation against your production data.

Tags: Sentinel, Evaluation, ML
