Cheap Frontier Models Run Hierarchical Orchestration
Five labs, four cheap value-tier models, one arbiter, and a complete production-grade feature-flag design — for about 17 cents. We ran a Hierarchical ensemble with four roles. Gemini 3.5 Flash set strategy, DeepSeek-V4-Pro implemented, Grok 4.3 and Qwen3.6-Flash reviewed, and Claude Haiku 4.5 synthesized the spec. Every stage is auditable, with a per-model cost ledger you can read line by line.
Time to read: 9–11 minutes
Session cost: Approx. $0.17 (Hierarchical, 4 workers + Claude Haiku 4.5 synthesis)
| Parameter | Value |
|---|---|
| Strategy | Hierarchical |
| Rounds | 1 (3 iterations configured) |
| Web Search | Disabled |
| Arbiter | Claude Haiku 4.5 |
| Models | Gemini 3.5 Flash, DeepSeek-V4-Pro, Grok 4.3, Qwen3.6-Flash |
Note: evaluations were not run for this session, so there are no judge scores. The value here is the design output and the transparent cost trail behind it.
Why test cheap models on a hard systems-design task?
Feature-flag services look simple and are not. The naive answer is a boolean in a config file. It collapses under percentage rollouts, targeting rules, flag dependencies, stale caches, and kill-switches that must propagate globally in seconds. It is a real production brief, the kind a platform team scopes over weeks.
Most multi-model "orchestration" products hide their routing behind a single flagship endpoint. Sakana's Fugu, for example, presents one premium model that internally decides who does what. You pay flagship prices and never see the seams. We wanted the opposite: a transparent hierarchy of cheap frontier models, each with a named role, each with its own cost line.
Design and implement a production-grade feature-flag service. Cover: the data model, an evaluation engine with percentage rollouts and targeting rules, a caching layer, a clean SDK API, and a test plan. Explicitly surface edge cases (flag dependencies, stale cache, kill-switch behavior) and security considerations.
We added the standard product-development clarifications: engineers implementing from scratch, a distributed multi-region deployment, and medium scale (1K–100K QPS, 100–10K flags).
Who were the contenders?
Hierarchical orchestration assigns each model a stage in a pipeline. Output flows Strategist → Implementer → Reviewers → Synthesizer.
| Model | Role | The Pitch |
|---|---|---|
| Gemini 3.5 Flash | Strategist | Fast, cheap planner. Sets the blueprint and the constraints. |
| DeepSeek-V4-Pro | Implementer | The builder. Turns the blueprint into schemas, code, and a test plan. |
| Grok 4.3 | Reviewer | Fast critic. Validates the plan and flags gaps. |
| Qwen3.6-Flash | Reviewer | Quality control. Audits the implementation against the blueprint. |
| Claude Haiku 4.5 | Synthesizer | The arbiter. Merges every stage into one production spec. |
How did four cheap models design a production service?
The hierarchy moved in four clean stages. Each model saw the prior stage's output and built on it.
Stage 1 — Strategist: Gemini 3.5 Flash sets the blueprint
Gemini opened with a "Strategic Blueprint" and an architecture split that the entire pipeline would inherit.
The goal of this initiative is to design and implement a highly resilient, low-latency, and secure Feature-Flag Service capable of supporting high-throughput production environments.
Its key move was separating a write-heavy Control Plane (flags, rules, dependencies, audit) from a read-heavy Data Plane optimized for sub-millisecond evaluation. It set this for $0.0332 in 16.8s — the most expensive single response in the round, yet still pocket change.
Stage 2 — Implementer: DeepSeek-V4-Pro builds it
DeepSeek accepted the blueprint and got to work.
No strategic issues identified. The strategic blueprint is comprehensive, practical, and establishes a clear separation of concerns. I will now proceed with the detailed implementation plan covering all requested areas.
It produced by far the longest response, 30,760 characters in 86s. The plan held a full PostgreSQL schema, a Go evaluation engine, and the two algorithms the design hinges on: deterministic percentage bucketing and acyclic dependency ordering.
bucket := hashFnv1a(flag.Key + salt + userKey) % 100 return bucket < percent
For dependencies it specified a topological sort that rejects cycles at save time, citing the textbook algorithm by name.
Topological sort (Kahn's algorithm); return error if cycle.

The cost ledger above tells the story of the round. DeepSeek wrote the most code and ran the longest, yet at $0.0115 it cost roughly a third of the Strategist's fee.
Stage 3 — Reviewers: Grok 4.3 and Qwen3.6-Flash audit the work
Grok reviewed fast and tersely, the shortest response of the round at 3,526 characters in 12.9s for $0.0063.
No strategic issues identified.
It restated the design as a crisp checklist, including its own variant of the bucketing rule and the kill-switch ordering.
Deterministic:
murmur3(user_id + flag_key + salt) % 100 < rollout_pct.
Kill-switch: checked first, short-circuits to
default_fallback.
Qwen3.6-Flash took the quality-control seat, producing a structured "Quality Control & Final Assessment Report."
Implementation B faithfully operationalizes every directive from the Strategic Blueprint (Model A).
It also did the sharpest comparative judgment of the round, ranking the detailed implementation above a thinner alternative.
Implementation C aligns conceptually but lacks the architectural specificity required for engineering handoff. It reads as a high-level sketch rather than a production specification.
Qwen did this for $0.0039 — the cheapest response in the session — despite emitting the most tokens of any worker (12,529).
Stage 4 — Synthesizer: Claude Haiku 4.5 ships the spec
The arbiter merged all four stages into one coherent production design. It covered the PostgreSQL data model, deterministic bucketing, a dependency DAG, and multi-tier caching. It also covered stale-cache versioning, a kill-switch fast path, SDK contracts, a test plan, and security controls.

What did the hierarchy cost?
Here is the full per-model ledger for the single round, before synthesis.
| Model | Role | Cost | Time | Tokens | Response |
|---|---|---|---|---|---|
| Gemini 3.5 Flash | Strategist | $0.0332 | 16.8s | 3,303 | 8,899 ch |
| DeepSeek-V4-Pro | Implementer | $0.0115 | 86.0s | 11,828 | 30,760 ch |
| Grok 4.3 | Reviewer | $0.0063 | 12.9s | 3,325 | 3,526 ch |
| Qwen3.6-Flash | Reviewer | $0.0039 | 22.2s | 12,529 | 7,216 ch |
The round cost $0.1107; Claude Haiku 4.5's synthesis added $0.0558, for a session total of Approx. $0.17 across 80,635 tokens and roughly 5.5 minutes.
The cost spread is the headline. Gemini's $0.0332 is about 8.5x Qwen's $0.0039 — even though Qwen produced more tokens (12,529 vs 3,303). Token count and price are decoupled across labs. A hierarchy lets you place the expensive model where it earns its keep, planning, and the cheap one where volume is fine, review.
How similar were the four answers?
Cosine similarity between responses shows the pipeline working as designed.
| Pair | Similarity | Reading |
|---|---|---|
| Gemini ↔ DeepSeek | 0.79 | Strategist and Implementer most aligned |
| DeepSeek ↔ Qwen | 0.72 | Reviewer tracks the implementation |
| Gemini ↔ Qwen | 0.69 | Reviewer also tracks the blueprint |
| DeepSeek ↔ Grok | 0.61 | Terse reviewer compresses the build |
| Grok ↔ Qwen | 0.57 | Two reviewers, different styles |
| Gemini ↔ Grok | 0.54 | Most distant pair |
The Strategist ↔ Implementer pair at 0.79 is the point of the exercise: DeepSeek built what Gemini specified rather than starting over. The lowest pair, Gemini ↔ Grok at 0.54, reflects the gap between a long blueprint and a one-screen review checklist.
What did the final synthesis produce?
Claude Haiku 4.5 returned a complete, structured specification. The excerpt below is taken verbatim from the synthesized answer.
# Production-Grade Feature-Flag Service: Complete Design & Implementation
## Executive Summary
This service decouples code deployment from feature release through a highly resilient, low-latency architecture supporting percentage rollouts, targeting rules, and global kill-switches. The system separates the **Control Plane** (write-heavy management) from the **Data Plane** (read-heavy, optimized for <1ms evaluation latency). All evaluations are deterministic, cache-aware, and gracefully degrade when the backend is unavailable.
## 1. Data Model
### 1.1 Relational Schema (PostgreSQL)
CREATE TABLE feature_flags (
id UUID PRIMARY KEY,
project_id UUID NOT NULL REFERENCES projects(id),
key TEXT NOT NULL,
flag_type TEXT NOT NULL DEFAULT 'boolean',
default_fallback_value JSONB NOT NULL,
kill_switch BOOLEAN NOT NULL DEFAULT false,
salt TEXT NOT NULL,
config_version BIGINT NOT NULL DEFAULT 0,
UNIQUE (project_id, key)
);
The full synthesis continues past this excerpt. It adds the targeting-rules and dependency tables and the deterministic bucketing rule hash(flag_key + salt + user_id) % 100. It specifies a multi-tier cache: SDK in-memory, SSE push, a polling fallback, and Redis Pub/Sub at the edge proxy. It then handles stale caches with monotonic config versioning and a kill-switch fast path. Finally it lists SDK API contracts, a test plan with an edge-case regression suite, and security controls for auth, data privacy, and attack prevention.
The arbiter also proposed four follow-up directions for the team. These were operational rollback and version management, cost and performance at massive flag cardinality, multi-region deployment and conflict resolution, and a client-side flag scope and permission model.
The Verdict
🏆 Claude Haiku 4.5 (Synthesizer) delivered the shippable artifact. It merged a blueprint, a 30,000-character implementation, and two reviews into one consistent spec — the document a team would actually hand to engineering. As an arbiter, cheap-but-careful beats expensive-but-opaque.
DeepSeek-V4-Pro (Implementer) carried the round. It wrote the most code and the only end-to-end implementation, including the FNV-1a bucketing function and Kahn's-algorithm DAG. At $0.0115 for 30,760 characters, it is the value engine of this hierarchy — slow at 86s, but worth the wait.
Gemini 3.5 Flash (Strategist) earned its premium. The most expensive worker at $0.0332, it set the Control Plane / Data Plane split that every later stage adopted. A good blueprint pays for itself downstream.
Grok 4.3 and Qwen3.6-Flash (Reviewers) were near-free quality gates. Grok's $0.0063 terse checklist and Qwen's $0.0039 structured audit caught alignment between blueprint and build for under a penny each.
Strategic takeaway: A transparent Hierarchical ensemble of cheap frontier models — five labs, named roles, a line-item ledger — produced a thorough production design for Approx. $0.17. Contrast that with a single black-box orchestration endpoint at flagship prices, where you cannot see who did what or why. When the routing is visible, you can put the costly model where judgment matters and the cheap one where volume is fine. The 8.5x cost spread across workers is not a problem to hide — it is a dial to tune.
Try It Yourself
- Open the AI Crucible dashboard and select Gemini 3.5 Flash, DeepSeek-V4-Pro, Grok 4.3, and Qwen3.6-Flash.
- Pick the Hierarchical strategy with Claude Haiku 4.5 as the arbiter.
- Paste a real systems-design brief with explicit edge cases and watch the pipeline build, review, and synthesize.
Suggested prompt variation:
Design and implement a production-grade rate-limiting service. Cover: the
data model, an enforcement engine with per-tenant quotas and burst handling,
a caching layer, a clean SDK API, and a test plan. Explicitly surface edge
cases (clock skew, hot keys, region failover) and security considerations.
Explore the design: Read the full run and raw model outputs in the Shared Chat Session.
Further Reading
- The Seven Ensemble Strategies — How Hierarchical fits alongside the other six orchestration patterns.
- Multi-Agent Orchestration — Why pipelines of specialized models beat a single endpoint.
- Cost and Token Optimizations — The systems that keep a multi-model run this cheap.
- Understanding AI Roles — How Strategist, Implementer, Reviewer, and Synthesizer roles are assigned.