Cheap Frontier Models Run Hierarchical Orchestration

Five labs, four value-tier worker models, one arbiter, and a complete production-grade feature-flag design, all for about 17 cents. We ran a Hierarchical ensemble with four roles: Gemini 3.5 Flash set strategy, DeepSeek-V4-Pro implemented, Grok 4.3 and Qwen3.6-Flash reviewed, and Claude Haiku 4.5 synthesized the spec. Every stage is auditable, with a per-model cost ledger you can read line by line.

Time to read: 9–11 minutes

Session cost: Approx. $0.17 (Hierarchical, 4 workers + Claude Haiku 4.5 synthesis)

Parameter	Value
Strategy	Hierarchical
Rounds	1 (3 iterations configured)
Web Search	Disabled
Arbiter	Claude Haiku 4.5
Models	Gemini 3.5 Flash, DeepSeek-V4-Pro, Grok 4.3, Qwen3.6-Flash

Note: evaluations were not run for this session, so there are no judge scores. The value here is the design output and the transparent cost trail behind it.

Why test cheap models on a hard systems-design task?

Feature-flag services look simple and are not. The naive answer, a boolean in a config file, collapses under percentage rollouts, targeting rules, flag dependencies, stale caches, and kill-switches that must propagate globally in seconds. This is a real production brief, the kind a platform team scopes over weeks.

Most multi-model "orchestration" products hide their routing behind a single flagship endpoint. Sakana's Fugu, for example, presents one premium model that internally decides who does what. You pay flagship prices and never see the seams. We wanted the opposite: a transparent hierarchy of cheap frontier models, each with a named role, each with its own cost line.

The prompt: "Design and implement a production-grade feature-flag service. Cover: the data model, an evaluation engine with percentage rollouts and targeting rules, a caching layer, a clean SDK API, and a test plan. Explicitly surface edge cases (flag dependencies, stale cache, kill-switch behavior) and security considerations."

We added the standard product-development clarifications: engineers implementing from scratch, a distributed multi-region deployment, and medium scale (1K–100K QPS, 100–10K flags).

Who were the contenders?

Hierarchical orchestration assigns each model a stage in a pipeline. Output flows Strategist → Implementer → Reviewers → Synthesizer.

Model	Role	The Pitch
Gemini 3.5 Flash	Strategist	Fast, cheap planner. Sets the blueprint and the constraints.
DeepSeek-V4-Pro	Implementer	The builder. Turns the blueprint into schemas, code, and a test plan.
Grok 4.3	Reviewer	Fast critic. Validates the plan and flags gaps.
Qwen3.6-Flash	Reviewer	Quality control. Audits the implementation against the blueprint.
Claude Haiku 4.5	Synthesizer	The arbiter. Merges every stage into one production spec.

How did four cheap models design a production service?

The hierarchy moved in four clean stages. Each model saw the prior stage's output and built on it.

Stage 1 — Strategist: Gemini 3.5 Flash sets the blueprint

Gemini opened with a "Strategic Blueprint" and an architecture split that the entire pipeline would inherit.

The goal of this initiative is to design and implement a highly resilient, low-latency, and secure Feature-Flag Service capable of supporting high-throughput production environments.

Its key move was separating a write-heavy Control Plane (flags, rules, dependencies, audit) from a read-heavy Data Plane optimized for sub-millisecond evaluation. It delivered this for $0.0332 in 16.8s, the most expensive worker response in the round, yet still pocket change.

Stage 2 — Implementer: DeepSeek-V4-Pro builds it

DeepSeek accepted the blueprint and got to work.

No strategic issues identified. The strategic blueprint is comprehensive, practical, and establishes a clear separation of concerns. I will now proceed with the detailed implementation plan covering all requested areas.

It produced by far the longest response: 30,760 characters in 86s. The plan contained a full PostgreSQL schema, a Go evaluation engine, and the two algorithms the design hinges on: deterministic percentage bucketing and acyclic dependency ordering.

bucket := hashFnv1a(flag.Key + salt + userKey) % 100
return bucket < percent
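
The bucketing rule quoted above can be sketched as a runnable function. This is a minimal Python illustration, not DeepSeek's Go code: the names `fnv1a_32` and `in_rollout` are hypothetical, and the 32-bit FNV-1a variant is an assumption, since the excerpt does not state the hash width.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def in_rollout(flag_key: str, salt: str, user_key: str, percent: int) -> bool:
    """Return True if this user lands in the first `percent` of 100 buckets."""
    bucket = fnv1a_32((flag_key + salt + user_key).encode("utf-8")) % 100
    return bucket < percent
```

Because the bucket depends only on the flag key, salt, and user key, a given user keeps the same assignment across evaluations, and raising `percent` only adds users to the rollout without removing anyone already in it.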

For dependencies it specified a topological sort that rejects cycles at save time, citing the textbook algorithm by name.

Topological sort (Kahn's algorithm); return error if cycle.
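
As a rough illustration of that save-time check, here is a minimal Python sketch of Kahn's algorithm. The function name `evaluation_order` and the shape of the `dependencies` mapping are assumptions made for the example; the article does not show DeepSeek's actual implementation.

```python
from collections import deque


def evaluation_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Order flags so each one comes after the flags it depends on.

    `dependencies` maps a flag key to the keys it depends on (hypothetical shape).
    Raises ValueError if the graph contains a cycle.
    """
    # Collect every flag mentioned, either as a key or as a dependency.
    flags = set(dependencies)
    for deps in dependencies.values():
        flags.update(deps)

    remaining = {flag: 0 for flag in flags}    # count of unmet dependencies
    dependents = {flag: [] for flag in flags}  # reverse edges
    for flag, deps in dependencies.items():
        for dep in set(deps):
            remaining[flag] += 1
            dependents[dep].append(flag)

    # Start with flags that depend on nothing; sorted for a stable order.
    ready = deque(sorted(f for f in flags if remaining[f] == 0))
    order = []
    while ready:
        flag = ready.popleft()
        order.append(flag)
        for child in dependents[flag]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any flag never released is part of (or downstream of) a cycle.
    if len(order) != len(flags):
        raise ValueError("dependency cycle detected")
    return order
```

Running this when a flag is saved, rather than at evaluation time, keeps the hot path free of graph traversal: a write that would introduce a cycle is rejected before it reaches the Data Plane.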

Figure: Per-model cost ledger

The per-model cost ledger tells the story of the round. DeepSeek wrote the most code and ran the longest, yet at $0.0115 it cost roughly a third of the Strategist's fee.

Stage 3 — Reviewers: Grok 4.3 and Qwen3.6-Flash audit the work

Grok reviewed fast and tersely, the shortest response of the round at 3,526 characters in 12.9s for $0.0063.

No strategic issues identified.

It restated the design as a crisp checklist, including its own variant of the bucketing rule and the kill-switch ordering.

Deterministic: murmur3(user_id + flag_key + salt) % 100 < rollout_pct.

Kill-switch: checked first, short-circuits to default_fallback.
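
The ordering in Grok's checklist can be sketched in a few lines of Python. Everything here is illustrative: the `Flag` fields `enabled_value` and `rollout_pct` are hypothetical, and SHA-256 stands in for murmur3 because murmur3 is not in the Python standard library.

```python
import hashlib
from dataclasses import dataclass
from typing import Any


@dataclass
class Flag:
    key: str
    salt: str
    default_fallback: Any
    enabled_value: Any        # hypothetical: value served inside the rollout
    rollout_pct: int          # 0..100
    kill_switch: bool = False


def bucket_for(flag: Flag, user_id: str) -> int:
    """Stable bucket in 0..99. SHA-256 is a stand-in for murmur3 here."""
    payload = f"{user_id}{flag.key}{flag.salt}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") % 100


def evaluate(flag: Flag, user_id: str) -> Any:
    # The kill-switch is checked first and short-circuits everything else.
    if flag.kill_switch:
        return flag.default_fallback
    if bucket_for(flag, user_id) < flag.rollout_pct:
        return flag.enabled_value
    return flag.default_fallback
```

Putting the kill-switch ahead of the hash means a disabled flag costs one boolean check, and no rollout or targeting state can override it.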

Qwen3.6-Flash took the quality-control seat, producing a structured "Quality Control & Final Assessment Report."

Implementation B faithfully operationalizes every directive from the Strategic Blueprint (Model A).

It also made the sharpest comparative judgment of the round, ranking the detailed implementation above a thinner alternative.

Implementation C aligns conceptually but lacks the architectural specificity required for engineering handoff. It reads as a high-level sketch rather than a production specification.

Qwen did this for $0.0039 — the cheapest response in the session — despite emitting the most tokens of any worker (12,529).

Stage 4 — Synthesizer: Claude Haiku 4.5 ships the spec

The arbiter merged all four stages into one coherent production design. It covered the PostgreSQL data model, deterministic bucketing, a dependency DAG, and multi-tier caching. It also covered stale-cache versioning, a kill-switch fast path, SDK contracts, a test plan, and security controls.

Synthesized production design

What did the hierarchy cost?

Here is the full per-model ledger for the single round, before synthesis.

Model	Role	Cost	Time	Tokens	Response
Gemini 3.5 Flash	Strategist	$0.0332	16.8s	3,303	8,899 ch
DeepSeek-V4-Pro	Implementer	$0.0115	86.0s	11,828	30,760 ch
Grok 4.3	Reviewer	$0.0063	12.9s	3,325	3,526 ch
Qwen3.6-Flash	Reviewer	$0.0039	22.2s	12,529	7,216 ch

The round cost $0.1107; Claude Haiku 4.5's synthesis added $0.0558, for a session total of approximately $0.17 across 80,635 tokens and roughly 5.5 minutes.

The cost spread is the headline. Gemini's $0.0332 is about 8.5x Qwen's $0.0039, even though Qwen produced more tokens (12,529 vs 3,303). Token count and price are decoupled across labs. A hierarchy lets you place the expensive model where it earns its keep (planning) and the cheap one where volume is fine (review).
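
To make the decoupling concrete, the ledger figures can be turned into an effective price per 1,000 tokens. This is a small Python sketch using only the numbers from the table; it assumes the Tokens column is the billable count for each response, which the article does not state, and the helper names are hypothetical.

```python
# (cost in USD, tokens) copied from the per-model ledger
LEDGER = {
    "Gemini 3.5 Flash": (0.0332, 3303),
    "DeepSeek-V4-Pro": (0.0115, 11828),
    "Grok 4.3": (0.0063, 3325),
    "Qwen3.6-Flash": (0.0039, 12529),
}


def cost_per_1k_tokens(cost_usd: float, tokens: int) -> float:
    """Effective price for 1,000 tokens, given one response's cost and size."""
    return cost_usd / tokens * 1000


def cost_spread(ledger: dict) -> float:
    """Ratio between the most and least expensive responses."""
    costs = [cost for cost, _ in ledger.values()]
    return max(costs) / min(costs)


if __name__ == "__main__":
    for model, (cost, tokens) in LEDGER.items():
        print(f"{model}: ${cost_per_1k_tokens(cost, tokens):.5f} per 1K tokens")
    print(f"Cost spread: {cost_spread(LEDGER):.1f}x")
```

Under that assumption, Gemini's effective rate works out to roughly $0.010 per 1K tokens against roughly $0.0003 for Qwen, so the per-token gap (about 32x) is far wider than the 8.5x gap in per-response cost.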

How similar were the four answers?

Cosine similarity between responses shows the pipeline working as designed.

Pair	Similarity	Reading
Gemini ↔ DeepSeek	0.79	Strategist and Implementer most aligned
DeepSeek ↔ Qwen	0.72	Reviewer tracks the implementation
Gemini ↔ Qwen	0.69	Reviewer also tracks the blueprint
DeepSeek ↔ Grok	0.61	Terse reviewer compresses the build
Grok ↔ Qwen	0.57	Two reviewers, different styles
Gemini ↔ Grok	0.54	Most distant pair

The Strategist ↔ Implementer pair at 0.79 is the point of the exercise: DeepSeek built what Gemini specified rather than starting over. The lowest pair, Gemini ↔ Grok at 0.54, reflects the gap between a long blueprint and a one-screen review checklist.
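
For readers unfamiliar with the metric, here is a minimal Python sketch of cosine similarity between two vectors. The article does not say how the responses were turned into vectors; presumably each response was embedded first, which is an assumption, and the toy vectors in the usage note are invented purely for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity([1.0, 1.0], [1.0, 0.0])` is about 0.71. The score depends only on direction, not vector length, so it does not reward one response merely for being longer than another.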

What did the final synthesis produce?

Claude Haiku 4.5 returned a complete, structured specification. The excerpt below is taken verbatim from the synthesized answer.

# Production-Grade Feature-Flag Service: Complete Design & Implementation

## Executive Summary

This service decouples code deployment from feature release through a highly resilient, low-latency architecture supporting percentage rollouts, targeting rules, and global kill-switches. The system separates the **Control Plane** (write-heavy management) from the **Data Plane** (read-heavy, optimized for <1ms evaluation latency). All evaluations are deterministic, cache-aware, and gracefully degrade when the backend is unavailable.

## 1. Data Model

### 1.1 Relational Schema (PostgreSQL)

CREATE TABLE feature_flags (
    id              UUID PRIMARY KEY,
    project_id      UUID NOT NULL REFERENCES projects(id),
    key             TEXT NOT NULL,
    flag_type       TEXT NOT NULL DEFAULT 'boolean',
    default_fallback_value JSONB NOT NULL,
    kill_switch     BOOLEAN NOT NULL DEFAULT false,
    salt            TEXT NOT NULL,
    config_version  BIGINT NOT NULL DEFAULT 0,
    UNIQUE (project_id, key)
);

The full synthesis continues past this excerpt. It adds the targeting-rules and dependency tables and the deterministic bucketing rule hash(flag_key + salt + user_id) % 100. It specifies a multi-tier cache: SDK in-memory, SSE push, a polling fallback, and Redis Pub/Sub at the edge proxy. It then handles stale caches with monotonic config versioning and a kill-switch fast path. Finally it lists SDK API contracts, a test plan with an edge-case regression suite, and security controls for auth, data privacy, and attack prevention.
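
The stale-cache handling is not shown in the excerpt, but the idea of monotonic config versioning can be sketched as follows. This is a hypothetical Python illustration: `FlagSnapshot` and `LocalFlagCache` are invented names, and only the `config_version` field comes from the schema above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FlagSnapshot:
    key: str
    config_version: int  # increases on every write in the Control Plane
    value: Any


@dataclass
class LocalFlagCache:
    _entries: dict = field(default_factory=dict)

    def apply(self, snapshot: FlagSnapshot) -> bool:
        """Store the snapshot unless an equal or newer version is cached."""
        current = self._entries.get(snapshot.key)
        if current is not None and snapshot.config_version <= current.config_version:
            return False  # stale or duplicate update: ignore it
        self._entries[snapshot.key] = snapshot
        return True

    def get(self, key: str, default: Any = None) -> Any:
        entry = self._entries.get(key)
        return default if entry is None else entry.value
```

With a rule like this, an SDK that receives updates from both a push stream and a polling fallback can apply them in any arrival order; a delayed poll result carrying an older version cannot overwrite a newer pushed value.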

The arbiter also proposed four follow-up directions for the team. These were operational rollback and version management, cost and performance at massive flag cardinality, multi-region deployment and conflict resolution, and a client-side flag scope and permission model.

The Verdict

🏆 Claude Haiku 4.5 (Synthesizer) delivered the shippable artifact. It merged a blueprint, a 30,000-character implementation, and two reviews into one consistent spec — the document a team would actually hand to engineering. As an arbiter, cheap-but-careful beats expensive-but-opaque.

DeepSeek-V4-Pro (Implementer) carried the round. It wrote the most code and the only end-to-end implementation, including the FNV-1a bucketing function and Kahn's-algorithm DAG. At $0.0115 for 30,760 characters, it is the value engine of this hierarchy — slow at 86s, but worth the wait.

Gemini 3.5 Flash (Strategist) earned its premium. The most expensive worker at $0.0332, it set the Control Plane / Data Plane split that every later stage adopted. A good blueprint pays for itself downstream.

Grok 4.3 and Qwen3.6-Flash (Reviewers) were near-free quality gates. Grok's terse checklist ($0.0063) and Qwen's structured audit ($0.0039) each confirmed that the build matched the blueprint, for under a penny apiece.

Strategic takeaway: A transparent Hierarchical ensemble of cheap frontier models (five labs, named roles, a line-item ledger) produced a thorough production design for approximately $0.17. Contrast that with a single black-box orchestration endpoint at flagship prices, where you cannot see who did what or why. When the routing is visible, you can put the costly model where judgment matters and the cheap one where volume is fine. The 8.5x cost spread across workers is not a problem to hide; it is a dial to tune.

Try It Yourself

1. Open the AI Crucible dashboard and select Gemini 3.5 Flash, DeepSeek-V4-Pro, Grok 4.3, and Qwen3.6-Flash.
2. Pick the Hierarchical strategy with Claude Haiku 4.5 as the arbiter.
3. Paste a real systems-design brief with explicit edge cases and watch the pipeline build, review, and synthesize.

Suggested prompt variation:

Design and implement a production-grade rate-limiting service. Cover: the
data model, an enforcement engine with per-tenant quotas and burst handling,
a caching layer, a clean SDK API, and a test plan. Explicitly surface edge
cases (clock skew, hot keys, region failover) and security considerations.

Explore the design: Read the full run and raw model outputs in the Shared Chat Session.