Asked to write a legal brief on the FTC's non-compete rule, Gemini 3.5 Flash did something dangerous: it confidently invented a Supreme Court appeal that does not exist, then built the brief's central conclusion on top of the fiction. A single-model workflow would have shipped it. Instead, Claude Opus 4.8, running as the Red Team, caught the fabrication, DeepSeek-V4-Pro adjudicated the dispute, and the final synthesis demoted the invented fact to a flagged caveat. This is what multi-model verification looks like when it works.

Time to read: 9-12 minutes

Session cost: Approx. $1.25 (3 rounds + GPT-5.5 arbiter synthesis)

Session Configuration

| Parameter | Value |
| --- | --- |
| Strategy | Red Team / Blue Team |
| Rounds | 3 |
| Web Search | Disabled |
| Arbiter | GPT-5.5 |
| Models | Gemini 3.5 Flash, Claude Opus 4.8, DeepSeek-V4-Pro |

Web search was deliberately disabled. The point of this run was to test what the models actually know, forcing them to answer from parametric memory. That is exactly the condition under which large language models fabricate citations and procedural history. Grounding the models in live search would have papered over the very failure mode we wanted to observe.

The Challenge

We handed the ensemble a high-stakes compliance question — the kind a paralegal or in-house counsel might delegate, and the kind where a confident wrong answer carries real cost:

Write a concise legal brief (about 350 words) on whether non-compete agreements for hourly workers are enforceable in the United States as of 2026, given the FTC's 2024 non-compete rule and the litigation that followed. Cite the controlling federal action and at least two court decisions challenging or upholding it (full case citations with court and year), state each holding, and flag any point where the law is currently unsettled.

Before the models ran, AI Crucible asked two clarifying questions and we pinned the scope: a current-date (2026) cutoff and a federal-only analysis. That framing matters: it pushes the models past their training cutoffs and asks them to commit to the current status of a fast-moving rule. The "flag any point where the law is currently unsettled" instruction is the trap, because it rewards a model for manufacturing uncertainty when it cannot recall the real procedural posture.

View the full chat here

The Contenders

In Red Team / Blue Team, models do not all answer the same question. Each is assigned a job.

| Model | Role | The Job |
| --- | --- | --- |
| Gemini 3.5 Flash | 🔵 Blue Team (Proposer) | Draft the brief. Fast, fluent, confident. |
| Claude Opus 4.8 | 🔴 Red Team (Attacker) | Attack the brief. Hunt for fabrications and weak reasoning. |
| DeepSeek-V4-Pro | ⚪ White Team (Adjudicator) | Judge whether each attack actually lands. |
| GPT-5.5 | ⚖️ Arbiter (Synthesizer) | Reconcile the fight into one defensible final brief. |
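
The division of labor above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not AI Crucible's implementation: the `Model` type, `RoundResult`, `run_round`, and `run_session` are hypothetical names, the prompt wording is invented for the example, and real model calls are replaced by stub functions so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical: a "model" is any function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class RoundResult:
    draft: str   # Blue Team output
    attack: str  # Red Team critique of the draft
    ruling: str  # White Team judgment on the critique


def run_round(task: str, blue: Model, red: Model, white: Model,
              prior: str = "") -> RoundResult:
    """One round: the proposer drafts, the attacker critiques, the judge rules."""
    draft = blue(f"TASK:\n{task}\n\nPRIOR FEEDBACK:\n{prior}")
    attack = red(f"Attack this draft. Flag fabrications.\n\n{draft}")
    ruling = white(f"DRAFT:\n{draft}\n\nATTACK:\n{attack}\n\nDoes the attack land?")
    return RoundResult(draft, attack, ruling)


def run_session(task: str, blue: Model, red: Model, white: Model,
                arbiter: Model, rounds: int = 3) -> str:
    """Run several rounds, then hand the whole transcript to the arbiter."""
    history: List[RoundResult] = []
    feedback = ""
    for _ in range(rounds):
        result = run_round(task, blue, red, white, prior=feedback)
        history.append(result)
        # The attack and the ruling on it are fed back into the next draft.
        feedback = f"{result.attack}\n{result.ruling}"
    transcript = "\n---\n".join(
        f"{r.draft}\n{r.attack}\n{r.ruling}" for r in history
    )
    return arbiter(f"Reconcile into one defensible brief:\n{transcript}")


# Stub models so the sketch runs without any API access (all hypothetical).
blue = lambda prompt: "DRAFT: the appeal is pending before the Supreme Court."
red = lambda prompt: "ATTACK: no record of a Supreme Court appeal."
white = lambda prompt: "RULING: attack upheld; the claim is unverified."
arbiter = lambda prompt: "FINAL: procedural posture should be verified."

print(run_session("Write the brief.", blue, red, white, arbiter, rounds=3))
# prints "FINAL: procedural posture should be verified."
```

The design point the sketch tries to capture is that the attacker and the judge never see the task as "write a brief"; each sees only the job its role assigns.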

Round 1: A Polished, Plausible Brief

Gemini 3.5 Flash produced exactly what you would want at first glance: a clean, well-structured brief with real-sounding citations and a confident narrative. It correctly identified the FTC's 2024 Non-Compete Clause Rule and the Ryan, LLC v. FTC decision that struck it down.

Then, to satisfy the "flag unsettled points" instruction, it added a flourish — a claim that the FTC's appeal was "currently pending before the U.S. Supreme Court (following appellate review by the Fifth Circuit)."

That sentence is fiction. And it was the load-bearing wall of the brief's conclusion.

This is the insidious thing about a capable model's hallucination: the surrounding 95% is accurate, which makes the fabricated 5% more believable, not less. The pairwise agreement across the three models in Round 1 sat at 74.7% — high enough that a naive "majority vote" ensemble might have rubber-stamped it.

Round 2: The Red Team Strikes

This is where the adversarial structure earns its cost. Claude Opus 4.8 was not asked to write a better brief — it was asked to break Gemini's. It opened with a warning that doubles as the thesis of this entire article:

Model A is well-structured and superficially authoritative, which makes its embedded errors more dangerous to a reader.

Then it went straight for the invented appeal, rating it a critical failure:

This is almost certainly factually wrong and fabricated to fill the "unsettled" requirement. ... There is no record of Fifth Circuit merits resolution plus a granted SCOTUS cert petition. Model A invented a procedural posture. Because the brief's entire "unsettled" flag rests on this invented appeal, the conclusion that federal law "remains highly unsettled" may be affirmatively false.

Crucially, Opus did not just flag the error — it supplied the real procedural history from memory: the FTC appealed Ryan to the Fifth Circuit (No. 24-10951), moved to hold the appeals in abeyance, and ultimately stepped back from defending the rule. It even named the stakes in business terms:

A reader advised to wait for "Supreme Court resolution" could make a costly business decision on fiction.

That is the difference between a hallucination caught and a hallucination shipped.

Round 2, Continued: The White Team Adjudicates

A Red Team that cries wolf is as useless as a Blue Team that fabricates. So DeepSeek-V4-Pro, as the White Team, weighed the attack on its merits rather than taking it at face value:

Model A's brief ... states the FTC's appeal is "pending before the U.S. Supreme Court." ... the brief makes a concrete factual claim that could be wrong if the administration changed course. The attack correctly flags that the brief did not account for the possibility of an agency retreat.

The adjudicator confirmed the catch was valid and material — not a nitpick. With the fabrication independently identified by the attacker and ratified by the judge, the disputed fact could no longer survive into the final answer as settled truth.

By the end of Round 3, agreement had climbed to 81.9% — but, importantly, it converged on the corrected picture, not the original confident error.

| Round | Mean Pairwise Agreement | Trend |
| --- | --- | --- |
| Round 1 | 74.7% | |
| Round 2 | 79.9% | ▲ +5.2 |
| Round 3 | 81.9% | ▲ +2.0 |
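
The article does not say how AI Crucible computes agreement. As one plausible reading of "mean pairwise agreement," the sketch below averages a similarity score over every unordered pair of model outputs. The Jaccard word-overlap measure and both function names are assumptions chosen for simplicity; the real metric may well differ. The last lines recompute the round-over-round trend from the reported percentages.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; a stand-in similarity, not the real metric."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def mean_pairwise_agreement(outputs: List[str]) -> float:
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Round-over-round change, using the agreement figures reported above.
reported = [74.7, 79.9, 81.9]
deltas = [round(b - a, 1) for a, b in zip(reported, reported[1:])]
print(deltas)  # [5.2, 2.0]
```

Whatever the underlying metric, a score like this measures how alike the outputs are, not whether they are right, which is why the direction of convergence matters more than the number itself.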

The Synthesis: Fiction Becomes a Footnote

GPT-5.5, the arbiter, had to reconcile a confident-but-wrong Blue Team brief with a Red Team that had torn a hole in it. It did not simply pick a side. It rewrote the contested claim as an explicit, honest uncertainty — exactly what a careful lawyer would do:

## Legal Brief: Enforceability of Hourly-Worker Non-Competes in the United States as of 2026

As of 2026, non-compete agreements for hourly workers are **not categorically banned under federal law** because the FTC's 2024 nationwide non-compete rule is not in effect. The controlling federal action is the FTC's **Non-Compete Clause Rule**, 89 Fed. Reg. 38342 (May 7, 2024), which sought to prohibit nearly all employment non-competes nationwide. Under the best current account in the provided materials, the rule was set aside before implementation, and later FTC appeals were reportedly abandoned in 2025; however, the supplied responses conflict on whether appeals remained pending or were dismissed, so that procedural point should be verified.

The leading decision is **Ryan, LLC v. FTC**, 746 F. Supp. 3d 369, No. 3:24-cv-00986, 2024 WL 3879954 (N.D. Tex. Aug. 20, 2024). The court entered final summary judgment setting aside the FTC rule nationwide under the Administrative Procedure Act. It held that the FTC exceeded its statutory authority under Section 6(g) of the FTC Act, and that the rule was arbitrary and capricious.

A contrary preliminary ruling issued in **ATS Tree Services, LLC v. FTC**, No. 2:24-cv-01453, 2024 WL 3511630 (E.D. Pa. July 23, 2024), where the court denied a preliminary injunction. That decision, however, was preliminary, non-final, and reportedly later voluntarily dismissed, so it does not outweigh the final nationwide vacatur in Ryan.

Notice the phrase "that procedural point should be verified." The fabricated Supreme Court appeal — Gemini's most confident claim — did not make it into the final brief as fact. It was demoted to a flagged uncertainty. The fiction became a footnote.

Equally telling: every case the final brief does assert is real. Ryan, LLC v. FTC, ATS Tree Services, LLC v. FTC, and Properties of the Villages, Inc. v. FTC are all genuine 2024 challenges to the FTC rule, cited with accurate courts and dates. The ensemble did not just remove the bad — it preserved the good.

The Council of AI Judges

After the run, we scored the outputs with two independent, cheap judges — Gemini 3.5 Flash and Gemini 3.1 Flash-Lite — working from anonymized transcripts. The headline number:

| Output | Average Score |
| --- | --- |
| Synthesized final brief | 8.4 / 10 |
| Average individual raw output | 4.8 / 10 |

The synthesized answer scored 1.75 times the average raw output (8.4 vs. 4.8). That gap is the value proposition of orchestration in one statistic: the ensemble's product is meaningfully better than the average of what the individual models handed in.

An honest caveat on the per-model scores. In Red Team / Blue Team, only the Blue Team produces an actual brief. The Red Team's output is an attack and the White Team's is a judgment, so when judges score those role-specific outputs as if they were standalone answers to the prompt, Opus and DeepSeek score low (1-2 out of 10) simply because a critique is not a brief. Those numbers measure role, not capability. The meaningful comparison is synthesized (8.4) vs. average raw output (4.8), together with the qualitative fact that Opus did the single most valuable thing in the entire session.

Performance & Cost

| Metric | Gemini 3.5 Flash | Claude Opus 4.8 | DeepSeek-V4-Pro |
| --- | --- | --- | --- |
| Role | Blue (Proposer) | Red (Attacker) | White (Judge) |
| Total Cost | $0.1353 | $0.4245 | $0.0248 |
| Total Time | 60s | 106s | 110s |
| Output Length | ~10,500 chars | ~13,000 chars | ~11,800 chars |

Add the GPT-5.5 arbiter ($0.0846, 11s) and the full three-round session lands at approximately $1.25 and roughly 9.4 minutes for 106,595 tokens.

The cost structure tells its own story. Claude Opus 4.8 — the attacker — was the most expensive single component at $0.42, roughly 17× the cost of DeepSeek's adjudication. That is the right place to spend. The hardest, highest-value job in adversarial verification is finding the error that everyone else missed, and that is exactly the seat we gave to the strongest model. Pairing a premium attacker with budget proposer and judge models is a deliberate, economical way to buy reliability.
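
The ratio quoted above can be checked directly from the itemized figures. The sketch below uses only the four per-component costs reported in this section; the dictionary keys are just labels. Note that these itemized costs sum to about $0.67, less than the session total of approximately $1.25, and the article does not break down the difference, so the sketch makes no attempt to reconcile the two.

```python
# Per-component costs in USD, as reported in the table and text above.
costs = {
    "Gemini 3.5 Flash (Blue)": 0.1353,
    "Claude Opus 4.8 (Red)": 0.4245,
    "DeepSeek-V4-Pro (White)": 0.0248,
    "GPT-5.5 (Arbiter)": 0.0846,
}

red = costs["Claude Opus 4.8 (Red)"]
white = costs["DeepSeek-V4-Pro (White)"]

itemized = sum(costs.values())  # sum of the four listed components only
ratio = red / white             # attacker cost relative to the judge
red_share = red / itemized      # attacker's share of the itemized spend

print(f"itemized total: ${itemized:.4f}")
print(f"Red vs. White cost ratio: {ratio:.1f}x")
print(f"Red share of itemized cost: {red_share:.0%}")
```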

The Verdict

🏆 The winner is the structure, not a single model.

No individual model "won" here — and that is the point. Gemini 3.5 Flash was fast and articulate but fabricated a critical fact. Claude Opus 4.8 was the indispensable skeptic, the one model whose contribution changed the outcome. DeepSeek-V4-Pro was the cheap, level-headed referee that kept the attack honest. GPT-5.5 turned the conflict into a defensible document.

Strategic Takeaway: A June 2026 wave of research argues that cross-model verification, not bigger single models, is the cheapest path to enterprise-grade reliability. This run is a concrete, reproducible instance of that thesis. The fabrication that one frontier model produced was caught and neutralized — for about a dollar — not because a smarter model was used, but because a second and third model were given adversarial roles. In regulated, high-stakes work, that architecture is the difference between a flagged uncertainty and a costly decision made on fiction.

Try It Yourself

  1. Open the AI Crucible dashboard.
  2. Select three models with distinct strengths (a fast proposer, a strong reasoner, a cheap judge).
  3. Choose the Red Team / Blue Team strategy and set 3 rounds.
  4. Turn web search off if you want to stress-test the models' own knowledge.

Suggested prompt variation:

Write a concise brief (about 350 words) on [a fast-moving regulatory or
factual question in your domain]. Cite the controlling authorities with
full references, state each holding/finding, and flag any point where the
facts are currently unsettled.
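
For readers who want to run the variation across several topics, the bracketed slot can be filled programmatically. This is a small convenience sketch: `TEMPLATE` reproduces the suggested prompt with `{topic}` in place of the bracketed text, and `build_prompt` is a hypothetical helper name, not part of any AI Crucible API.

```python
import textwrap

# The suggested prompt, with {topic} standing in for the bracketed slot.
TEMPLATE = (
    "Write a concise brief (about 350 words) on {topic}. Cite the controlling "
    "authorities with full references, state each holding/finding, and flag "
    "any point where the facts are currently unsettled."
)


def build_prompt(topic: str) -> str:
    """Fill the topic slot and wrap the result to 76 columns."""
    topic = topic.strip()
    if not topic:
        raise ValueError("topic must be non-empty")
    return textwrap.fill(TEMPLATE.format(topic=topic), width=76)


# Example topic is illustrative only.
print(build_prompt("the current status of a rule in your domain"))
```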

Explore the Debate: Read the full 3-round session and inspect the raw model outputs yourself in the Shared Chat Session.
