Chinese Flagship Showdown: Qwen3.7-Max vs Kimi K2.6 vs DeepSeek V4

China's three leading labs each shipped a new flagship this season. Alibaba's Qwen3.7-Max is a native extended-thinking model that leads SWE-Pro and Terminal-Bench. Moonshot's Kimi K2.6 is a multimodal agentic model with deep reasoning. DeepSeek's V4-Pro brings dual thinking and non-thinking modes at a famously low price.

So which one is actually best? We put all three through one demanding system-design challenge and scored them with two neutral Western judges. No cherry-picking. Just a hard, ambiguous problem that separates real engineering depth from confident hand-waving.

The contenders:

Qwen3.7-Max — Alibaba's flagship, 1M context, native extended-thinking
Kimi K2.6 — Moonshot's multimodal agentic model, 256K context, deep reasoning
DeepSeek-V4-Pro — DeepSeek's most capable V4 model, dual thinking modes, 1M context

Time to read: 7–9 minutes

The Benchmark: Real-Time Fraud Detection at 50,000 TPS

We gave all three models the same overloaded prompt using the Competitive Refinement strategy, run as a single first-pass round. The task demands breadth, mathematical depth, and hard judgment calls:

"Design a real-time payment fraud detection system for a fintech processing 50,000 transactions per second across 40 countries." The full prompt added seven requirements: sub-100ms p99 latency, evolving fraud with minimal labels and ~0.1% class imbalance, a streaming-plus-historical feature pipeline, regulator-grade explainability, a delayed-label feedback loop, adversarial defense, and a cost framework for the false-positive trade-off.

This prompt is deliberately punishing. It demands breadth (seven subsystems), depth (concrete math, not buzzwords), judgment (real trade-offs), and structure (a worked numerical example). Weak models produce generic advice here. Strong ones produce a blueprint.

Parameter	Value
Strategy	Competitive Refinement
Rounds	1
Web Search	Disabled
Arbiter	Grok 4.3
Models	Qwen3.7-Max, Kimi K2.6, DeepSeek-V4-Pro

How We Scored It

Every response was independently evaluated by two AI judges from different providers: Gemini 3.1 Pro and Claude Opus 4.8. Using neutral, non-Chinese judges reduces home-team bias when scoring Chinese models. The dual-judge consensus also catches mistakes a single model would miss.

Criterion	What It Measures
Accuracy	Correctness of architecture, math, and compliance claims
Clarity	Structure, readability, navigability
Completeness	Coverage of all seven requirements plus the worked example
Creativity	Novel patterns and original framing
Usefulness	Could an engineer build from this spec?

The Results

Final Consensus Scores

Model	Consensus	Gemini 3.1 Pro	Claude Opus 4.8
🥇 Qwen3.7-Max	9.5 / 10	10.0	8.8
🥈 DeepSeek-V4-Pro	7.6 / 10	7.6	7.7
🥉 Kimi K2.6	6.2 / 10	6.2	6.2

The new Qwen flagship won decisively. Gemini 3.1 Pro handed it a perfect 10 across every criterion. Claude Opus 4.8 was more reserved at 8.8, but still ranked it first. Both judges produced the exact same ordering — and even landed on an identical 6.2 for Kimi K2.6.

Criterion-Level Breakdown

Criterion	Qwen3.7-Max	DeepSeek-V4-Pro	Kimi K2.6
Accuracy	10 / 9.0	8.5 / 8.0	8.0 / 7.5
Clarity	10 / 9.0	8.0 / 8.0	8.5 / 7.5
Completeness	10 / 9.5	6.0 / 6.5	2.0 / 3.0
Creativity	10 / 7.5	8.5 / 8.5	9.5 / 8.5
Usefulness	10 / 9.0	7.0 / 7.5	3.0 / 4.5

Scores shown as Gemini Judge / Claude Judge.

The completeness column tells the story of the whole benchmark. Look at Kimi: top-tier creativity, bottom-tier completeness.
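
The per-judge totals in the consensus table can be reproduced from the criterion rows. The check below is a minimal sketch that assumes each judge's total is the unweighted mean of its five criterion scores; the write-up does not state the weighting, and it does not say how the consensus column itself is derived, so only the per-judge columns are checked.

```python
# Criterion scores copied from the breakdown table as (Gemini, Claude) pairs.
SCORES = {
    "Qwen3.7-Max": {
        "Accuracy": (10.0, 9.0), "Clarity": (10.0, 9.0), "Completeness": (10.0, 9.5),
        "Creativity": (10.0, 7.5), "Usefulness": (10.0, 9.0),
    },
    "DeepSeek-V4-Pro": {
        "Accuracy": (8.5, 8.0), "Clarity": (8.0, 8.0), "Completeness": (6.0, 6.5),
        "Creativity": (8.5, 8.5), "Usefulness": (7.0, 7.5),
    },
    "Kimi K2.6": {
        "Accuracy": (8.0, 7.5), "Clarity": (8.5, 7.5), "Completeness": (2.0, 3.0),
        "Creativity": (9.5, 8.5), "Usefulness": (3.0, 4.5),
    },
}


def judge_means(criteria):
    """Return (gemini_mean, claude_mean) over all criteria, equally weighted (assumption)."""
    gemini = sum(g for g, _ in criteria.values()) / len(criteria)
    claude = sum(c for _, c in criteria.values()) / len(criteria)
    return round(gemini, 1), round(claude, 1)


for model, criteria in SCORES.items():
    print(model, judge_means(criteria))
# Qwen3.7-Max (10.0, 8.8)
# DeepSeek-V4-Pro (7.6, 7.7)
# Kimi K2.6 (6.2, 6.2)
```

Under that equal-weight assumption, all six per-judge numbers match the table.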

What Each Model Revealed

Qwen3.7-Max: The Master-Class Architect

Qwen3.7-Max produced the most complete answer by a wide margin — 4,965 words of dense, math-grounded design. It framed the system as a cascaded "Fast and Slow" architecture inspired by dual-process cognition. Cheap filters clear obvious traffic; expensive graph models only fire on borderline cases.
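
To make the cascade idea concrete, here is a minimal sketch of borderline-only escalation. It is not Qwen's actual design: the function name `cascade_score`, the `low` and `high` cut-offs, and both stand-in models are hypothetical, chosen only to show the routing logic.

```python
from typing import Callable


def cascade_score(txn: dict,
                  fast_model: Callable[[dict], float],
                  slow_model: Callable[[dict], float],
                  low: float = 0.05,
                  high: float = 0.90) -> tuple[float, str]:
    """Score with the cheap model; call the expensive one only for borderline scores."""
    p = fast_model(txn)
    if p < low or p > high:
        return p, "fast"  # clearly benign or clearly fraudulent: stop here
    return slow_model(txn), "slow"  # ambiguous: pay for the heavier model


# Hypothetical stand-ins for a cheap filter and an expensive graph model.
def toy_fast(txn: dict) -> float:
    return min(1.0, txn["amount"] / 10_000)


def toy_slow(txn: dict) -> float:
    return 0.7 if txn["new_device"] else 0.1


print(cascade_score({"amount": 100, "new_device": False}, toy_fast, toy_slow))
print(cascade_score({"amount": 3_000, "new_device": True}, toy_fast, toy_slow))
```

The first transaction is cleared by the cheap model alone; the second falls in the borderline band and is escalated.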

The depth was the differentiator. It posed the problem as Positive-Unlabeled learning with a correctly stated unbiased risk estimator. It modeled weeks-late labels with a Weibull delay distribution and soft-label cross-entropy. It derived the Bayes-optimal cost-sensitive threshold, then split it per customer segment. Gemini's judge called it an "outstanding, master-class response."
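
For readers who have not seen a cost-sensitive threshold before, the textbook version is short. The sketch below is not taken from Qwen's answer: it uses the standard result that blocking is cheaper in expectation once the fraud probability exceeds cost_fp / (cost_fp + cost_fn), assuming correct decisions cost nothing. The segment names and dollar costs are hypothetical.

```python
def bayes_threshold(cost_fp: float, cost_fn: float) -> float:
    """Fraud probability above which blocking has the lower expected cost.

    Block when p * cost_fn > (1 - p) * cost_fp, which rearranges to
    p > cost_fp / (cost_fp + cost_fn). Assumes correct decisions cost zero.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical per-segment costs in dollars (illustrative, not benchmark data):
# cost_fp = cost of wrongly blocking a good payment, cost_fn = cost of missed fraud.
SEGMENT_COSTS = {
    "retail": {"cost_fp": 5.0, "cost_fn": 95.0},
    "high_value": {"cost_fp": 40.0, "cost_fn": 360.0},
}

thresholds = {seg: bayes_threshold(**costs) for seg, costs in SEGMENT_COSTS.items()}
print(thresholds)
```

With these made-up costs the retail segment blocks at a 5% fraud probability and the high-value segment at 10%, which is the sense in which one threshold gets split per segment.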

Best for: Teams that need exhaustive, mathematically rigorous designs they can build from directly.

Kimi K2.6: Brilliant, But Unfinished

Kimi K2.6 earned the highest creativity score in the field when the two judges are averaged (9.5 and 8.5), although Gemini alone rated Qwen's creativity higher at 10. It rejected the "standard XGBoost + Kafka + SHAP playbook" for a "multi-velocity liquid system." Its standout idea: a Hedge tier that exploits post-authorization reversals to buy 500ms on the hardest 2% of transactions. It paired a Mamba state-space trunk with a differentiable neural dictionary for instant adaptation.

Then it ran out of room. At just 584 words, the answer cut off before covering most of the seven requirements. Gemini's judge flagged that the response "cuts off extremely early." That gap explains the brutal completeness scores of 2 and 3 out of 10. The reasoning budget consumed the answer.

Best for: Ideation and novel framing, when you want a creative spark rather than a finished spec.

DeepSeek-V4-Pro: The Value Champion

DeepSeek-V4-Pro landed in the middle on quality — but it is the story of the benchmark on cost. Its design was a genuinely distinct gated ensemble: an unsupervised autoencoder, a cost-weighted LightGBM, and an FTRL online learner, combined by a contextual-bandit meta-learner trained with policy gradient on delayed rewards. Claude's judge called it "the most original handling" of the delayed-label problem.

It delivered roughly 80% of Qwen's consensus score (7.6 vs 9.5) at about one-ninth of the cost ($0.0143 vs $0.1345 per run), with the fastest runtime in the group (200s). For high-volume workloads, that trade is hard to argue with.

Best for: Cost-sensitive production at scale, where "very good and cheap" beats "perfect and pricey."

The Cost and Speed Reality Check

The scoreboard rewards Qwen. The invoice rewards DeepSeek.

Model	Words	Time	Cost per run	vs DeepSeek
Qwen3.7-Max	4,965	232s	$0.1345	9.4× pricier
Kimi K2.6	584	307s	$0.0397	2.8× pricier
DeepSeek-V4-Pro	1,675	200s	$0.0143	Baseline

Qwen3.7-Max won, but it cost 9.4× more than DeepSeek-V4-Pro and generated the most output tokens. Kimi was the slowest of the three despite the shortest answer, because its reasoning trace ate most of its output budget. The whole three-model run, including both judges, cost $0.38 in total.

For one fraud-detection design, the premium is trivial. Scaled to thousands of design reviews a month, the gap between Qwen and DeepSeek becomes a real budget line — and DeepSeek's quality stays well above "good enough."
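
To see how the per-run prices scale, the sketch below multiplies them out. The per-run costs come from the table above; the volume of 5,000 runs a month is a hypothetical figure, and the totals cover model calls only, not judge costs.

```python
# Per-run costs in USD, copied from the cost table.
RUNS = {
    "Qwen3.7-Max": {"cost_usd": 0.1345},
    "Kimi K2.6": {"cost_usd": 0.0397},
    "DeepSeek-V4-Pro": {"cost_usd": 0.0143},
}


def cost_summary(runs, baseline_model, monthly_runs):
    """Return {model: (cost multiple vs baseline, monthly spend in USD)}."""
    baseline = runs[baseline_model]["cost_usd"]
    return {
        model: (r["cost_usd"] / baseline, r["cost_usd"] * monthly_runs)
        for model, r in runs.items()
    }


MONTHLY_RUNS = 5_000  # hypothetical volume, not a figure from the benchmark

summary = cost_summary(RUNS, "DeepSeek-V4-Pro", MONTHLY_RUNS)
for model, (multiple, monthly) in summary.items():
    print(f"{model}: {multiple:.1f}x baseline, ${monthly:,.2f}/month")
# Qwen3.7-Max: 9.4x baseline, $672.50/month
# Kimi K2.6: 2.8x baseline, $198.50/month
# DeepSeek-V4-Pro: 1.0x baseline, $71.50/month
```

At that assumed volume, the difference between Qwen and DeepSeek is about $600 a month.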

How Different the Answers Were

AI Crucible measures pairwise semantic similarity between responses. Higher numbers mean the models converged on similar designs.

Pair	Similarity
Qwen3.7-Max ↔ DeepSeek-V4-Pro	0.77
Kimi K2.6 ↔ DeepSeek-V4-Pro	0.74
Qwen3.7-Max ↔ Kimi K2.6	0.66

Kimi was the outlier. Its "liquid system" framing diverged most from the others, which is exactly what you expect from the highest-creativity, lowest-completeness answer. Round agreement across all three landed at 72.2%.
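
One simple way to confirm the outlier from the table is to average each model's similarity to the other two. This is a sketch over the published pair scores only; it does not reproduce AI Crucible's similarity metric or its round-agreement figure, neither of which is described here.

```python
# Pairwise similarity scores copied from the table above.
PAIRS = {
    ("Qwen3.7-Max", "DeepSeek-V4-Pro"): 0.77,
    ("Kimi K2.6", "DeepSeek-V4-Pro"): 0.74,
    ("Qwen3.7-Max", "Kimi K2.6"): 0.66,
}


def mean_similarity(pairs):
    """Average each model's similarity over every pair it appears in."""
    totals, counts = {}, {}
    for (a, b), score in pairs.items():
        for model in (a, b):
            totals[model] = totals.get(model, 0.0) + score
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


means = mean_similarity(PAIRS)
outlier = min(means, key=means.get)  # model least similar to the others on average
print({m: round(v, 3) for m, v in means.items()}, outlier)
```

Kimi K2.6 has the lowest average (0.70, against 0.715 for Qwen3.7-Max and 0.755 for DeepSeek-V4-Pro), which matches the reading above.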

Choosing the Right Chinese Flagship

Choose Qwen3.7-Max if you want the deepest, most complete, most rigorous output and can absorb a higher per-call cost. It is the model to reach for on hard design problems where correctness and coverage matter more than price.

Choose DeepSeek-V4-Pro if cost efficiency is paramount or you run high volumes. It delivered about 80% of the winner's quality at one-ninth of the cost, with original ideas of its own.

Choose Kimi K2.6 if you want creative, unconventional framing and idea generation. Just budget enough output tokens, or its long reasoning trace can starve the final answer.

Frequently Asked Questions

What is the best Chinese AI model in June 2026?

On our fraud-detection benchmark, Qwen3.7-Max was the clear winner at 9.5/10, judged by Gemini 3.1 Pro and Claude Opus 4.8. DeepSeek-V4-Pro followed at 7.6, and Kimi K2.6 at 6.2. The "best" model still depends on your priority: Qwen for depth, DeepSeek for value, Kimi for creativity. Results vary by task, so validate on your own prompts.

How does Qwen3.7-Max compare to DeepSeek-V4-Pro?

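Averaged across both judges, Qwen3.7-Max scored higher on every criterion, most of all completeness (9.75 vs 6.25) and usefulness (9.5 vs 7.25). The one split was creativity, where Claude Opus 4.8 rated DeepSeek-V4-Pro higher (8.5 vs 7.5). On price, Qwen3.7-Max cost 9.4× more per run, and DeepSeek-V4-Pro was the fastest model in the test. DeepSeek delivered roughly 80% of Qwen's quality at about 11% of the cost, making it the value pick for high-volume work.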

Why did Kimi K2.6 score lowest despite being creative?

Kimi K2.6 earned the highest averaged creativity score (9.5 and 8.5) but the lowest completeness scores (2 and 3 out of 10). Its answer was only 584 words and cut off before covering most requirements. Its long internal reasoning trace consumed the output token budget, leaving little room for the final design. Both judges independently caught the gap.

Which judges scored the models, and why use two?

Gemini 3.1 Pro and Claude Opus 4.8 judged the anonymized answers. Using two judges from different providers reduces single-model bias and home-team favoritism. Here both judges produced the identical ranking, which raises confidence in the verdict. Strong, diverse judges catch polished-but-incomplete answers that one judge might reward.

How much does it cost to run a comparison like this on AI Crucible?

This three-model Competitive Refinement run with dual-judge evaluation cost about $0.38 total. Individual model costs ranged from $0.0143 for DeepSeek-V4-Pro to $0.1345 for Qwen3.7-Max. A typical multi-model comparison runs under $1, making rigorous benchmarking accessible for any team.

Ready to benchmark these models on your own prompts? Start a free comparison on AI Crucible →