AI Crucible hosts about twenty models natively. OpenRouter hosts hundreds. The new Connect tier bridges the two: bring your own OpenRouter key, add any model by its ID, and run it inside the same ensemble strategies as our built-in models. Your model usage is billed by OpenRouter, not by us.
To prove the point, we ran an ensemble made entirely of OpenRouter models that AI Crucible does not host natively. The metered cost on our side was $0 for every model response.
Time to read: 8-10 minutes.
Session cost: Approx. $0.003 to run (the Gemini arbiter only); both OpenRouter models cost $0.00. Adding two judge models for the evaluation brought the metered total to about $0.02.
| Parameter | Value |
|---|---|
| Strategy | Competitive Refinement |
| Rounds | 2 |
| Web Search | Disabled |
| Arbiter | Gemini 3.1 Flash-Lite |
| Models | Cohere Command R+ and NVIDIA Nemotron 49B (both OpenRouter, your key) |
Connect is a $5 per month plan built around your own OpenRouter key. It is the cheapest tier we offer, and it changes the economics of multi-model runs.
| Plan | Price | Model usage | Token grant |
|---|---|---|---|
| Connect | $5/mo | Unlimited on your key, billed by OpenRouter | 300k tokens for platform calls |
| Starter | $19/mo | 20+ built-in models, metered | 2M unified tokens |
| Pro | $49/mo | 20+ built-in models, metered | 10M unified tokens |
On Connect, the 300k monthly token grant covers web search and any built-in platform models you mix in. Everything you run on an OpenRouter model is unmetered: the cost column in your results reads $0, because that spend lives in your OpenRouter account.
You still get all seven ensemble strategies, web search, and a standard rate limit of 10 runs per minute.
The setup takes a minute and lives under Settings.
sk-or-v1-....
Once a key is saved, you add models by their OpenRouter slug.
owner/model form, for example cohere/command-r-plus-08-2024 or nvidia/llama-3.3-nemotron-super-49b-v1.5.Use the exact slug from openrouter.ai/models. Ids often carry a version suffix, so a guessed name like nvidia/llama-3.3-nemotron-super-49b is rejected as invalid. Models without tool support still run; they just skip web grounding and other tools.
You can add up to 100 models. Image attachments work on models that support vision; PDF and JSON attachments work on all of them, since PDFs route through OpenRouter's file parser.

To show the tier in action, we ran a Competitive Refinement session: each model answers, sees the other, then refines. We picked two OpenRouter-exclusive models that AI Crucible does not host natively, so the whole ensemble ran unmetered on the OpenRouter key.
The prompt: Design the data model and API surface for a multi-tenant feature-flag service. Cover tenancy isolation, flag-evaluation latency, and a safe rollout and rollback story. Be specific enough to hand to an engineer.
This prompt rewards depth and concrete trade-offs, not retrieval, so it is a fair test of whether OpenRouter models can carry a real design task.
| Model | Role | The Pitch |
|---|---|---|
| Cohere Command R+ (08-2024) | The Structured | Clean schema, methodical tables, $0 on the bill. |
| NVIDIA Nemotron Super 49B V1.5 | The Architect | UUIDs, default flags, sharper trade-offs, also $0. |
Both are OpenRouter models running on your key, so neither consumed any AI Crucible tokens.
Both models produced a full data model and API surface on the first pass, but with different instincts.
Cohere led with a clean relational schema and tenant quotas baked in from the start. It defined quota_limits as (e.g., max flags, requests per minute) and laid out Tenants, Flags, Flag Versions, and Evaluations as separate tables. The draft was thorough but slow — it took 181 seconds, the longest single response in the run.
Nemotron was faster (86 seconds) and reached for trade-offs early. On latency it argued:
Edge Evaluation: Offloading to CDNs reduces latency but limits rule complexity.
It also seeded the schema with default_flags (predefined system flags, e.g., for observability) — an operational touch the prompt never asked for.
At the end of Round 1 the two designs agreed 61%.
In Competitive Refinement, round 2 is where each model reads the other and revises. Both models converged on a shared backbone: immutable FlagVersions with a previous_version_id to make rollback a pointer change rather than a destructive edit.
Nemotron switched its primary keys to UUIDs and kept its observability defaults. Cohere tightened its API surface and adopted the versioned-rollback model. Agreement rose to 74% — visible convergence, not collapse, since each kept its distinctive ideas.
| Pair | Round 1 | Round 2 | Trend |
|---|---|---|---|
| Cohere ↔ Nemotron | 0.61 | 0.74 | Converging |
We then ran an evaluation with two low-cost judges — Gemini 3.5 Flash and Gemini 3.1 Flash-Lite — scoring anonymized transcripts with no model names attached. The arbiter's synthesized answer was scored too.

| Model | Overall | Accuracy | Clarity | Completeness | Creativity | Usefulness |
|---|---|---|---|---|---|---|
| NVIDIA Nemotron 49B 🏆 | 9.3 | 9.5 | 9.5 | 9.5 | 8.0 | 9.5 |
| Cohere Command R+ | 8.7 | 9.2 | 9.0 | 8.5 | 8.5 | 8.0 |
| Gemini 3.1 Flash-Lite (synth) | 7.8 | 9.0 | 8.0 | 6.0 | 6.0 | 6.0 |
The headline: a model AI Crucible does not host natively won the panel outright at 9.3, and even out-scored the synthesizer that combined both answers.
Per-judge agreement. The two judges barely differed on the OpenRouter models. Gemini 3.5 Flash gave Nemotron 9.2 and Cohere 8.64; Gemini 3.1 Flash-Lite gave Nemotron 9.5 and Cohere 8.8. The model-level divergence was just 1.4% for Nemotron and 1.0% for Cohere — the judges clearly agreed on the ranking.
Where they split. The synthesized answer drew the widest spread (8.0% divergence). Both judges marked it down on completeness and creativity: the merge captured the structure of both designs but lost some of the specificity that made the originals strong.
This is the headline of the whole tier. Watch the cost column.
| Metric | Cohere Command R+ | NVIDIA Nemotron 49B |
|---|---|---|
| Total cost (us) | $0.00 | $0.00 |
| R1 Time | 181s | 86s |
| R2 Time | 60s | 58s |
| Tokens used | ~4,982 | ~8,660 |
| Response length | ~5,386 chars (R2) | ~5,983 chars (R2) |
Both models cost $0 in AI Crucible. Their real spend appears in your OpenRouter dashboard, where you control the budget. The only metered charge in the whole run was the Gemini arbiter, at about $0.003.
🏆 NVIDIA Nemotron Super 49B takes it. It was faster than Cohere in both rounds, raised the sharpest trade-offs, and the judges ranked it first on every criterion except creativity, where the two tied.
Cohere Command R+ is the steadier structuralist — clean tables and explicit quotas — but its 181-second first round and slightly thinner trade-offs left it second. Still an 8.7, and entirely free on our side.
Strategic takeaway: Connect turns model choice into a budgeting decision rather than a platform limit. If a model exists on OpenRouter, you can put it in a Crucible ensemble for the flat $5, pay its real cost on your own key, and — as this run shows — it can beat the field.
# Multi-Tenant Feature-Flag Service Design
## Data Model
- **Tenants**: tenant_id (PK), name, api_key, quota_limits.
- **Flags**: flag_id (PK), tenant_id (FK), key (unique), name, current_version_id (FK).
- **FlagVersions**: version_id (PK), flag_id (FK), previous_version_id (for rollback), rollout_strategy (JSON), status (active/draft/rolled_back).
- **Variations**: variation_id (PK), flag_id (FK), name, value (JSON).
- **TargetingRules**: rule_id (PK), version_id (FK), condition (JSON), variation_id (FK).
- **AuditLogs**: log_id (PK), tenant_id, action, timestamp, details (JSON).
## API Surface
- **Evaluation**: POST /tenants/{tenant_id}/flags/{flag_id}/evaluate
- **Rollout/Rollback**: POST .../rollout (start/modify), POST .../rollback (revert to a specific version).
- **Audit**: GET /tenants/{tenant_id}/flags/{flag_id}/audit.
## Tenancy Isolation
- **Database Sharding**: separate databases per tenant for strict isolation.
- **Row-Level Security (alternative)**: a centralized database with per-row tenant policies.
Design the data model and API surface for a multi-tenant feature-flag
service. Cover tenancy isolation, flag-evaluation latency, and a safe
rollout and rollback story. Be specific enough to hand to an engineer.
Explore the run: Read the full 2-round session and the raw model outputs in the Shared Chat Session.