AI Crucible Evaluations: Implementation Guide for Multi-Judge Analysis

AI Crucible's evaluation system compares model outputs using sophisticated judging methodologies. This guide covers the implementation details, configuration options, and best practices for getting accurate, actionable evaluation results.

Time to read: 15 minutes
Topics covered: Evaluation modes, judge configurations, dashboard features, best practices


What is the AI Crucible evaluation system?

AI Crucible evaluations analyze model responses through automated judging. The system supports two evaluation modes (side-by-side and pointwise), multiple judge configurations, and weighted consensus scoring. Evaluations are saved automatically and accessible through a centralized dashboard for review and analysis.

The evaluation process happens after your ensemble run completes. You can configure evaluation settings before running your chat or apply them retroactively to past sessions.


How do evaluation modes work?

AI Crucible supports two distinct evaluation methodologies: side-by-side (standard) and pointwise. You can toggle between these modes by clicking the settings icon (sliders) in the evaluation panel.

What is side-by-side evaluation?

Side-by-side evaluation presents all responses simultaneously to judge models for direct comparison. Judges explicitly rank responses relative to each other, identifying the best answer through competitive analysis.

Use cases:

Evaluation Mode Selection: Side-by-Side Option Selected

Sample evaluation results:

For a prompt asking three models (Gemini 3 Pro, Kimi K2.5, Claude Opus 4.5) to design a distributed caching strategy:

| Model | Score | Reasoning Summary |
| --- | --- | --- |
| Gemini 3 Pro | 9.5/10 | Highly actionable, concrete patterns like Hospital Queue |
| Kimi K2.5 | 9.5/10 | Strong lifecycle focus, aligns with senior engineering intuition |
| Claude Opus 4.5 | 9.1/10 | Sovereign cells concept is sound but less prescriptive |

Side-by-Side Evaluation Results


What is pointwise evaluation?

Pointwise evaluation assesses each response independently without comparison. Judges evaluate quality against absolute standards rather than relative performance. Multiple responses can receive the same score.

Use cases:

Evaluation Mode Selection: Pointwise Option Selected

Sample evaluation results:

Same caching prompt, pointwise mode with Gemini 3 Flash as judge:

| Model | Score | Independent Assessment |
| --- | --- | --- |
| Gemini 3 Pro | 9.4/10 | Excellent detail on isolation patterns |
| Kimi K2.5 | 9.3/10 | Strong operational lifecycle focus |
| Claude Opus 4.5 | 9.0/10 | Good theoretical architecture |

Key difference: Scores are NOT relative. A response can score 9/10 even if another response is superior—each is judged against absolute quality standards.

Pointwise Evaluation Results
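The practical difference between the two modes is what a single judge call sees. The sketch below is illustrative only: `call_judge` is a hypothetical helper that sends a prompt to a judge model and returns its parsed output, and the prompt wording is not the text AI Crucible actually uses.

```python
# Minimal sketch of the two judging patterns. `call_judge` is a hypothetical,
# caller-supplied function (judge, prompt_text) -> parsed judge output; it is
# not part of the AI Crucible API.

def side_by_side(judge, prompt, responses, call_judge):
    """One judge call sees every response and ranks them against each other."""
    block = "\n\n".join(
        f"Response {i + 1}:\n{text}" for i, text in enumerate(responses)
    )
    return call_judge(
        judge,
        f"Prompt:\n{prompt}\n\n{block}\n\n"
        "Compare the responses directly and rank them from best to worst, "
        "scoring each 0-10.",
    )

def pointwise(judge, prompt, responses, call_judge):
    """One judge call per response; each is scored against absolute standards."""
    return [
        call_judge(
            judge,
            f"Prompt:\n{prompt}\n\nResponse:\n{text}\n\n"
            "Score this response 0-10 against absolute quality standards, "
            "without reference to any other response.",
        )
        for text in responses
    ]
```

Note that pointwise mode makes one judge call per response, which is why it tends to take longer and cost more than a single comparative call.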


How do I configure judges for evaluations?

AI Crucible supports single-judge and multi-judge configurations. The number of judges affects cost, evaluation time, and result reliability.

When should I use a single judge?

Single-judge evaluations use one model to evaluate all responses. This approach minimizes cost and latency while maintaining consistent scoring perspective.

Benefits:

Limitations:

Best for: Routine evaluations, cost-sensitive workflows, situations requiring consistent perspective

[!TIP] Access advanced configuration by clicking the sliders icon next to the judge selector.

Single Judge Configuration


When should I use multiple judges?

Multi-judge evaluations use 2+ models to assess responses independently. Scores are aggregated using weighted or arithmetic mean consensus.

Multi-Judge Configuration

Weighted consensus scoring:

When enabled via the settings menu (sliders icon), judges who assign higher scores receive more influence on the consensus. This approach assumes higher confidence correlates with better judgment quality.
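As a rough sketch of the idea, assuming a quadratic weighting scheme in which a judge's weight is proportional to the square of the score it assigns (the FAQ below mentions a quadratic formula, but the exact coefficients are not documented here):

```python
# Illustrative comparison of arithmetic vs weighted consensus. The quadratic
# weighting below is an assumption for demonstration, not the exact formula
# AI Crucible applies.

def arithmetic_consensus(scores: list[float]) -> float:
    """Plain mean: every judge counts equally."""
    return sum(scores) / len(scores)

def weighted_consensus(scores: list[float]) -> float:
    """Higher-scoring judges get quadratically more influence."""
    weights = [s ** 2 for s in scores]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, scores)) / total

judge_scores = [9.5, 9.0, 7.0]             # three judges scoring one response
print(arithmetic_consensus(judge_scores))  # 8.5
print(weighted_consensus(judge_scores))    # ~8.76, pulled toward the higher scores
```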


How does the evaluations dashboard work?

The Evaluations Dashboard provides centralized access to all historical evaluations with filtering, tagging, and export capabilities. Access it via Settings > Evaluations History in the navigation menu.

Evaluations Dashboard Main View

What features does the dashboard include?

The dashboard displays up to the 50 most recent evaluations and updates live as new evaluations complete.

Key columns:

Actions available:


How do I filter and search evaluations?

The dashboard provides three filtering mechanisms that work together:

Dashboard Filters

Strategy filter: Select from dropdown to show only evaluations from specific strategies (Competitive Refinement, Expert Panel, Collaborative Synthesis, etc.). Default is "All".

Tag filter: Type to search for evaluations containing specific tags. Matches are case-insensitive and search across all tags assigned to each evaluation.

Prompt search: Enter text to search within prompt content. Useful for finding specific evaluations by topic or identifier.

All filters combine using AND logic—evaluations must match all active filters to appear in the results.
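Conceptually, the combination behaves like the sketch below, applied here to exported evaluation records rather than the live dashboard. The field names (`strategy`, `tags`, `prompt`) are assumptions for illustration, not a documented schema.

```python
# Sketch of the dashboard's AND filter logic over exported evaluation records.
# Field names are assumed for illustration only.

def matches(evaluation: dict, strategy: str | None = None,
            tag: str | None = None, prompt_text: str | None = None) -> bool:
    """An evaluation appears only if it satisfies every active filter."""
    if strategy and evaluation.get("strategy") != strategy:
        return False
    if tag and not any(tag.lower() in t.lower() for t in evaluation.get("tags", [])):
        return False
    if prompt_text and prompt_text.lower() not in evaluation.get("prompt", "").lower():
        return False
    return True

evaluations = [
    {"strategy": "Expert Panel", "tags": ["caching", "baseline"],
     "prompt": "Design a distributed caching strategy"},
    {"strategy": "Competitive Refinement", "tags": ["caching"],
     "prompt": "Design a distributed caching strategy"},
]
visible = [e for e in evaluations if matches(e, strategy="Expert Panel", tag="caching")]
```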


How do I tag evaluations for organization?

Tagging provides custom organization beyond built-in filters. Add tags when creating evaluations or retroactively from the dashboard.

Common tag patterns:

Evaluation Tagging Interface

Tags appear in the dashboard table and are included in CSV/JSON exports for downstream processing.


How do I export evaluation data?

Export functionality generates CSV or JSON files for offline analysis, archival, or sharing.

CSV export: Generates a spreadsheet-compatible file with columns for ID, Date, Prompt, Strategy, Method, Weighted status, Judges, Scores, Token usage, Cost, and Tags. Perfect for Excel or Google Sheets analysis.

JSON export: Exports complete evaluation documents including full criteria breakdowns, individual judge reasoning, and all metadata. Ideal for programmatic processing or detailed archival.

Export scope: Exports include only evaluations visible after applying active filters. To export all evaluations, remove all filters first.
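A minimal sketch of offline analysis on a CSV export, assuming pandas is available, that the downloaded file is named `evaluations_export.csv`, and that the `Strategy`, `Cost`, and `Scores` columns hold one value per row; adjust the column names to match your actual export.

```python
# Sketch: summarize an exported CSV offline. Column names follow the list
# above but are assumptions about the export's exact headers.
import pandas as pd

df = pd.read_csv("evaluations_export.csv")

# Average cost and consensus score per strategy, assuming numeric columns.
summary = (
    df.groupby("Strategy")[["Cost", "Scores"]]
      .mean()
      .sort_values("Scores", ascending=False)
)
print(summary)
```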

Common use cases:


What are the evaluation criteria?

Judges evaluate responses using five standard criteria, each scored 0-10:

| Criterion | Definition |
| --- | --- |
| Accuracy | Factual correctness and technical precision |
| Creativity | Novel approaches and innovative thinking |
| Clarity | Communication effectiveness and readability |
| Completeness | Thoroughness and coverage of requirements |
| Usefulness | Practical applicability and actionable value |

Each criterion contributes equally (20%) to the overall score. The overall score is the arithmetic mean of all five criteria.
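For example, with illustrative criterion scores (the values below are not from a real evaluation):

```python
# Overall score = arithmetic mean of the five criteria, each weighted 20%.
criteria = {
    "accuracy": 9.0,
    "creativity": 8.0,
    "clarity": 9.5,
    "completeness": 8.5,
    "usefulness": 9.0,
}
overall = sum(criteria.values()) / len(criteria)
print(overall)  # (9.0 + 8.0 + 9.5 + 8.5 + 9.0) / 5 = 8.8
```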

Criteria Breakdown

Strategy-specific criteria:

For team-based strategies like Red Team/Blue Team, Debate Tournament, or Hierarchical, role-specific criteria replace the standard five. For example:


Best Practices

How do I choose the right evaluation method?

Match evaluation mode to your use case:

| Use Case | Recommended Method | Rationale |
| --- | --- | --- |
| Selecting best response | Side-by-Side | Direct comparison highlights strengths |
| Quality assurance | Pointwise | Absolute standards, not relative |
| Benchmarking | Side-by-Side + Multi-Judge | Reduces bias, provides variance |
| Cost optimization | Side-by-Side + Single Judge | Minimizes LLM calls |

How many judges should I use?

Judge count affects cost, reliability, and bias:

How do I optimize token usage?

Token optimization strategies:

How do I manage dashboard storage?

The dashboard shows the 50 most recent evaluations for performance. Management strategies:


Frequently Asked Questions

What is the difference between weighted and arithmetic mean scoring?

Weighted scoring gives higher-confidence judges more influence on the consensus score. Judges who assign higher scores receive increased weight based on a quadratic formula. Arithmetic mean treats all judges equally regardless of their individual scores.

Use weighted scoring when judge expertise varies or when you want confident assessments to dominate consensus.

Can I evaluate responses from different strategies?

Yes. Evaluations work with any ensemble strategy output. The evaluation system receives model responses and metadata but doesn't depend on how responses were generated.

You can evaluate Competitive Refinement, Expert Panel, Collaborative Synthesis, or any other strategy outputs using the same evaluation framework.

How long do evaluations take?

Side-by-side evaluations complete in 10-20 seconds with a single judge and 30-60 seconds with three judges. Pointwise evaluations take 20-40 seconds with a single judge and 60-120 seconds with three judges evaluating four responses.

Time varies based on model selection, response length, and provider latency.

How do I access evaluation settings?

Click the sliders icon next to the judge selection input to open the configuration menu. Here you can toggle Weighted Scoring (for multi-judge setups) and select your Evaluation Method (Side-by-Side vs Pointwise).

Why is my evaluation divergence high?

High divergence (>10%) indicates judge disagreement. Common causes:

High divergence isn't necessarily problematic—it reveals genuine uncertainty. Consider adding more judges or refining evaluation criteria.

Can I rerun evaluations with different judges?

No, evaluations are immutable after creation. To evaluate with different judges, generate a new evaluation from the same chat session. This preserves historical comparisons and prevents data corruption.

Each evaluation creates an independent record in your dashboard.

How do I interpret variance metrics?

Variance measures score spread across judges. Low variance (<0.2) indicates agreement, moderate variance (0.2-0.5) shows some disagreement, high variance (>0.5) reveals significant judge divergence.

Divergence percentage converts variance to a 0-100% scale. Aim for <10% divergence for confident consensus results.
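As a small sketch, the variance bands above can be applied directly to the judges' overall scores. The variance here is the population variance of those scores; how AI Crucible rescales variance into the divergence percentage is not spelled out in this guide, so the sketch only classifies the variance band.

```python
# Classify judge agreement using the variance bands described above.
from statistics import pvariance

judge_scores = [9.5, 9.3, 9.0]
variance = pvariance(judge_scores)   # ~0.042 for these scores

if variance < 0.2:
    band = "low (judges agree)"
elif variance <= 0.5:
    band = "moderate (some disagreement)"
else:
    band = "high (significant divergence)"
print(f"variance={variance:.3f}, {band}")
```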

What happens to evaluations when I delete a chat?

Evaluations are stored independently from chat sessions. Deleting a chat does not delete associated evaluations—they remain accessible in the dashboard. However, the navigation link to the original chat will no longer work.

To fully remove evaluation data, delete evaluations individually from the dashboard.

Can I evaluate chats created before evaluation features existed?

Yes, but only if the chat session preserved model responses. Retroactive evaluation generates new judge assessments for historical data. Navigate to the chat, click "Evaluate", and configure judges as normal.

Older chats that didn't save individual model responses cannot be evaluated.


Related Articles