A 200-page report does not fit comfortably in a prompt. Stuff it in whole and you pay for every page, slow the run down, and bury the relevant passage in noise. AI Crucible now treats large PDFs the way a careful researcher would: index the document once, then search it for exactly what each question needs, with page numbers behind every claim.
We put it to work on a real document, and the run told a sharper story than we expected. One frontier model fabricated the numbers even with the PDF indexed. The ensemble, the page citations, and a panel of judges caught it.
Time to read: 9-11 minutes.
Session cost: Approx. $0.057 to run (three models + a Gemini Flash-Lite arbiter); about $0.11 including the evaluation.
| Parameter | Value |
|---|---|
| Strategy | Expert Panel |
| Rounds | 1 |
| Web Search | Disabled (document-only) |
| Arbiter | Gemini 3.1 Flash-Lite |
| Models | Gemini 3.1 Pro, Kimi K2.6, Qwen3.5 Plus |
| Attachment | An 11-page, 3.6 MB statistical-methods PDF |
Small PDFs are unchanged. Anything under 1.5 MB still goes straight into the model. Above that threshold, AI Crucible switches to retrieval.
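The routing rule above can be sketched in a few lines. This is a minimal illustration, not AI Crucible's actual code: the function name, the constant, the binary-megabyte conversion, and the handling of a file exactly at the limit are all assumptions.

```python
# Hypothetical routing sketch; names and boundary handling are assumptions.
INLINE_LIMIT_BYTES = int(1.5 * 1024 * 1024)  # 1.5 MB, assuming binary megabytes


def route_pdf(size_bytes: int) -> str:
    """Return "inline" for small PDFs and "retrieval" for large ones."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Assumption: a file exactly at the limit is treated as large.
    return "inline" if size_bytes < INLINE_LIMIT_BYTES else "retrieval"


if __name__ == "__main__":
    print(route_pdf(400_000))    # a small PDF
    print(route_pdf(3_600_000))  # roughly the 3.6 MB attachment used in this run
```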
When you attach a large PDF, we:
text-embedding-3-small and index them.
The result: the document never floods the context window, and the models retrieve only the passages a given question needs.
Three tools appear automatically when a large PDF is in the run. They are scoped to the documents indexed in that run, so a model cannot reach into other conversations.
| Tool | What it does |
|---|---|
| pdf_search | Semantic search across the document. Returns ranked passages with page numbers. |
| pdf_list_docs | Lists the large PDFs attached to the conversation, with their IDs. |
| pdf_get_pages | Returns the verbatim text of specific pages (up to 10 per call) for exact quotes. |
A typical flow: the model runs pdf_search to find candidate passages, then pdf_get_pages to read the exact wording before it commits to a number. Every figure it reports can carry a page citation.
Scanned or image-only PDFs have no extractable text, so they cannot be indexed. Those no longer fail silently — you get an explicit notice that the file had no readable text.
We attached a real 11-page PDF — "Statistical Methods Behind the Site," which documents how a Bulgarian-elections analytics site computes its published figures (polling-accuracy scores, Benford screens, risk indices). At 3.6 MB it crossed the threshold and was RAG-indexed.
We ran an Expert Panel: three models answer independently as distinct personas (a document analyst, a concise explainer, and a numbers auditor), then the arbiter synthesizes one answer.
The prompt: Using only this document, answer these and cite the page number for every figure: 1) How is the polling-accuracy leaderboard score computed? Give the exact shrinkage formula and the value of k. 2) In the Market Links worked example, what are the grade, raw MAE, shrunk MAE, and 4% call rate? 3) What does the Composite Risk Index produce, and how many tracks does it have?
Every question has one verifiable answer in the document, so this is a clean test of retrieval, not opinion.
| Model | Persona | Strength |
|---|---|---|
| Gemini 3.1 Pro | Document analyst | Fastest responder (14.7s). |
| Kimi K2.6 | Concise explainer | Shortest, sharpest answer. |
| Qwen3.5 Plus | Numbers auditor | Most thorough, quotes the source verbatim. |
Kimi was slow (71 seconds) but the most precise. It gave the formula and the constant exactly:
score = ( n · adjusted MAE + k · field mean ) / ( n + k ), k = 4
It pulled the Market Links scorecard straight from page 4 — Grade A+, Raw MAE 1.67, Shrunk MAE 0.97 — in a tidy table. Shortest response in the run, and the most accurate.
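To see what the formula does, here is a small sketch with k = 4 as the default. The input values in the usage lines are invented for illustration; the article does not report n or the field mean for Market Links, so this does not attempt to reproduce the 0.97 figure.

```python
def shrunk_score(n: int, adjusted_mae: float, field_mean: float, k: float = 4.0) -> float:
    """Shrink a pollster's adjusted MAE toward the field mean.

    Implements score = (n * adjusted_mae + k * field_mean) / (n + k).
    With few polls (small n) the score sits near the field mean;
    with many polls it approaches the pollster's own adjusted MAE.
    """
    if n < 0 or k < 0 or n + k == 0:
        raise ValueError("n and k must be non-negative and not both zero")
    return (n * adjusted_mae + k * field_mean) / (n + k)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the pull toward the field mean.
    print(shrunk_score(n=0, adjusted_mae=2.0, field_mean=1.0))
    print(shrunk_score(n=4, adjusted_mae=2.0, field_mean=1.0))
    print(shrunk_score(n=96, adjusted_mae=2.0, field_mean=1.0))
```

With n equal to k the score lands exactly halfway between the two inputs, which is one way to read k: the number of polls at which a pollster's own record counts as much as the field's.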
Qwen leaned into its auditor persona, opening with a "Numbers Auditor Report" and quoting the document verbatim for each figure:
Page 4: "GRADE A+ composite"
It confirmed k = 4, the A+/1.67/0.97 scorecard, and the Composite Risk Index's two tracks, citing pages throughout.
Gemini answered first and started well — it reproduced the shrinkage formula correctly. Then it drifted into fabrication. It reported the value of k as 25 (the document says 4), gave the Market Links example as Grade B/C, Raw MAE 3.42, Shrunk MAE 3.85, 82% (the document says A+, 1.67, 0.97, 90%), and claimed the Composite Risk Index had 3 tracks (it has 2). It even cited pages for these numbers — pages that do not contain them.
This is the failure mode RAG is supposed to prevent, and mostly does: a confident, well-formatted, wrong answer. The point of this article is what happened next.
Two low-cost judges — Gemini 3.5 Flash and Gemini 3.1 Flash-Lite — scored the anonymized answers. They were not told which model wrote what.

| Answer | Overall | Accuracy |
|---|---|---|
| Kimi K2.6 🏆 | 9.4 | 10/10 |
| Gemini 3.1 Flash-Lite (synthesis) | 9.1 | 10/10 |
| Qwen3.5 Plus | 8.8 | 10/10 |
| Gemini 3.1 Pro | 2.8 | 2/10 |
The judges did exactly what you would hope. The two document-grounded answers and the synthesis scored 8.8 or higher; Gemini 3.1 Pro's fabricated figures earned it 2.8 overall and 2/10 on accuracy. Both judges agreed independently: they scored Gemini 3.1 Pro 3.2 and 2.4, while each gave Kimi 9.4. The fabrication had nowhere to hide once the answers were judged against the citable source.
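The per-judge numbers are consistent with a plain average: 3.2 and 2.4 average to 2.8, and two 9.4s average to 9.4. The sketch below assumes that simple mean (the article does not state the aggregation rule) and adds a hypothetical cutoff for flagging an answer.

```python
from statistics import mean

# Per-judge overall scores reported in the article for two of the answers.
judge_scores = {
    "Gemini 3.1 Pro": [3.2, 2.4],
    "Kimi K2.6": [9.4, 9.4],
}

FLAG_BELOW = 5.0  # hypothetical cutoff, not a documented AI Crucible setting


def overall(scores: list[float]) -> float:
    """Average the judges' scores, rounded to one decimal place."""
    return round(mean(scores), 1)


def flagged(all_scores: dict[str, list[float]]) -> list[str]:
    """Names of answers whose averaged score falls below the cutoff."""
    return [name for name, s in all_scores.items() if overall(s) < FLAG_BELOW]


if __name__ == "__main__":
    for name, scores in judge_scores.items():
        print(name, overall(scores))
    print("flagged:", flagged(judge_scores))
```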
The arbiter (Gemini 3.1 Flash-Lite) read all three answers and produced a single, fully page-cited result — and it sided with the document, not with the loudest model:
### 1) Polling-Accuracy Leaderboard Score Computation
- Exact Shrinkage Formula: score = ( n · adjusted MAE + k · field mean ) / ( n + k ) (Page 3)
- Value of k: 4 (Page 3)
### 2) Market Links Worked Example
- Grade: A+ composite (Page 4)
- Raw MAE: 1.67 (Page 4)
- Shrunk MAE: 0.97 (Page 4)
- 4% call rate: 90% (Page 4)
### 3) Composite Risk Index
The Composite Risk Index produces one election-level 0–100 headline score that
summarizes process-integrity signals (Page 2, Page 8). It is built in 2 separated tracks:
- Integrity track: 5 vote-weighted signals averaged into the headline (Page 8).
- Context track: 5 signals shown alongside, never folded into the score (Page 8).
Every figure here matches the source, and every figure carries a page you can open and check.
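Checking a citation by hand takes seconds, and the same check is easy to script. The sketch below is a rough illustration: the page text is placeholder content seeded only with figures quoted in this article, the pages attached to the fabricated claims are assumed (the article does not say which pages Gemini cited), and a substring match is cruder than a real verifier would want (for example, "1.6" would match "1.67").

```python
# Placeholder page text: invented for this sketch, seeded only with figures
# the article quotes. The real pages contain far more than this.
PAGES = {
    3: "shrinkage constant k = 4",
    4: "GRADE A+ composite. Raw MAE 1.67. Shrunk MAE 0.97.",
}


def verify(claims: list[tuple[str, int]]) -> list[tuple[str, int, bool]]:
    """For each (figure, cited_page) pair, report whether the figure appears on that page."""
    results = []
    for figure, page in claims:
        text = PAGES.get(page, "")  # an unknown page supports nothing
        results.append((figure, page, figure in text))
    return results


# Figures from the grounded answers, with the pages the synthesis cites.
supported = [("k = 4", 3), ("1.67", 4), ("0.97", 4)]
# Figures from the fabricated answer; the page numbers here are assumptions.
fabricated = [("k = 25", 3), ("3.42", 4), ("3.85", 4)]

if __name__ == "__main__":
    for figure, page, ok in verify(supported + fabricated):
        status = "found" if ok else "NOT FOUND"
        print(f"{figure!r} on page {page}: {status}")
```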
| Metric | Gemini 3.1 Pro | Kimi K2.6 | Qwen3.5 Plus |
|---|---|---|---|
| Cost | $0.026 | $0.013 | $0.016 |
| Time | 14.7s | 70.8s | 46.6s |
| Response length | ~2,700 chars | ~1,360 | ~4,140 |
| Eval (overall) | 2.8 | 9.4 | 8.8 |
Similarity matrix
| Pair | Similarity |
|---|---|
| Kimi ↔ Qwen | 0.91 |
| Gemini ↔ Kimi | 0.84 |
| Gemini ↔ Qwen | 0.83 |
The two correct answers were the most similar to each other (0.91); Gemini's outlier sat lower against both. Aggregate agreement landed at 86%.
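The 86% figure is consistent with a simple mean of the three pairwise similarities, and the same matrix singles out the outlier. The sketch below assumes that mean (the article does not define how aggregate agreement is computed) and ranks each model by its average similarity to the other two.

```python
from statistics import mean

# Pairwise similarities from the table above.
similarity = {
    ("Kimi", "Qwen"): 0.91,
    ("Gemini", "Kimi"): 0.84,
    ("Gemini", "Qwen"): 0.83,
}


def aggregate_agreement(sim: dict[tuple[str, str], float]) -> float:
    """Mean of all pairwise similarities (assumed definition of aggregate agreement)."""
    return mean(sim.values())


def mean_similarity_per_model(sim: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average similarity of each model to every other model."""
    per_model: dict[str, list[float]] = {}
    for (a, b), value in sim.items():
        per_model.setdefault(a, []).append(value)
        per_model.setdefault(b, []).append(value)
    return {name: mean(values) for name, values in per_model.items()}


if __name__ == "__main__":
    print(round(aggregate_agreement(similarity), 2))
    per_model = mean_similarity_per_model(similarity)
    print(min(per_model, key=per_model.get))  # the model least similar to the rest
```

Low similarity alone does not prove an answer is wrong, since a lone correct model would also look like an outlier; it is a cue to check the citations.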
🏆 Kimi K2.6 takes it — slowest, shortest, and exactly right, with clean page citations. Qwen3.5 Plus was a close second and the most rigorous about quoting the source.
The real lesson is about the system, not the winner. Retrieval-augmented PDFs make a long document queryable and cheap to reason over — but they do not make any single model infallible. Gemini 3.1 Pro had the same indexed text and still invented numbers. What saved the answer was the ensemble (two models retrieved correctly), the judges (they scored the fabrication 2/10), and the page citations (you can verify every figure in seconds).
Strategic takeaway: for documents you actually rely on, do not trust a lone model's confident summary. Run a panel, demand page citations, and let the judges flag the outlier. The cost of that safety net here was about eleven cents.
Using only this document, answer these and cite the page number for every
figure: [your questions].
Explore the run: Read the full panel, the judges' scores, and the raw model outputs in the Shared Chat Session.