Analyze Large PDFs: Page-Cited Search Across Long Documents

A 200-page report does not fit comfortably in a prompt. Stuff it in whole and you pay for every page, slow the run down, and bury the relevant passage in noise. AI Crucible now treats large PDFs the way a careful researcher would: index the document once, then search it for exactly what each question needs, with page numbers behind every claim.

We put it to work on a real document, and the run told a sharper story than we expected. One frontier model fabricated the numbers even with the PDF indexed. The ensemble, the page citations, and a panel of judges caught it.

Time to read: 9-11 minutes.

Session cost: Approx. $0.057 to run (three models + a Gemini Flash-Lite arbiter); about $0.11 including the evaluation.

| Parameter | Value |
| --- | --- |
| Strategy | Expert Panel |
| Rounds | 1 |
| Web Search | Disabled (document-only) |
| Arbiter | Gemini 3.1 Flash-Lite |
| Models | Gemini 3.1 Pro, Kimi K2.6, Qwen3.5 Plus |
| Attachment | An 11-page, 3.6 MB statistical-methods PDF |

What changes for a large PDF?

Small PDFs are unchanged. Anything under 1.5 MB still goes straight into the model. Above that threshold, AI Crucible switches to retrieval.

When you attach a large PDF, we:

  1. Extract its text page by page, preserving page boundaries.
  2. Split each page into overlapping chunks of about 2,400 characters.
  3. Embed every chunk with text-embedding-3-small and index them.
  4. Remove the raw PDF from the prompt and hand the models search tools instead.

The result: the document never floods the context window, and the models retrieve only the passages a given question needs.
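
To make the first two steps concrete, here is a minimal sketch of threshold routing and page-preserving chunking. It is an illustration, not AI Crucible's code: the 1.5 MB threshold and the roughly 2,400-character chunk size come from the description above, while the overlap size, the byte definition of a megabyte, and every function and class name are assumptions.

```python
from dataclasses import dataclass

# From the article: 1.5 MB threshold, ~2,400-character chunks.
# Assumed: 1 MB = 10**6 bytes, and a 300-character overlap.
SIZE_THRESHOLD_BYTES = 1_500_000
CHUNK_CHARS = 2_400
OVERLAP_CHARS = 300


@dataclass
class Chunk:
    page: int   # 1-based page number the text came from
    start: int  # character offset within that page
    text: str


def needs_retrieval(size_bytes: int) -> bool:
    """Route files above the threshold to indexing instead of inlining."""
    return size_bytes > SIZE_THRESHOLD_BYTES


def chunk_pages(pages: list[str], size: int = CHUNK_CHARS,
                overlap: int = OVERLAP_CHARS) -> list[Chunk]:
    """Split each page into overlapping windows; chunks never span pages."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[Chunk] = []
    for page_no, text in enumerate(pages, start=1):
        start = 0
        while start < len(text):
            chunks.append(Chunk(page_no, start, text[start:start + size]))
            if start + size >= len(text):
                break  # this window already reached the end of the page
            start += step
    return chunks
```

Because each chunk records its page number, any passage retrieved later can be cited by page without a second lookup.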


What tools do the models get?

Three tools appear automatically when a large PDF is in the run. They are scoped to the documents indexed in that run, so a model cannot reach into other conversations.

| Tool | What it does |
| --- | --- |
| pdf_search | Semantic search across the document. Returns ranked passages with page numbers. |
| pdf_list_docs | Lists the large PDFs attached to the conversation, with their IDs. |
| pdf_get_pages | Returns the verbatim text of specific pages (up to 10 per call) for exact quotes. |

A typical flow: the model runs pdf_search to find candidate passages, then pdf_get_pages to read the exact wording before it commits to a number. Every figure it reports can carry a page citation.
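
The search-then-read flow can be sketched with a toy in-memory index. Only the three tool names and the 10-page limit come from the table above; the argument names, return shapes, and the `RunIndex` class are hypothetical, and a bag-of-words cosine score stands in for real embedding search so the example runs without any external service.

```python
import math
import re
from collections import Counter

MAX_PAGES_PER_CALL = 10  # limit stated in the tool description


def _vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class RunIndex:
    """Holds only the documents attached to one run (hypothetical shape)."""

    def __init__(self, docs: dict[str, list[str]]):
        self.docs = docs  # doc_id -> list of page texts

    def pdf_list_docs(self) -> list[str]:
        return sorted(self.docs)

    def pdf_search(self, query: str, top_k: int = 3) -> list[dict]:
        q = _vector(query)
        hits = [
            {"doc_id": doc_id, "page": n, "score": _cosine(q, _vector(text))}
            for doc_id, pages in self.docs.items()
            for n, text in enumerate(pages, start=1)
        ]
        hits = [h for h in hits if h["score"] > 0]
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]

    def pdf_get_pages(self, doc_id: str, pages: list[int]) -> dict[int, str]:
        if len(pages) > MAX_PAGES_PER_CALL:
            raise ValueError(f"at most {MAX_PAGES_PER_CALL} pages per call")
        return {n: self.docs[doc_id][n - 1] for n in pages}


# Placeholder page texts, invented for the example.
index = RunIndex({"methods": ["intro and scope",
                              "the shrinkage formula and its constant",
                              "risk index tracks"]})
best = index.pdf_search("shrinkage formula")[0]               # find the page
exact = index.pdf_get_pages(best["doc_id"], [best["page"]])   # read it verbatim
```

Separating the ranked search from the verbatim read is what lets a model quote exact wording and attach the page number it actually read.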

Scanned or image-only PDFs have no extractable text, so they cannot be indexed. Those no longer fail silently — you get an explicit notice that the file had no readable text.
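
One simple way to detect that case is to check whether page-by-page extraction produced any text at all. The sketch below assumes the extracted pages arrive as a list of strings; the function name and the notice wording are illustrative, not the product's actual message.

```python
from typing import Optional


def readable_text_notice(filename: str, pages: list[str]) -> Optional[str]:
    """Return a user-facing notice when no page yields extractable text."""
    has_text = any(page.strip() for page in pages)
    if has_text:
        return None  # safe to chunk and index
    return (f"{filename}: no readable text found; "
            "scanned or image-only PDFs cannot be indexed.")
```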


The scenario: three analysts, one methods document

We attached a real 11-page PDF — "Statistical Methods Behind the Site," which documents how a Bulgarian-elections analytics site computes its published figures (polling-accuracy scores, Benford screens, risk indices). At 3.6 MB it crossed the threshold and was RAG-indexed.

We ran an Expert Panel: three models answer independently as distinct personas (a document analyst, a concise explainer, and a numbers auditor), then the arbiter synthesizes one answer.

The prompt: Using only this document, answer these and cite the page number for every figure: 1) How is the polling-accuracy leaderboard score computed? Give the exact shrinkage formula and the value of k. 2) In the Market Links worked example, what are the grade, raw MAE, shrunk MAE, and 4% call rate? 3) What does the Composite Risk Index produce, and how many tracks does it have?

Every question has one verifiable answer in the document, so this is a clean test of retrieval, not opinion.


The panel

| Model | Persona | Strength |
| --- | --- | --- |
| Gemini 3.1 Pro | Document analyst | Fastest responder (14.7s). |
| Kimi K2.6 | Concise explainer | Shortest, sharpest answer. |
| Qwen3.5 Plus | Numbers auditor | Most thorough, quotes the source verbatim. |

The answers — and the one that wasn't

Kimi K2.6

Kimi was slow (71 seconds) but the most precise. It gave the formula and the constant exactly:

score = ( n · adjusted MAE + k · field mean ) / ( n + k ), k = 4

It pulled the Market Links scorecard straight from page 4 — Grade A+, Raw MAE 1.67, Shrunk MAE 0.97 — in a tidy table. Shortest response in the run, and the most accurate.
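
To see what the shrinkage formula does, here is a small sketch that implements it as written, with k = 4 as the default. The input values are invented for illustration and are not taken from the PDF; the article also does not define n, which is presumably a sample count such as the number of polls scored.

```python
def shrunk_score(n: int, adjusted_mae: float, field_mean: float,
                 k: float = 4) -> float:
    """score = (n * adjusted MAE + k * field mean) / (n + k)."""
    if n < 0 or n + k <= 0:
        raise ValueError("n must be non-negative and n + k must be positive")
    return (n * adjusted_mae + k * field_mean) / (n + k)


# Hypothetical inputs: an adjusted MAE of 2.0 against a field mean of 3.0.
no_data = shrunk_score(0, 2.0, 3.0)    # no observations: the field mean
few = shrunk_score(4, 2.0, 3.0)        # n == k: halfway between the two
many = shrunk_score(36, 2.0, 3.0)      # large n: close to the adjusted MAE
```

With few observations the score is pulled toward the field mean; as n grows, the entity's own adjusted MAE dominates.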

Qwen3.5 Plus

Qwen leaned into its auditor persona, opening with a "Numbers Auditor Report" and quoting the document verbatim for each figure:

Page 4: "GRADE A+ composite"

It confirmed k = 4, the A+/1.67/0.97 scorecard, and the Composite Risk Index's two tracks, citing pages throughout.

Gemini 3.1 Pro

Gemini answered first and started well — it reproduced the shrinkage formula correctly. Then it drifted into fabrication. It reported the value of k as 25 (the document says 4), gave the Market Links example as Grade B/C, Raw MAE 3.42, Shrunk MAE 3.85, 82% (the document says A+, 1.67, 0.97, 90%), and claimed the Composite Risk Index had 3 tracks (it has 2). It even cited pages for these numbers — pages that do not contain them.

This is the failure mode RAG is supposed to prevent, and mostly does: a confident, well-formatted, wrong answer. The point of this article is what happened next.


The Council of AI Judges

Two low-cost judges — Gemini 3.5 Flash and Gemini 3.1 Flash-Lite — scored the anonymized answers. They were not told which model wrote what.

Evaluation scores: Kimi 9.4, synthesizer 9.1, Qwen 8.8, Gemini 3.1 Pro 2.8

| Answer | Overall | Accuracy |
| --- | --- | --- |
| Kimi K2.6 🏆 | 9.4 | 10/10 |
| Gemini 3.1 Flash-Lite (synthesis) | 9.1 | 10/10 |
| Qwen3.5 Plus | 8.8 | 10/10 |
| Gemini 3.1 Pro | 2.8 | 2/10 |

The judges did exactly what you would hope. The two document-grounded answers and the synthesis all scored 8.8 or higher; Gemini 3.1 Pro's fabricated figures earned it 2.8 overall and 2/10 on accuracy. The two judges scored independently and agreed: they gave Gemini 3.1 Pro 3.2 and 2.4, and gave Kimi 9.4 each. The fabrication had nowhere to hide once the answers were judged against the citable source.
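
The per-judge numbers line up with the overall scores if the overall is a plain average of the two judges. That averaging rule is an assumption (the article does not state it), but it reproduces the reported figures:

```python
from statistics import mean

# Per-judge scores quoted above; only these two models' breakdowns are given.
judge_scores = {
    "Gemini 3.1 Pro": [3.2, 2.4],
    "Kimi K2.6": [9.4, 9.4],
}

# Assumed aggregation: unweighted mean, rounded to one decimal place.
overall = {name: round(mean(s), 1) for name, s in judge_scores.items()}

# Gap between the two judges, as a rough measure of disagreement.
spread = {name: round(max(s) - min(s), 1) for name, s in judge_scores.items()}
```

Here `overall` comes out to 2.8 for Gemini 3.1 Pro and 9.4 for Kimi K2.6, matching the table.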


The synthesis got it right

The arbiter (Gemini 3.1 Flash-Lite) read all three answers and produced a single, fully page-cited result — and it sided with the document, not with the loudest model:

### 1) Polling-Accuracy Leaderboard Score Computation
- Exact Shrinkage Formula: score = ( n · adjusted MAE + k · field mean ) / ( n + k )  (Page 3)
- Value of k: 4  (Page 3)

### 2) Market Links Worked Example
- Grade: A+ composite  (Page 4)
- Raw MAE: 1.67  (Page 4)
- Shrunk MAE: 0.97  (Page 4)
- 4% call rate: 90%  (Page 4)

### 3) Composite Risk Index
The Composite Risk Index produces one election-level 0–100 headline score that
summarizes process-integrity signals (Page 2, Page 8). It is built in 2 separated tracks:
- Integrity track: 5 vote-weighted signals averaged into the headline (Page 8).
- Context track: 5 signals shown alongside, never folded into the score (Page 8).

Every figure here matches the source, and every figure carries a page you can open and check.
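
That check can also be automated. The sketch below takes each claimed figure and its cited page and tests whether the figure actually appears in that page's text. The page texts are stand-ins assembled from the figures quoted in this article, not the PDF's real wording, and all names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Claim:
    label: str
    value: str  # the figure exactly as the model reported it
    page: int   # the page the model cited


def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()


def value_on_page(value: str, page_text: str) -> bool:
    """True if the value appears as a standalone token, not inside a longer number."""
    pattern = rf"(?<![\w.]){re.escape(_normalize(value))}(?!\w|\.\d)"
    return re.search(pattern, _normalize(page_text)) is not None


def check_claims(pages: dict[int, str], claims: list[Claim]) -> dict[str, bool]:
    """Map each claim label to whether its cited page contains the value."""
    return {c.label: value_on_page(c.value, pages.get(c.page, "")) for c in claims}


# Stand-in page text for the example.
pages = {
    3: "Shrinkage constant k = 4.",
    4: "GRADE A+ composite. Raw MAE 1.67, shrunk MAE 0.97, 4% call rate 90%.",
}
grounded = check_claims(pages, [Claim("k", "4", 3), Claim("grade", "A+", 4),
                                Claim("raw MAE", "1.67", 4)])
fabricated = check_claims(pages, [Claim("k", "25", 3), Claim("raw MAE", "3.42", 4)])
```

A string match like this only shows that a figure is present on the cited page, not that it is used in the right sense, so it complements the judges and does not replace them.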


Performance

| Metric | Gemini 3.1 Pro | Kimi K2.6 | Qwen3.5 Plus |
| --- | --- | --- | --- |
| Cost | $0.026 | $0.013 | $0.016 |
| Time | 14.7s | 70.8s | 46.6s |
| Response length | ~2,700 chars | ~1,360 chars | ~4,140 chars |
| Eval (overall) | 2.8 | 9.4 | 8.8 |

Similarity matrix

| Pair | Similarity |
| --- | --- |
| Kimi ↔ Qwen | 0.91 |
| Gemini ↔ Kimi | 0.84 |
| Gemini ↔ Qwen | 0.83 |

The two correct answers were the most similar to each other (0.91); Gemini's outlier sat lower against both. Aggregate agreement landed at 86%.
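
The article does not say how the aggregate figure is computed, but a plain mean of the three pairwise similarities reproduces it. Under that assumption, the same numbers also single out the outlier: the model with the lowest average similarity to the others.

```python
from statistics import mean

# Pairwise similarities from the matrix above.
pairs = {
    ("Kimi", "Qwen"): 0.91,
    ("Gemini", "Kimi"): 0.84,
    ("Gemini", "Qwen"): 0.83,
}

# Assumed aggregate: unweighted mean of all pairs.
aggregate = round(mean(pairs.values()), 2)

# Average similarity of each model to the other two.
models = sorted({name for pair in pairs for name in pair})
per_model = {
    m: mean(score for pair, score in pairs.items() if m in pair)
    for m in models
}
outlier = min(per_model, key=per_model.get)
```

Here `aggregate` is 0.86 and `outlier` is Gemini, consistent with the reported 86% and with the judges' scores.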


The Verdict

🏆 Kimi K2.6 takes it — slowest, shortest, and exactly right, with clean page citations. Qwen3.5 Plus was a close second and the most rigorous about quoting the source.

The real lesson is about the system, not the winner. Retrieval-augmented PDFs make a long document queryable and cheap to reason over — but they do not make any single model infallible. Gemini 3.1 Pro had the same indexed text and still invented numbers. What saved the answer was the ensemble (two models retrieved correctly), the judges (they scored the fabrication 2/10), and the page citations (you can verify every figure in seconds).

Strategic takeaway: for documents you actually rely on, do not trust a lone model's confident summary. Run a panel, demand page citations, and let the judges flag the outlier. The cost of that safety net here was about eleven cents.


Try it yourself

  1. Open the dashboard and pick two or three models.
  2. Attach a text-based PDF larger than 1.5 MB.
  3. Choose the Expert Panel strategy.
  4. Paste the prompt below and run.

    Using only this document, answer these and cite the page number for every figure: [your questions].

Explore the run: Read the full panel, the judges' scores, and the raw model outputs in the Shared Chat Session.

Further Reading