R.E.X Validation
12 months · 48 test cases
Retrospective validation

Every receipt, every month. R.E.X matched consensus 85% of the time.

For each of the 12 months from May 2025 to April 2026, we asked a simple question across four use cases: which AI model did the research community actually converge on? Then we asked R.E.X to make that same pick from the benchmark data available that month. Here's every case and every miss — no hidden math.

Exact match: 41/48 — 85.4% (R.E.X #1 = Consensus #1)
Top-2 match: 47/48 — 97.9% (R.E.X pick was #1 or #2)
Period: 12 months (May 2025 → April 2026)
Independent sources per pick: 3+ (Vendor blogs · LMArena · SWE-bench · EQ-Bench · Vectara · community)

Methodology

Consensus #1: For each month × use case, the model the research community converged on — from 3+ independent sources. No single leaderboard gets veto power.
R.E.X pick: Reconstructed from benchmarks available that month, using the locked v3 formula. No peeking at future data.
Exact match: R.E.X #1 equals Consensus #1 — a full agreement.
Top-2 match: R.E.X #1 is either Consensus #1 or #2. The right neighborhood, if not the exact pick.
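
To make the two metrics concrete, here is a minimal sketch of how the exact and top-2 rates can be computed from a list of monthly records. The `Case` structure and its field names are illustrative, not R.E.X internals.

```python
from dataclasses import dataclass

@dataclass
class Case:
    """One month × use-case record (illustrative fields, not R.E.X internals)."""
    rex_pick: str         # model R.E.X ranked #1 that month
    consensus: list[str]  # community ranking, best first

# Two sample cases taken from the Coding cards below.
cases = [
    Case("Claude Opus 4", ["Claude Opus 4", "Gemini 2.5 Pro", "Claude Sonnet 4"]),
    Case("Claude Opus 4", ["Gemini 2.5 Pro 0605", "Claude Opus 4", "o3"]),
]

exact = sum(c.rex_pick == c.consensus[0] for c in cases)  # full agreement
top2 = sum(c.rex_pick in c.consensus[:2] for c in cases)  # right neighborhood

print(f"Exact: {exact}/{len(cases)} ({exact / len(cases):.1%})")
print(f"Top-2: {top2}/{len(cases)} ({top2 / len(cases):.1%})")
```

Run over all 48 cases, the same two counters produce the 41/48 and 47/48 figures reported above.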

Per use case

Use case | Exact match    | Top-2 match
Coding   | 10/12 · 83.3%  | 12/12 · 100.0%
Writing  | 9/12 · 75.0%   | 11/12 · 91.7%
RAG      | 11/12 · 91.7%  | 12/12 · 100.0%
Vision   | 11/12 · 91.7%  | 12/12 · 100.0%
Overall  | 41/48 · 85.4%  | 47/48 · 97.9%

Coding

Exact: 10/12 · 83.3%
Top-2: 12/12 · 100.0%

Dimensions weighted (high → low): Real-world SWE · Multi-language · Basic Code Gen · Algorithmic Reasoning · Speed / Context. Exact weights are proprietary.
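
The pick itself is a weighted composite over those dimensions. Since the exact v3 weights are proprietary, the sketch below uses invented placeholder weights and scores purely to show the mechanism: normalize each dimension to 0-1, take the weighted sum, pick the argmax.

```python
# Placeholder weights in the stated high → low order; the real v3 weights
# are proprietary, so these values are invented for illustration.
WEIGHTS = {
    "real_world_swe": 0.35,
    "multi_language": 0.25,
    "basic_codegen": 0.20,
    "algorithmic": 0.12,
    "speed_context": 0.08,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted sum over benchmark scores normalized to the 0-1 range."""
    return sum(w * scores.get(dim, 0.0) for dim, w in WEIGHTS.items())

# Invented example scores; these are not real benchmark numbers.
candidates = {
    "Model A": {"real_world_swe": 0.81, "multi_language": 0.74,
                "basic_codegen": 0.90, "algorithmic": 0.77, "speed_context": 0.60},
    "Model B": {"real_world_swe": 0.72, "multi_language": 0.80,
                "basic_codegen": 0.88, "algorithmic": 0.82, "speed_context": 0.70},
}

pick = max(candidates, key=lambda m: composite(candidates[m]))
print(pick)  # "Model A": the heavy real-world SWE weight decides it
```

The same mechanism applies to the Writing, RAG, and Vision formulas below, each with its own dimension list and weights.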

Coding · May 2025 · Exact
Consensus: #1 Claude Opus 4 · #2 Gemini 2.5 Pro · #3 Claude Sonnet 4
R.E.X pick: Claude Opus 4

Opus 4 launched May 22 at 72.5% SWE-bench, immediately consensus #1. R.E.X weights Real-world SWE and Multi-language most heavily — picks Opus 4.

Coding · Jun 2025 · Top-2
Consensus: #1 Gemini 2.5 Pro 0605 · #2 Claude Opus 4 · #3 o3
R.E.X pick: Claude Opus 4

Community split: Gemini WebDev #1, Opus 4 SWE-bench #1. R.E.X leans hardest on real-world SWE → picks Opus 4. Consensus favors Gemini, reflecting a WebDev-heavy community.

Coding · Jul 2025 · Top-2
Consensus: #1 Grok 4 · #2 Claude Opus 4 · #3 Gemini 2.5 Pro
R.E.X pick: Claude Opus 4

Grok 4 was community-lauded at launch (72-75% community-estimated SWE-bench, #1 LiveCodeBench). R.E.X requires a verified SWE-bench submission → sticks with Opus 4 (verified 72.5%).

Coding · Aug 2025 · Exact
Consensus: #1 GPT-5 · #2 Claude Opus 4.1 · #3 Grok 4
R.E.X pick: GPT-5

GPT-5 Aug 7: 74.9% SWE-bench + 88% Aider Polyglot + #1 LMArena. R.E.X picks GPT-5 — highest weighted composite that month.

Coding · Sep 2025 · Exact
Consensus: #1 Claude Sonnet 4.5 · #2 GPT-5 · #3 Gemini 2.5 Pro
R.E.X pick: Claude Sonnet 4.5

Sonnet 4.5 launched Sep 29 at 77.2% SWE-bench — the highest public score. R.E.X formula picks Sonnet 4.5.

Coding · Oct 2025 · Exact
Consensus: #1 Claude Sonnet 4.5 · #2 Claude Opus 4.1 · #3 GPT-5
R.E.X pick: Claude Sonnet 4.5

Sonnet 4.5 SWE-bench 77.2% #1. Aider Polyglot has GPT-5 #1 at 88%, but the Claude 4.x series is not yet on the Aider board. R.E.X picks Sonnet 4.5.

Coding · Nov 2025 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Claude Sonnet 4.5 · #3 GPT-5.1-Codex-Max
R.E.X pick: Claude Opus 4.5

Opus 4.5 launched Nov 24 at 80.9% SWE-bench — first model above 80%. R.E.X picks Opus 4.5.

Coding · Dec 2025 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Gemini 3 Flash · #3 Claude Sonnet 4.5
R.E.X pick: Claude Opus 4.5

Opus 4.5 holds 80.9% SWE-bench through December. Gemini 3 Flash launched Dec 17 at 78%. R.E.X picks Opus 4.5.

Coding · Jan 2026 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Gemini 3 Flash · #3 Claude Sonnet 4.5
R.E.X pick: Claude Opus 4.5

No new frontier launches in January. Opus 4.5 retains #1 at 80.9%. R.E.X picks Opus 4.5.

Coding · Feb 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 Claude Opus 4.5 · #3 Gemini 3.1 Pro
R.E.X pick: Claude Opus 4.6

Opus 4.6 launched Feb 5 at 80.84% SWE-bench. Three-way cluster with Opus 4.5 (80.9%) and Gemini 3.1 Pro (80.6%). Opus 4.6 edges ahead as the newer model with the higher Terminal-Bench score. R.E.X picks Opus 4.6.

Coding · Mar 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 Gemini 3.1 Pro · #3 GPT-5.4
R.E.X pick: Claude Opus 4.6

GPT-5.4 launched Mar 5 with strong Aider scores but Opus 4.6 retains SWE-bench Verified #1 through March. R.E.X picks Opus 4.6.

Coding · Apr 2026 · Exact
Consensus: #1 Claude Opus 4.7 · #2 Claude Opus 4.6 · #3 GPT-5.4
R.E.X pick: Claude Opus 4.7

Opus 4.7 launched Apr 16 at 87.6% SWE-bench Verified + 64.3% SWE-bench Pro — top on both. LMArena Code #1, BenchLM #2, AA #1, Vellum #1. R.E.X picks Opus 4.7.


Writing

Exact: 9/12 · 75.0%
Top-2: 11/12 · 91.7%

Dimensions weighted (high → low): Creative Prose · Instruction Following · Tone / Style Control · Long-form Coherence. Exact weights are proprietary.

Writing · May 2025 · Exact
Consensus: #1 Claude Opus 4 · #2 Claude Sonnet 4 · #3 Gemini 2.5 Pro
R.E.X pick: Claude Opus 4

Opus 4 launched May 22 — immediately #2 EQ-Bench CW v3 and community writing favorite. R.E.X weights Creative Prose and Instruction Following most heavily — picks Opus 4.

Writing · Jun 2025 · Exact
Consensus: #1 Claude Opus 4 · #2 Claude Sonnet 4 · #3 Gemini 2.5 Pro
R.E.X pick: Claude Opus 4

Opus 4 remains the community favorite; EQ-Bench CW v3 is being re-scored with a Sonnet 4 judge. R.E.X picks Opus 4.

Writing · Jul 2025 · Top-2
Consensus: #1 Horizon-alpha (GPT-5 preview) · #2 Claude Opus 4 · #3 Kimi K2-Instruct
R.E.X pick: Claude Opus 4

Horizon-alpha took EQ-Bench CW v3 #1 in late July but was unreleased and publicly unidentified. R.E.X can't pick an unknown checkpoint — picks Opus 4 (verified #2 overall and community #1 among identified models).

Writing · Aug 2025 · Exact
Consensus: #1 GPT-5 · #2 Claude Opus 4.1 · #3 Kimi K2-Instruct
R.E.X pick: GPT-5

GPT-5 revealed as Horizon-alpha lineage on Aug 7. Takes EQ-Bench CW v3 #1. R.E.X picks GPT-5.

Writing · Sep 2025 · Exact
Consensus: #1 Claude Sonnet 4.5 · #2 GPT-5 · #3 Claude Opus 4.1
R.E.X pick: Claude Sonnet 4.5

Sonnet 4.5 launched Sep 29 and immediately tops both EQ-Bench evals. Community consensus shifts. R.E.X picks Sonnet 4.5.

Writing · Oct 2025 · Exact
Consensus: #1 Claude Sonnet 4.5 · #2 Polaris-alpha · #3 Kimi K2-Instruct
R.E.X pick: Claude Sonnet 4.5

Sonnet 4.5 holds the top spot on both EQ-Bench CW v3 and Longform. Polaris-alpha is an unidentified checkpoint. R.E.X picks Sonnet 4.5.

Writing · Nov 2025 · Exact
Consensus: #1 Claude Sonnet 4.5 · #2 Claude Opus 4.5 · #3 Polaris-alpha
R.E.X pick: Claude Sonnet 4.5

Sonnet 4.5 is still Longform #1 and the community consensus through most of November. Opus 4.5 launched Nov 24, too late to dominate the month. R.E.X picks Sonnet 4.5.

Writing · Dec 2025 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Polaris-alpha · #3 Claude Sonnet 4.5
R.E.X pick: Claude Opus 4.5

Opus 4.5 at 1737.2 Elo on live EQ-Bench CW v3. R.E.X picks Opus 4.5.

Writing · Jan 2026 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Claude Sonnet 4.5 · #3 GPT-5.1
R.E.X pick: Claude Opus 4.5

No new writing leader in January. Opus 4.5 retains. R.E.X picks Opus 4.5.

Writing · Feb 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 Claude Sonnet 4.6 · #3 Gemini 3 Pro
R.E.X pick: Claude Opus 4.6

Opus 4.6 launched Feb 5; Sonnet 4.6 Feb 17. Both top-tier on EQ-Bench. Opus 4.6 edges Sonnet by ~25 Elo. R.E.X picks Opus 4.6.

Writing · Mar 2026 · Top-2
Consensus: #1 GPT-5.4 · #2 Claude Sonnet 4.6 · #3 Claude Opus 4.6
R.E.X pick: Claude Sonnet 4.6

GPT-5.4 launched Mar 5 and immediately tops EQ-Bench CW v3 at 1991.7. R.E.X weights Instruction Following heavily, where GPT-5.4 leads (96% vs Sonnet's 95%). But Sonnet 4.6 holds Arena CW Elo #1. Community split. R.E.X picks Sonnet 4.6 (narrowly).

Writing · Apr 2026 · Miss
Consensus: #1 GPT-5.4 · #2 Claude Opus 4.7 · #3 Claude Sonnet 4.6
R.E.X pick: Claude Sonnet 4.6

GPT-5.4 holds EQ-Bench CW v3 #1 at 1991.7. Opus 4.7 launched Apr 16 and takes LMArena CW #1 at 1499 Elo. R.E.X weights EQ-Bench heavily in Creative Prose, but Sonnet 4.6 still posts the highest Sonnet-family scores and wins the composite. R.E.X picks Sonnet 4.6, the one outright miss.


RAG

Exact: 11/12 · 91.7%
Top-2: 12/12 · 100.0%

Dimensions weighted (high → low): Deep-context Retrieval · Faithfulness · Standards / Comprehension · Mid-context Retrieval · Speed / Cost. Exact weights are proprietary.

RAG · May 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Opus 4 · #3 GPT-4.1
R.E.X pick: Gemini 2.5 Pro

Gemini 2.5 Pro dominated with 1M context, top NIAH, top Vectara. R.E.X picks Gemini 2.5 Pro.

RAG · Jun 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Opus 4 · #3 GPT-4.1
R.E.X pick: Gemini 2.5 Pro

Gemini 2.5 Pro went GA with a stable release on June 17 and retains long-context dominance. R.E.X picks Gemini 2.5 Pro.

RAG · Jul 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Opus 4 · #3 Grok 4
R.E.X pick: Gemini 2.5 Pro

Grok 4 launched Jul 9 with a 256K-context claim but a vision/reasoning focus. Gemini 2.5 Pro is still dominant for long-context RAG. R.E.X picks Gemini 2.5 Pro.

RAG · Aug 2025 · Exact
Consensus: #1 GPT-5 · #2 Gemini 2.5 Pro · #3 Claude Opus 4.1
R.E.X pick: GPT-5

GPT-5 Aug 7: 45% fewer hallucinations than GPT-4o, #1 LMArena. For general RAG tasks under 200K tokens, GPT-5 is #1; for 200K+ tasks, Gemini 2.5 Pro leads. R.E.X weights general RAG higher → picks GPT-5.

RAG · Sep 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Sonnet 4.5 · #3 GPT-5
R.E.X pick: Gemini 2.5 Pro

Gemini 2.5 Pro retains the long-context top spot. Sonnet 4.5 launched Sep 29 with improved RAG, but too late in the month. R.E.X picks Gemini 2.5 Pro.

RAG · Oct 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Sonnet 4.5 · #3 GPT-5
R.E.X pick: Gemini 2.5 Pro

Gemini 2.5 Pro still #1 on MRCR@1M and Vectara FaithJudge. R.E.X picks Gemini 2.5 Pro.

RAG · Nov 2025 · Top-2
Consensus: #1 Gemini 3 Pro · #2 Gemini 2.5 Pro · #3 Claude Opus 4.5
R.E.X pick: Gemini 2.5 Pro

Gemini 3 Pro launched in November, but its MRCR@1M collapsed to 26% vs Opus 4.5 at 76%. The advertised 1M context did not hold up. R.E.X picks Gemini 2.5 Pro (still the best measured); consensus picks the newer Gemini 3 Pro on marketing.

RAG · Dec 2025 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Gemini 3 Pro · #3 Gemini 2.5 Pro
R.E.X pick: Claude Opus 4.5

Community catches up to measurement: Opus 4.5 now recognized as RAG #1 after MRCR evidence. R.E.X picks Opus 4.5.

RAG · Jan 2026 · Exact
Consensus: #1 Claude Opus 4.5 · #2 Gemini 2.5 Pro · #3 Gemini 3 Pro
R.E.X pick: Claude Opus 4.5

Opus 4.5 continues to lead deep-context retrieval. Gemini 3.1 Pro is not yet released. R.E.X picks Opus 4.5.

RAG · Feb 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 Claude Opus 4.5 · #3 Gemini 3.1 Pro
R.E.X pick: Claude Opus 4.6

Opus 4.6 launched Feb 5 with 1M context GA. MRCR@1M 78.3% vs Opus 4.5 76%. R.E.X picks Opus 4.6.

RAG · Mar 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 Gemini 3.1 Pro · #3 GPT-5.4
R.E.X pick: Claude Opus 4.6

Opus 4.6 retains MRCR@1M top. LMArena Document Arena launches mid-March with Opus 4.6 #1. R.E.X picks Opus 4.6.

RAG · Apr 2026 · Exact
Consensus: #1 Claude Opus 4.7 · #2 Claude Opus 4.6 · #3 Gemini 3.1 Pro
R.E.X pick: Claude Opus 4.7

Opus 4.7 launched Apr 16 — LMArena Document Arena #1 at 1521 Elo, MMLU-Pro #1. MRCR@1M regressed to 32.2%, but the overall RAG composite stays strong. R.E.X picks Opus 4.7.


Vision

Exact: 11/12 · 91.7%
Top-2: 12/12 · 100.0%

Dimensions weighted (high → low): Core Vision / OCR · Diagram Reasoning · Multi-modal RAG. Exact weights are proprietary.

Vision · May 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Sonnet 4 · #3 GPT-4o
R.E.X pick: Gemini 2.5 Pro

Gemini 2.5 Pro 0506 swept LMArena text+vision+WebDev by ~70 Elo. MMMU 84%. R.E.X picks Gemini 2.5 Pro.

Vision · Jun 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Claude Opus 4 · #3 GPT-4o
R.E.X pick: Gemini 2.5 Pro

Gemini 2.5 Pro 0605 update maintains Vision Arena #1. R.E.X picks Gemini 2.5 Pro.

Vision · Jul 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 Grok 4 · #3 Claude Opus 4
R.E.X pick: Gemini 2.5 Pro

Grok 4 launched Jul 9 but not vision-focused. Gemini 2.5 Pro retains. R.E.X picks Gemini 2.5 Pro.

Vision · Aug 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 GPT-5 · #3 Grok 4
R.E.X pick: Gemini 2.5 Pro

Statistical tie: GPT-5 MMMU 84.2% vs Gemini 84.0%. Roboflow: Gemini far ahead on object detection (13.3 vs 1.5 mAP). R.E.X picks Gemini 2.5 Pro.

Vision · Sep 2025 · Exact
Consensus: #1 Gemini 2.5 Pro · #2 GPT-5 · #3 Claude Opus 4.1
R.E.X pick: Gemini 2.5 Pro

Polymarket resolved in Google's favor for Sep 2025 (a $2.9M market). R.E.X picks Gemini 2.5 Pro.

Vision · Oct 2025 · Exact
Consensus: #1 Claude Sonnet 4.5 · #2 Gemini 2.5 Pro · #3 GPT-5
R.E.X pick: Claude Sonnet 4.5

Sonnet 4.5 extends vision lead post-launch. DocVQA 96.1%. R.E.X picks Sonnet 4.5.

Vision · Nov 2025 · Exact
Consensus: #1 Gemini 3 Pro · #2 Claude Sonnet 4.5 · #3 GPT-5
R.E.X pick: Gemini 3 Pro

Gemini 3 Pro launched Nov with MMMU-Pro improvements. R.E.X picks Gemini 3 Pro.

Vision · Dec 2025 · Exact
Consensus: #1 Gemini 3 Pro · #2 Gemini 3 Flash · #3 Claude Opus 4.5
R.E.X pick: Gemini 3 Pro

Gemini 3 Flash launched Dec 17 with a strong vision-to-cost ratio. Gemini 3 Pro retains the top absolute performance. R.E.X picks Gemini 3 Pro.

Vision · Jan 2026 · Exact
Consensus: #1 Gemini 3 Pro · #2 Claude Opus 4.5 · #3 Gemini 3 Flash
R.E.X pick: Gemini 3 Pro

No new frontier vision launches. Gemini 3 Pro retains. R.E.X picks Gemini 3 Pro.

Vision · Feb 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 Gemini 3.1 Pro · #3 Gemini 3 Pro
R.E.X pick: Claude Opus 4.6

Opus 4.6 (Feb 5): DocVQA 96.1%, ChartQA 93.4%. Gemini 3.1 Pro (Feb 19) is strong on MMMU-Pro. R.E.X picks Opus 4.6; the Vision formula weights DocVQA more heavily.

Vision · Mar 2026 · Exact
Consensus: #1 Claude Opus 4.6 · #2 GPT-5.4 · #3 Gemini 3.1 Pro
R.E.X pick: Claude Opus 4.6

GPT-5.4 Mar 5 strong on MMMU-Pro (81.2%) but Opus 4.6 leads DocVQA/ChartQA. R.E.X picks Opus 4.6.

Vision · Apr 2026 · Top-2
Consensus: #1 Claude Opus 4.7 · #2 Claude Opus 4.6 · #3 GPT-5.4
R.E.X pick: Claude Opus 4.6

Opus 4.7 (Apr 16) takes LMArena Vision Arena #1 at 1307 Elo. But Opus 4.6 retains the DocVQA/ChartQA lead — the key R.E.X Vision weights. R.E.X picks Opus 4.6; Opus 4.7 is #2 on R.E.X.

The 7 misses — honest framing

Six of R.E.X's seven misses were still top-2. That's not noise, that's calibration. Each one has a specific, documented reason in the cards above.

In every miss but one, R.E.X's pick was measurably justified by a verifiable benchmark — even when community opinion had shifted to a newer or louder release. That's the whole point of the formula.

Why 85%, not 100%?

A 100% match rate would mean R.E.X is just re-describing what you could already read in a tweet thread. An 85% exact match with 98% top-2, across 48 independently researched cases, means R.E.X is adding signal — confirming consensus where the data agrees, and pushing back where it doesn't.

Two patterns show up in the disagreements: advertised beats measured (Gemini 3 Pro shipped with a 1M-token window, but MRCR@1M collapsed to 26% while Claude Opus held 76%), and unverified beats verified (Grok 4 was community-lauded at launch, but lacked a verified SWE-bench submission — R.E.X sticks with the verified number).
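
Both patterns reduce to the same gate: a model only enters the ranking on a dimension once it has a verified, published score. A minimal sketch of that rule, with invented field names:

```python
# Illustrative eligibility gate: community estimates are ignored; only a
# verified, published benchmark score makes a model rankable. The field
# names here are invented for this sketch, not R.E.X internals.
def rankable(model: dict) -> bool:
    return model.get("swe_bench_verified") is not None

models = [
    {"name": "Incumbent", "swe_bench_verified": 72.5},
    {"name": "Launch-week model", "swe_bench_verified": None,
     "community_estimate": "72-75%"},
]

eligible = [m["name"] for m in models if rankable(m)]
print(eligible)  # ['Incumbent']: the hyped launch waits for a verified score
```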

R.E.X tells you what's actually best — not what people are choosing to call best.