Coding
Dimensions weighted (high → low): Real-world SWE · Multi-language · Basic Code Gen · Algorithmic Reasoning · Speed / Context. Exact weights are proprietary.
Opus 4 launched May 22 at 72.5% SWE-bench, immediately consensus #1. R.E.X weights Real-world SWE and Multi-language most heavily, so it picks Opus 4.
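A minimal sketch of how a weighted composite over the Coding dimensions could drive a pick. The weights here are invented placeholders that only respect the stated high → low ordering (the real weights are proprietary), and the two models and their per-dimension scores are hypothetical, not benchmark results.

```python
# Hypothetical weights that only respect the stated high -> low ordering.
# The actual R.E.X weights are proprietary and are not reproduced here.
CODING_WEIGHTS = {
    "real_world_swe": 0.35,
    "multi_language": 0.25,
    "basic_code_gen": 0.20,
    "algorithmic_reasoning": 0.15,
    "speed_context": 0.05,
}


def composite(scores, weights):
    """Weighted sum of per-dimension scores (each on a 0-100 scale)."""
    return sum(weights[dim] * scores[dim] for dim in weights)


def pick(candidates, weights):
    """Return the candidate name with the highest weighted composite."""
    return max(candidates, key=lambda name: composite(candidates[name], weights))


# Made-up scores for two placeholder models, for illustration only.
candidates = {
    "model_a": {"real_world_swe": 72, "multi_language": 70, "basic_code_gen": 85,
                "algorithmic_reasoning": 80, "speed_context": 60},
    "model_b": {"real_world_swe": 62, "multi_language": 68, "basic_code_gen": 90,
                "algorithmic_reasoning": 85, "speed_context": 90},
}

print(pick(candidates, CODING_WEIGHTS))
```

With these placeholder numbers, model_b has the higher unweighted average, but model_a wins the composite because the two top-weighted dimensions dominate. That is the kind of divergence from consensus the notes describe.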
Community split: Gemini is #1 on WebDev, Opus 4 is #1 on SWE-bench. R.E.X leans hardest on real-world SWE → picks Opus 4. Consensus favors Gemini because the community is WebDev-heavy.
Grok 4 community-lauded at launch (72-75% SWE community estimate, #1 LiveCodeBench). R.E.X requires verified SWE-bench submission → sticks with Opus 4 (verified 72.5%).
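The verified-submission rule can be sketched as an eligibility filter applied before ranking. This is an illustration only: the `Entry` type and `pick_verified` function are hypothetical names, and the Grok 4 figure is the upper end of the community estimate quoted above, not a verified score.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    model: str
    swe_bench: float  # percent
    verified: bool    # True only for a verified SWE-bench submission


def pick_verified(entries):
    """Return the top-scoring model among verified entries, or None if there are none."""
    eligible = [e for e in entries if e.verified]
    if not eligible:
        return None
    return max(eligible, key=lambda e: e.swe_bench).model


entries = [
    Entry("Grok 4", 75.0, verified=False),  # upper end of the community estimate
    Entry("Opus 4", 72.5, verified=True),
]

print(pick_verified(entries))
```

Filtering before ranking means an unverified score never wins, however high it is.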
GPT-5 Aug 7: 74.9% SWE-bench + 88% Aider Polyglot + #1 LMArena. R.E.X picks GPT-5 — highest weighted composite that month.
Sonnet 4.5 launched Sep 29 at 77.2% SWE-bench — highest public. R.E.X formula picks Sonnet 4.5.
Sonnet 4.5 is #1 on SWE-bench at 77.2%. GPT-5 is #1 on Aider Polyglot at 88%, but the 4.x series is not yet on the Aider board. R.E.X picks Sonnet 4.5.
Opus 4.5 launched Nov 24 at 80.9% SWE-bench — first model above 80%. R.E.X picks Opus 4.5.
Opus 4.5 holds 80.9% SWE-bench through December. Gemini 3 Flash launched Dec 17 at 78%. R.E.X picks Opus 4.5.
No new frontier launches in January. Opus 4.5 retains #1 at 80.9%. R.E.X picks Opus 4.5.
Opus 4.6 launched Feb 5 at 80.84% SWE-bench, forming a three-way cluster with Opus 4.5 (80.9%) and Gemini 3.1 Pro (80.6%). Opus 4.6 gets the edge as the newer model with a higher Terminal-Bench score. R.E.X picks Opus 4.6.
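One way to make the "cluster" idea concrete is to treat every model within a small margin of the top score as tied, then apply a tie-break. The sketch below covers only the cluster-detection step. The 0.5-point margin and the function name `top_cluster` are assumptions, and the tie-break inputs (Terminal-Bench scores) are not given in these notes, so they are left out.

```python
def top_cluster(scores, margin=0.5):
    """Return models whose score is within `margin` points of the best score."""
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if best - s <= margin)


# SWE-bench figures as quoted in the notes; the margin is an assumed threshold.
swe_bench = {
    "Opus 4.5": 80.9,
    "Opus 4.6": 80.84,
    "Gemini 3.1 Pro": 80.6,
    "Gemini 3 Flash": 78.0,
}

print(top_cluster(swe_bench))
```

A secondary criterion would then choose among the returned names.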
GPT-5.4 launched Mar 5 with strong Aider scores but Opus 4.6 retains SWE-bench Verified #1 through March. R.E.X picks Opus 4.6.
Opus 4.7 launched Apr 16 at 87.6% SWE-bench Verified + 64.3% SWE-bench Pro — top on both. LMArena Code #1, BenchLM #2, AA #1, Vellum #1. R.E.X picks Opus 4.7.
Writing
Dimensions weighted (high → low): Creative Prose · Instruction Following · Tone / Style Control · Long-form Coherence. Exact weights are proprietary.
Opus 4 launched May 22 — immediately #2 EQ-Bench CW v3 and community writing favorite. R.E.X weights Creative Prose and Instruction Following most heavily — picks Opus 4.
Opus 4 remains the community favorite. EQ-Bench CW v3 is being re-scored with Sonnet 4 as the judge. R.E.X picks Opus 4.
Horizon-alpha took EQ-Bench CW v3 #1 in late July but was unreleased/unidentified publicly. R.E.X can't pick an unknown checkpoint — picks Opus 4 (verified #2 and community #1 for identified models).
GPT-5 revealed as Horizon-alpha lineage on Aug 7. Takes EQ-Bench CW v3 #1. R.E.X picks GPT-5.
Sonnet 4.5 launched Sep 29, immediately tops BOTH EQ-Bench evals. Community consensus shifts. R.E.X picks Sonnet 4.5.
Sonnet 4.5 holds the top spot on both EQ-Bench CW v3 and Longform. Polaris-alpha is an unidentified checkpoint. R.E.X picks Sonnet 4.5.
Sonnet 4.5 is still Longform #1 and the community consensus through most of November. Opus 4.5 launched Nov 24, too late to dominate the month. R.E.X picks Sonnet 4.5.
Opus 4.5 at 1737.2 Elo on live EQ-Bench CW v3. R.E.X picks Opus 4.5.
No new writing leader in January. Opus 4.5 retains. R.E.X picks Opus 4.5.
Opus 4.6 launched Feb 5; Sonnet 4.6 Feb 17. Both top-tier on EQ-Bench. Opus 4.6 edges Sonnet by ~25 Elo. R.E.X picks Opus 4.6.
GPT-5.4 launched Mar 5 and immediately tops EQ-Bench CW v3 at 1991.7. R.E.X weights Instruction Following heavily, and GPT-5.4 leads there (96% vs Sonnet 95%), but Sonnet 4.6 holds Arena CW Elo #1. Community split. R.E.X picks Sonnet 4.6 (narrowly).
GPT-5.4 holds EQ-Bench CW v3 #1 at 1991.7. Opus 4.7 launched Apr 16, takes LMArena CW #1 at 1499 Elo. R.E.X weights EQ-Bench heavily in Creative Prose but Sonnet 4.6 has highest Sonnet-family scores. R.E.X picks Sonnet 4.6.
RAG
Dimensions weighted (high → low): Deep-context Retrieval · Faithfulness · Standards / Comprehension · Mid-context Retrieval · Speed / Cost. Exact weights are proprietary.
Gemini 2.5 Pro dominated with 1M context, top NIAH, top Vectara. R.E.X picks Gemini 2.5 Pro.
Gemini 2.5 Pro GA on June 17 with stable release. Retains long-context dominance. R.E.X picks Gemini 2.5 Pro.
Grok 4 launched Jul 9 with 256K context claim but vision/reasoning focus. Gemini 2.5 Pro still dominant for long-context RAG. R.E.X picks Gemini 2.5 Pro.
GPT-5 Aug 7: 45% fewer hallucinations than GPT-4o, #1 LMArena. For general RAG tasks under 200K tokens, GPT-5 #1. For 200K+ tasks Gemini 2.5 Pro. R.E.X weights general RAG higher → picks GPT-5.
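The split described for August can be written as a simple rule keyed on context size. This is a sketch of that one note, not of the R.E.X formula: the function name `rag_leader` is hypothetical, and the 200K-token boundary is taken directly from the note.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # tokens; boundary stated in the August note


def rag_leader(context_tokens):
    """Return the model the August note names as #1 for a given context size."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    if context_tokens < LONG_CONTEXT_THRESHOLD:
        return "GPT-5"           # general RAG tasks under 200K tokens
    return "Gemini 2.5 Pro"      # 200K+ token tasks


print(rag_leader(50_000))
print(rag_leader(800_000))
```

Because R.E.X weights the general (under 200K) case higher, the single monthly pick resolves to GPT-5 even though the long-context branch points elsewhere.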
Gemini 2.5 Pro retains long-context top. Sonnet 4.5 launched Sep 29 with improved RAG but too late. R.E.X picks Gemini 2.5 Pro.
Gemini 2.5 Pro still #1 on MRCR@1M and Vectara FaithJudge. R.E.X picks Gemini 2.5 Pro.
Gemini 3 Pro launched in November, but its MRCR@1M collapsed to 26% vs Opus 4.5 at 76%. The advertised 1M context did not hold. R.E.X picks Gemini 2.5 Pro (still measured best); consensus picks the newer Gemini 3 Pro on marketing.
Community catches up to measurement: Opus 4.5 now recognized as RAG #1 after MRCR evidence. R.E.X picks Opus 4.5.
Opus 4.5 continues leading deep-context. Gemini 3.1 Pro not yet released. R.E.X picks Opus 4.5.
Opus 4.6 launched Feb 5 with 1M context GA. MRCR@1M 78.3% vs Opus 4.5 76%. R.E.X picks Opus 4.6.
Opus 4.6 retains MRCR@1M top. LMArena Document Arena launches mid-March with Opus 4.6 #1. R.E.X picks Opus 4.6.
Opus 4.7 launched Apr 16 — LMArena Document Arena #1 at 1521 Elo, MMLU-Pro #1. MRCR@1M regressed to 32.2% but overall RAG composite strong. R.E.X picks Opus 4.7.
Vision
Dimensions weighted (high → low): Core Vision / OCR · Diagram Reasoning · Multi-modal RAG. Exact weights are proprietary.
Gemini 2.5 Pro 0506 swept LMArena text+vision+WebDev by ~70 Elo. MMMU 84%. R.E.X picks Gemini 2.5 Pro.
Gemini 2.5 Pro 0605 update maintains Vision Arena #1. R.E.X picks Gemini 2.5 Pro.
Grok 4 launched Jul 9 but not vision-focused. Gemini 2.5 Pro retains. R.E.X picks Gemini 2.5 Pro.
Statistical tie: GPT-5 MMMU 84.2% vs Gemini 84.0%. Roboflow: Gemini far ahead on object detection (13.3 vs 1.5 mAP). R.E.X picks Gemini 2.5 Pro.
The Polymarket market for Sep 2025 ($2.9M) resolved in favor of Google. R.E.X picks Gemini 2.5 Pro.
Sonnet 4.5 extends vision lead post-launch. DocVQA 96.1%. R.E.X picks Sonnet 4.5.
Gemini 3 Pro launched Nov with MMMU-Pro improvements. R.E.X picks Gemini 3 Pro.
Gemini 3 Flash launched Dec 17 with strong vision/cost. Gemini 3 Pro retains top absolute performance. R.E.X picks Gemini 3 Pro.
No new frontier vision launches. Gemini 3 Pro retains. R.E.X picks Gemini 3 Pro.
Opus 4.6 Feb 5: DocVQA 96.1%, ChartQA 93.4%. Gemini 3.1 Pro, launched Feb 19, is strong on MMMU-Pro. R.E.X picks Opus 4.6 (the R.E.X Vision formula weights DocVQA more heavily).
GPT-5.4 Mar 5 strong on MMMU-Pro (81.2%) but Opus 4.6 leads DocVQA/ChartQA. R.E.X picks Opus 4.6.
Opus 4.7 launched Apr 16 and takes LMArena Vision Arena #1 at 1307 Elo, but Opus 4.6 retains the DocVQA/ChartQA lead, and those carry key R.E.X Vision weights. R.E.X picks Opus 4.6; Opus 4.7 is #2 on R.E.X.