R.E.X Dashboard
Live AI model rankings
Top 5 · R.E.X Rankings

Top 5 AI Models for Coding

The best-performing models right now — scored by our proprietary R.E.X system across real-world benchmarks, hallucination rate, latency and cost.

What “Coding” covers: writing, editing, refactoring and debugging code

Real engineering work — solving GitHub issues, editing large files, working across multiple languages, and following through in a terminal. Pick this if you want a coding copilot, an autonomous agent, or a model to embed in a dev tool.

Highlights: Best R.E.X Score (pts) · Best Value (R.E.X / $) · Fastest (tok/s)

Full Top 5 Leaderboard

Sorted by R.E.X score, descending
Columns: # · Model · R.E.X · Price (input/output per 1M tokens) · Speed · Context · Hallucination · Vendor
Unlock the full Top 20 leaderboard with R.E.X Pro
Full rankings, historical score trends, CSV exports, and price-drop alerts — $9/mo.

Cost vs R.E.X

Lower-left = budget · Upper-right = premium

Score Components

Stacked by category weight

Speed vs R.E.X

Throughput vs performance trade-off

Release Timeline

When each model launched, with its current R.E.X score

Vendor R.E.X Trajectory · Last 6 months

Aggregated vendor score over time — models come and go, vendors endure
Vendor Trend Analysis
Track how each lab's best R.E.X score evolves month-over-month — Anthropic vs OpenAI vs Google vs xAI vs DeepSeek and more. Spot the vendors gaining momentum.

Price-drop & new-model alerts

Emails you when any Top 5 model changes
Real-time Model Alerts
Get emailed the minute a new frontier model lands or a top-5 model drops its price. Never miss a shift.

R.E.X Frontier Index · The NASDAQ of AI capability

One unbounded number tracking the frontier of AI capability over time — Oct 20, 2025 = 100
Track the frontier of AI, year over year
While R.E.X Rank tells you the best model today, the Frontier Index shows how far the industry has moved since launch — composited from unbounded axes like task horizon, effective context, cost-per-task, and agentic reliability. Anchored at 100 on Oct 20, 2025, free to climb past 500, 1000, and beyond as AI keeps advancing.
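
The exact composition is unpublished, but here is one way an unbounded, anchored index can be built, purely as a sketch: treat each axis as a ratio to its Oct 20, 2025 value and combine the ratios with a geometric mean, so the launch-day frontier scores exactly 100 and there is no ceiling. The axis names and baseline numbers below are illustrative assumptions, not R.E.X's real inputs.

import math

# Hypothetical Frontier Index sketch. Axis names and baseline values are
# illustrative assumptions; R.E.X's actual composition is unpublished.
BASELINE = {                              # assumed frontier values, Oct 20, 2025
    "task_horizon_hours": 1.5,            # longest task completed reliably
    "effective_context_tokens": 130_000,  # measured, not advertised, context
    "tasks_per_dollar": 40.0,             # inverse of cost-per-task
    "agentic_reliability": 0.62,          # multi-step task success rate
}

def frontier_index(frontier_now: dict[str, float]) -> float:
    """Geometric mean of axis ratios vs. the launch-day baseline, times 100."""
    ratios = [frontier_now[k] / BASELINE[k] for k in BASELINE]
    return 100.0 * math.prod(ratios) ** (1.0 / len(ratios))

print(frontier_index(dict(BASELINE)))                           # 100.0 on launch day
print(frontier_index({k: 2 * v for k, v in BASELINE.items()}))  # 200.0, no cap

A geometric mean is one natural choice here: it keeps a single exploding axis from dragging the index up the way an arithmetic mean would, while leaving the index unbounded.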
Coming soon to R.E.X Pro
🏆 Overall Pick · 💰 Best Value · ⚡ Fastest
Does your team use AI daily?
R.E.X Pro gives you full rankings, alerts, and an ad-free view for $9/mo. Cancel anytime.

Why R.E.X?

There are dozens of AI model leaderboards online — LMArena, SWE-bench, EQ-Bench, Vectara, MRCR, Artificial Analysis, vendor blogs, benchmark papers, Reddit threads. No two agree. Each one captures a single slice of a model's behaviour, and picking “the best” model means cross-referencing all of them, weighting what actually matters for your use case, and ignoring the marketing noise. That's a full-time job. R.E.X does it for you.

One universal score: we combine 20+ independent benchmark sources into a single composite per use case — so you can compare apples to apples in one place.
Weighted for real work: each use case gets its own formula that rewards production-grade benchmarks (SWE-bench Pro, MRCR@128k, EQ-Bench) over toy tests.
Validated, not asserted: retrospectively tested across 48 month × use-case cases, R.E.X matched the research community's #1 pick 85.4% of the time and was in the top 2 97.9% of the time.
Anchored index: Oct 20, 2025 = 100 is the launch-day reference. Scores float above or below as models advance or fall behind — no artificial 100-point cap.

Key Features

Anchored index: Oct 20, 2025 = 100 baseline. Scores float above or below the launch-day frontier as models advance or fall behind.
Continuously researched: daily automated review cross-references 20+ sources. New benchmarks, price drops and model launches roll in as they land.
Validated, not asserted: 85.4% exact / 97.9% top-2 across 48 retrospective month × use-case tests spanning 12 months. Numbers published in full.
Challenges consensus: built to disagree with the crowd when the data disagrees. Advertised specs lose to measured behaviour every time.
Outlier-capped composite: no single saturated benchmark can dominate — each component is capped at 2.0× the Oct 2025 reference before weighting (see the sketch under “How R.E.X Works”).
Production-focused: prioritises benchmarks that predict real work (SWE-bench Pro, MRCR@128k, Vectara HHEM) over toy tests.

🔍 How R.E.X Works — The Science Behind the Scores

R.E.X is a proprietary composite index built to answer one question:

“Which model is actually best for this use case — not just which one people are talking about this week?”

R.E.X is designed to challenge consensus where the data disagrees. When a vendor advertises a 1M-token context window but measured long-context retrieval collapses to 26%, we trust the measurement — not the marketing. The goal isn't to echo leaderboard chatter; it's to surface what the data actually shows.

How the score is built

Every R.E.X score is a weighted composite across multiple measurement dimensions, anchored to the October 20, 2025 launch-day snapshot. A model's score climbs above 100 when it beats the launch-day frontier, and falls below when it lags. Weights are tuned per use case.

Task-grounded benchmarks · Community evaluation · Context & retrieval quality · Reasoning & reliability · Speed, cost & availability → Per-use-case weighting (proprietary) → R.E.X Score

Anchored at 100 on Oct 20, 2025. Unbounded below, capped in practice above so no single saturated benchmark can dominate.
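
As a minimal sketch of how an anchored, outlier-capped composite can be assembled (the real dimensions, weights, and data pipeline are proprietary; the 2.0× cap is the figure quoted under Key Features, and every other number below is an illustrative assumption):

# Hypothetical sketch of the anchored, outlier-capped composite.
# Dimension names, weights, and values are illustrative assumptions,
# not R.E.X's real (proprietary) formula.

REFERENCE = {                    # launch-day frontier values, Oct 20, 2025
    "task_benchmarks": 0.62,     # e.g. SWE-bench-style resolve rate
    "community_eval": 1270.0,    # e.g. arena-style rating
    "context_retrieval": 0.71,   # e.g. long-context retrieval score
    "reasoning_reliability": 0.68,
    "speed_cost": 0.55,
}

CODING_WEIGHTS = {               # per-use-case weights (sum to 1.0)
    "task_benchmarks": 0.40,
    "community_eval": 0.10,
    "context_retrieval": 0.20,
    "reasoning_reliability": 0.20,
    "speed_cost": 0.10,
}

CAP = 2.0                        # cap each component at 2.0x its reference

def rex_score(measured: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted composite anchored so the Oct 20, 2025 frontier scores 100."""
    total = 0.0
    for dim, w in weights.items():
        ratio = measured[dim] / REFERENCE[dim]  # 1.0 == launch-day frontier
        total += w * min(ratio, CAP)            # saturated benchmarks can't dominate
    return 100.0 * total

model = {"task_benchmarks": 0.74, "community_eval": 1335.0,
         "context_retrieval": 0.76, "reasoning_reliability": 0.80,
         "speed_cost": 0.60}
print(round(rex_score(model, CODING_WEIGHTS), 1))  # 114.1: beats the frontier

A model whose measurements exactly match the reference scores 100 by construction; beating the frontier on heavily weighted dimensions pushes the score above it, and the cap ensures no single runaway benchmark can carry a weak model.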

science The formula is a living thing

We're constantly retuning R.E.X internally — the weights, the math, the benchmarks we trust, even which dimensions make the cut. As new benchmarks emerge (and old ones saturate), as new models ship, and as the state of the art moves, we rerun the retrospective validation and update the formula. What you see today is the best version today; next month it may be different. That's by design.

R.E.X tells you what's actually best — not what people are choosing to call best.


📊 Validation Results — 12 Months, 48 Retrospective Tests

R.E.X isn't just theory — we stress-tested it. For every month from May 2025 through April 2026, we identified the model the research community converged on as best for each of four core use cases, then asked R.E.X to make the same pick from the data available at the time. No peeking at future results.

85.4% · Exact match with community consensus (41 / 48 cases)
97.9% · R.E.X pick was #1 or #2 in the community ranking (47 / 48)
48 · Retrospective tests across 12 months × 4 use cases

Per use case: Coding 10/12 exact (100% top-2) · Writing 9/12 (91.7% top-2) · RAG 11/12 (100% top-2) · Vision 11/12 (100% top-2).
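
In code terms, the backtest reduces to a point-in-time loop over all 48 cells. The two callables below are hypothetical stand-ins, not a real R.E.X API: rex_pick(month, use_case, k) would return R.E.X's top-k picks using only data available at that month, and consensus_pick(month, use_case) the community's #1 for that month.

from typing import Callable

MONTHS = [f"2025-{m:02d}" for m in range(5, 13)] + \
         [f"2026-{m:02d}" for m in range(1, 5)]      # May 2025 .. Apr 2026
USE_CASES = ["coding", "writing", "rag", "vision"]

def backtest(rex_pick: Callable, consensus_pick: Callable) -> tuple[float, float]:
    """Return (exact-match rate, top-2 rate) over all month x use-case cells."""
    exact = top2 = total = 0
    for month in MONTHS:
        for use_case in USE_CASES:
            picks = rex_pick(month, use_case, k=2)   # point-in-time data only
            truth = consensus_pick(month, use_case)  # community's #1 that month
            exact += picks[0] == truth
            top2 += truth in picks
            total += 1
    return exact / total, top2 / total               # published: 85.4%, 97.9%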

What the misses tell us

Seven cases were misses — six of those were still top-2. They aren't errors. Consensus reflects what people choose, shaped by hype cycles, marketing, and chat-UI availability. R.E.X reflects what the data actually shows. When they diverge, two patterns show up:

  • Advertised > measured. Gemini 3 Pro shipped with a 1M-token context window, but measured long-context retrieval collapsed to 26% while Claude Opus held 76%. Consensus picked the marketing headline; R.E.X picked the measurement.
  • Unverified > verified. Grok 4 was community-lauded at launch but lacked a verified benchmark submission for months. R.E.X stuck with the verified number.

An 85% match confirms the consensus where the data agrees. The 15% where R.E.X pushes back is the whole reason it exists.

Every case, broken down in full
Per-month consensus picks vs. R.E.X picks across all 48 cases.
See the full breakdown