Benchmarks
The tests behind the rankings. Each card explains what the benchmark measures and how scores are oriented.
By industry
Who leads each professional domain, scored across its benchmarks. Mappings are proposals, since a benchmark can inform several industries. Open an industry for the full breakdown.
Software Engineering
2 benchmarks. Shipping working code against real repositories: bug fixes, feature patches, and competitive programming under tests.
Investment Banking
3 benchmarks. Financial analysis over filings and credit agreements: valuation math, document QA, and the quantitative reasoning behind deals.
Corporate Law
3 benchmarks. Legal reasoning (issue spotting, rule application, and contract analysis) plus the broad knowledge a generalist counsel needs.
Medicine
3 benchmarks. Clinical knowledge and diagnostic reasoning, including the medical coding accuracy and science depth that real practice demands.
Scientific Research
3 benchmarks. Frontier problem solving: graduate-level science, the hardest multi-domain exams, and broad expert knowledge.
Management Consulting
3 benchmarks. Broad analytical reasoning across business domains: structured problem solving over wide-ranging knowledge.
Accounting & Audit
3 benchmarks. Numerically exact work over tax and financial documents: reconciliation, controls, and the arithmetic discipline audits demand.
Language Models
GPQA Diamond
Graduate-level, Google-proof science questions (physics, chemistry, biology) written by domain experts to resist web lookup.
Humanity's Last Exam
A broad, extremely difficult exam across math, humanities, and science designed to remain unsaturated by frontier models. Reported here without external tools.
SWE-bench Verified
A human-validated subset of real GitHub issues; the model must produce a patch that passes the repository's tests. Figures are vendor-reported unless noted.
LiveCodeBench
Contamination-resistant competitive-programming problems collected over time to avoid training-set overlap.
MMLU-Pro
A harder, cleaned-up successor to MMLU spanning 14 subject areas, with 10-option multiple choice and reasoning-heavy items.
LMArena Elo
Crowd-sourced pairwise preference rating from blind head-to-head chats (LMArena, formerly LMSYS). Unbounded; ~1000 is the historical anchor.
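To make the pairwise rating concrete, here is a minimal sketch of the classic Elo update for a single blind comparison. It is illustrative only: the leaderboard's actual fitting procedure may differ, and the step size `k` and the 400-point scale are assumptions taken from the textbook Elo formulation, not from this page.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head comparison.

    outcome_a: 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k: step size; 32 is an illustrative assumption.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the sum of the two ratings is preserved.
    return rating_a + delta, rating_b - delta
```

Two models that both start at the 1000 anchor have an expected score of 0.5 each, so a single win moves the winner up and the loser down by half of `k`.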
AA Intelligence Index
Artificial Analysis Intelligence Index (v4.0) — an independent composite across ~10 evaluations (incl. GPQA Diamond, HLE, Terminal-Bench, SciCode, GDPval, τ²-Bench). The de-facto cross-model standard; higher is better. Shown for reference and normalized relative to this set.
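The page does not say how scores are "normalized relative to this set". One common reading is min-max scaling within the displayed models, flipped for metrics where lower is better. The sketch below assumes that reading; the function name `normalize_within_set` and the model names are hypothetical.

```python
def normalize_within_set(scores: dict, higher_is_better: bool = True) -> dict:
    """Min-max scale raw scores to [0, 1] relative to the models in `scores`.

    Assumption: the page's actual normalization is not documented; this is
    one plausible scheme. For lower-is-better metrics the scale is flipped
    so that 1.0 always marks the best model in the set.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie, so every model is "best in set".
        return {name: 1.0 for name in scores}
    scaled = {name: (value - lo) / (hi - lo) for name, value in scores.items()}
    if not higher_is_better:
        scaled = {name: 1.0 - value for name, value in scaled.items()}
    return scaled
```

Note that under this scheme a normalized value depends on which models are in the set, so it is not comparable across pages that list different models.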
Corporate Finance
Vals AI CorpFin v2 — expert-built questions over long-context corporate credit agreements; an independent, in-house-run finance benchmark.
LegalBench
Legal-reasoning task suite (originated by Stanford CodeX), run independently by Vals AI and reported as overall accuracy across tasks.
TaxEval
Vals AI TaxEval v2 — 1,500+ expert-written tax questions, scored on overall accuracy. Independent, in-house-run.
Medical Coding
Vals AI MedCode — accuracy of ICD-10-CM diagnosis coding for the medical billing process. Independent, expert-built dataset.
World Models
Physion++ (emerging)
Predict whether physical scenarios resolve as expected (will objects collide, fall, or stay stable?). Probes intuitive physics.
EK-100 Anticipation
EPIC-Kitchens-100 long-term action anticipation — forecast the next actions in egocentric video. Reported as mean top-5 recall.
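As a rough illustration of the metric, the sketch below computes top-k recall per class and then averages over classes. It assumes the "mean" is taken over classes, so rare actions count as much as frequent ones. The function name and input layout are hypothetical, not taken from the benchmark's evaluation code.

```python
from collections import defaultdict


def class_mean_topk_recall(true_labels, ranked_predictions, k=5):
    """Average, over classes, of the fraction of instances whose true label
    appears among the model's top-k ranked predictions.

    true_labels: one ground-truth label per instance.
    ranked_predictions: per instance, labels sorted from most to least likely.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, ranked in zip(true_labels, ranked_predictions):
        totals[label] += 1
        if label in ranked[:k]:
            hits[label] += 1
    if not totals:
        raise ValueError("no instances to score")
    # Per-class recall first, then an unweighted mean across classes.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

Averaging per class rather than per instance keeps a model from scoring well by only anticipating the most common actions.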
Video Prediction (FVD)
Fréchet Video Distance between predicted and ground-truth future frames. Measures rollout realism — lower is better.
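The distance itself can be sketched as the Fréchet distance between two Gaussians fitted to feature vectors of predicted and ground-truth clips. The sketch assumes the features are already extracted by some pretrained video network, a step not shown here. The function name is hypothetical and the code uses NumPy.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features), assumed to be
    precomputed video features (feature extraction is out of scope here).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs. Clipping guards against round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Identical feature sets give a distance near zero, and the value grows as the means or covariances drift apart, which is why lower is better.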
Embodied Planning
Success rate of model-based planning / imagined rollouts on embodied control and navigation tasks.
World Consistency
Geometric and temporal consistency of generated/imagined worlds (object permanence, 3D coherence under camera motion).
