Benchmarks

Investment Banking

Financial analysis over filings and credit agreements — valuation math, document QA, and the quantitative reasoning behind deals.

Corporate Law

Legal reasoning — issue spotting, rule application, and contract analysis — plus the broad knowledge a generalist counsel needs.

Medicine

Clinical knowledge and diagnostic reasoning, including the medical coding accuracy and science depth that real practice demands.

Scientific Research

Frontier problem solving — graduate-level science, the hardest multi-domain exams, and broad expert knowledge.

DeepSeek V4-Pro · 80.5

Claude Opus 4.8 · 75.1

DeepSeek V4-Flash · 72.9

Claude Opus 4.7 · 72.5

Gemini 3.1 Pro · 70.0

Management Consulting

Broad analytical reasoning across business domains — structured problem solving over wide-ranging knowledge.

DeepSeek V4-Pro · 80.5

Claude Opus 4.8 · 75.1

DeepSeek V4-Flash · 72.9

Claude Opus 4.7 · 72.5

Gemini 3.1 Pro · 70.0

Accounting & Audit

Numerically exact work over tax and financial documents — reconciliation, controls, and the arithmetic discipline audits demand.

Language Models

Reasoning · 23 results

GPQA Diamond

Graduate-level, Google-proof science questions (physics, chemistry, biology) written by domain experts to resist web lookup.

Frontier · 22 results

Humanity's Last Exam

A broad, extremely difficult exam across math, humanities, and science designed to remain unsaturated by frontier models. Reported here without external tools.

Agentic Coding · 20 results

SWE-bench Verified

A human-validated subset of real GitHub issues; the model must produce a patch that passes the repository's tests. Figures are vendor-reported unless noted.

Coding · 13 results

LiveCodeBench

Contamination-resistant competitive-programming problems collected over time to avoid training-set overlap.

Knowledge · 10 results

MMLU-Pro

A harder, cleaned-up successor to MMLU spanning 14 subject areas, with up to 10 answer options per question and more reasoning-heavy items.

Human Preference · 25 results

LMArena Elo

Crowd-sourced pairwise preference rating from blind head-to-head chats (LMArena, formerly LMSYS). Unbounded; ~1000 is the historical anchor.

elo · higher is better
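To make the rating scale concrete, here is a minimal sketch of the classic Elo model for pairwise comparisons. It is illustrative only: the function names, the 400-point scale, and the K-factor of 32 are conventional assumptions, not a description of how LMArena computes its published ratings.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models that both start at the 1000 anchor; A wins one vote.
    print(elo_update(1000.0, 1000.0, 1.0))
```

Because the update is zero-sum, the average rating stays at the starting anchor, which is why ratings are read relative to each other rather than against a fixed ceiling.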

Composite · 31 results

AA Intelligence Index

Artificial Analysis Intelligence Index (v4.0): an independent composite across ~10 evaluations (incl. GPQA Diamond, HLE, Terminal-Bench, SciCode, GDPval, τ²-Bench). Widely used as a cross-model reference; higher is better. Shown for reference and normalized relative to this set.

pts · higher is better
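The page does not say how scores are normalized relative to the displayed set. One plausible reading is min-max scaling, sketched below. The function name `normalize_to_set`, the 0 to 100 output range, and the sample values are all hypothetical.

```python
def normalize_to_set(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to 0..100 within the given set (assumed method)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie: no spread to scale, so place everyone at the top.
        return {name: 100.0 for name in scores}
    return {name: 100.0 * (value - lo) / (hi - lo) for name, value in scores.items()}


if __name__ == "__main__":
    # Hypothetical index values, not taken from the leaderboard.
    print(normalize_to_set({"model_a": 50.0, "model_b": 60.0, "model_c": 70.0}))
```

Under this scheme a normalized value depends on which models are in the set, so it should not be compared across pages that list different models.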

Finance · 26 results

Corporate Finance

Vals AI CorpFin v2 — expert-built questions over long-context corporate credit agreements; an independent, in-house-run finance benchmark.

Law · 24 results

LegalBench

Legal-reasoning task suite (originated by Stanford CodeX), run independently by Vals AI and reported as overall accuracy across tasks.

Tax & Accounting · 25 results

TaxEval

Vals AI TaxEval v2 — 1,500+ expert-written tax questions, scored on overall accuracy. Independent, in-house-run.

Medicine · 21 results

Medical Coding

Vals AI MedCode — accuracy of ICD-10-CM diagnosis coding for the medical billing process. Independent, expert-built dataset.

World Models

emerging

Motion Understanding · 1 result

Something-Something v2

Action recognition over ~220k short clips of everyday object interactions; rewards genuine temporal/motion understanding over appearance. Reported as top-1 accuracy from an attentive probe on frozen features.

Action Anticipation · 1 result

EPIC-Kitchens-100 Anticipation

Action anticipation on egocentric kitchen video: forecast the (verb, noun) action one second before it happens. Reported as mean recall@5.

recall@5 · higher is better
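As a rough illustration of the metric, the sketch below computes a class-averaged recall@5: for each ground-truth class, the fraction of its examples whose true label appears in the model's top five predictions, averaged over classes. It assumes "mean recall@5" means this class-mean variant. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def mean_recall_at_k(ranked_preds: list[list[str]], labels: list[str], k: int = 5) -> float:
    """Class-averaged recall@k.

    ranked_preds[i] lists predicted classes for example i, best first.
    labels[i] is the ground-truth class for example i.
    """
    if not labels or len(ranked_preds) != len(labels):
        raise ValueError("need equal-length, non-empty predictions and labels")
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for preds, label in zip(ranked_preds, labels):
        totals[label] += 1
        if label in preds[:k]:
            hits[label] += 1
    # Average per-class recall so that rare actions count as much as frequent ones.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    preds = [["cut", "peel"], ["stir", "pour"], ["wash", "rinse"]]
    truth = ["cut", "cut", "wash"]
    print(mean_recall_at_k(preds, truth))
```

Averaging per class rather than per example keeps a long tail of rare actions from being swamped by a few common ones.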

Video Understanding · 1 result

Perception Test

A diagnostic video-QA benchmark probing memory, abstraction, physics, and semantics across real-world videos. Reported as multiple-choice accuracy from a video model aligned with an LLM.