Crosshair

Benchmarks

The tests behind the rankings. Each card explains what the benchmark measures and how scores are oriented.

By industry

Who leads each professional domain, scored across its benchmarks. The benchmark-to-industry mappings are proposals, and a single benchmark can inform several industries. Open an industry for the full breakdown.
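The page does not say how the per-benchmark results are combined into one industry score. As a minimal sketch, assuming an unweighted mean over benchmarks that already share a 0-100 scale, the aggregation and top-five ranking could look like the following. The function names, model names, and numbers are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean


def industry_score(benchmark_scores: dict[str, float]) -> float:
    """Combine one model's benchmark scores into a single industry score.

    ASSUMPTION: an unweighted mean, rounded to one decimal. The source page
    does not state the actual aggregation rule.
    """
    if not benchmark_scores:
        raise ValueError("need at least one benchmark score")
    return round(mean(list(benchmark_scores.values())), 1)


def rank_models(scores_by_model: dict[str, dict[str, float]], top_n: int = 5):
    """Return the top_n (model, industry score) pairs, best first."""
    ranked = sorted(
        ((model, industry_score(scores)) for model, scores in scores_by_model.items()),
        key=lambda pair: pair[1],
        reverse=True,  # every benchmark on this page is "higher is better"
    )
    return ranked[:top_n]


# Hypothetical data, for illustration only.
example = {
    "model-a": {"bench-1": 80.0, "bench-2": 90.0},
    "model-b": {"bench-1": 88.0, "bench-2": 86.0},
}
leaders = rank_models(example)
```

A mean is the simplest choice. A weighted scheme or a rank-based aggregate would be equally plausible readings of "scored across its benchmarks".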

Software Engineering

2 benchmarks

Shipping working code against real repositories: bug fixes, feature patches, and competitive programming under tests.

Anthropic · Claude Opus 4.8 · 88.6
Anthropic · Claude Opus 4.7 · 87.6
DeepSeek · DeepSeek V4-Pro · 87.1
Alibaba Qwen · Qwen3.7 Max · 86.0
DeepSeek · DeepSeek V4-Flash · 85.3

Investment Banking

3 benchmarks

Financial analysis over filings and credit agreements — valuation math, document QA, and the quantitative reasoning behind deals.

DeepSeek · DeepSeek V4-Flash · 87.2
NVIDIA · Nemotron 3 Ultra · 86.9
Alibaba Qwen · Qwen3.6-35B-A3B · 85.6
Amazon · Nova 2 Pro · 81.5
OpenAI · GPT-5.5 · 81.0

Corporate Law

3 benchmarks

Legal reasoning — issue spotting, rule application, and contract analysis — plus the broad knowledge a generalist counsel needs.

OpenAI · GPT-5.4 · 86.0
xAI · Grok 4.3 · 84.5
Anthropic · Claude Sonnet 4.6 · 82.1
Amazon · Nova 2 Pro · 81.6
Anthropic · Claude Haiku 4.5 · 81.2

Medicine

3 benchmarks

Clinical knowledge and diagnostic reasoning, including the medical coding accuracy and science depth that real practice demands.

DeepSeek · DeepSeek V4-Flash · 87.2
Alibaba Qwen · Qwen3.6-27B · 87.0
NVIDIA · Nemotron 3 Ultra · 86.9
Alibaba Qwen · Qwen3.6-35B-A3B · 85.6
Moonshot AI · Kimi K2 Thinking · 84.6

Scientific Research

3 benchmarks

Frontier problem solving — graduate-level science, the hardest multi-domain exams, and broad expert knowledge.

Amazon · Nova 2 Pro · 81.5
Meta AI · Llama 4 Maverick · 75.2
DeepSeek · DeepSeek V4-Pro · 71.8
Anthropic · Claude Opus 4.8 · 71.7
Anthropic · Claude Opus 4.7 · 70.6

Management Consulting

3 benchmarks

Broad analytical reasoning across business domains — structured problem solving over wide-ranging knowledge.

Amazon · Nova 2 Pro · 81.5
Meta AI · Llama 4 Maverick · 75.2
DeepSeek · DeepSeek V4-Pro · 71.8
Anthropic · Claude Opus 4.8 · 71.7
Anthropic · Claude Opus 4.7 · 70.6

Accounting & Audit

3 benchmarks

Numerically exact work over tax and financial documents — reconciliation, controls, and the arithmetic discipline audits demand.

NVIDIA · Nemotron 3 Ultra · 86.8
DeepSeek · DeepSeek V4-Flash · 86.2
Alibaba Qwen · Qwen3.6-35B-A3B · 85.2
Amazon · Nova 2 Pro · 81.6
DeepSeek · DeepSeek V4-Pro · 73.7

Language Models

Reasoning · 23 results

GPQA Diamond

Graduate-level, Google-proof science questions (physics, chemistry, biology) written by domain experts to resist web lookup.

% · higher is better
Frontier · 22 results

Humanity's Last Exam

A broad, extremely difficult exam across math, humanities, and science designed to remain unsaturated by frontier models. Reported here without external tools.

% · higher is better
Agentic Coding · 20 results

SWE-bench Verified

A human-validated subset of real GitHub issues; the model must produce a patch that passes the repository's tests. Figures are vendor-reported unless noted.

% · higher is better
Coding · 13 results

LiveCodeBench

Contamination-resistant competitive-programming problems collected over time to avoid training-set overlap.

% · higher is better
Knowledge · 10 results

MMLU-Pro

A harder, cleaned-up successor to MMLU spanning 57+ subjects with 10-way multiple choice and reasoning-heavy items.

% · higher is better
Human Preference · 25 results

LMArena Elo

Crowd-sourced pairwise preference rating from blind head-to-head chats (LMArena, formerly LMSYS). Unbounded; ~1000 is the historical anchor.

elo · higher is better
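To make the Elo scale concrete: a rating gap maps to an expected head-to-head preference rate. The sketch below assumes the classic Elo logistic curve with a 400-point scale. The page does not describe LMArena's exact fitting procedure, so treat this as an interpretation aid rather than a description of how the ratings are produced. The function name is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes model A wins against model B.

    ASSUMPTION: the classic Elo logistic with a 400-point scale; the source
    page does not state the exact model behind its ratings.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Equal ratings mean an even split of preferences.
even = expected_win_rate(1000, 1000)

# A 100-point lead corresponds to winning roughly 64% of blind comparisons.
lead_100 = expected_win_rate(1100, 1000)
```

Because only the difference between two ratings enters the formula, the scale has no fixed ceiling, which matches the "unbounded" note on the card.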
Composite · 31 results

AA Intelligence Index

Artificial Analysis Intelligence Index (v4.0) — an independent composite across ~10 evaluations (incl. GPQA Diamond, HLE, Terminal-Bench, SciCode, GDPval, τ²-Bench). The de-facto cross-model standard; higher is better. Shown for reference and normalized relative to this set.

pts · higher is better
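The card says the index is "normalized relative to this set" without giving the formula. One common reading is min-max scaling across the models shown, sketched below. This is an assumption, not the page's documented method, and the model names and raw values are hypothetical.

```python
def normalize_to_set(raw_scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw index values to 0-100 relative to the models in the set.

    ASSUMPTION: min-max scaling, so the lowest model maps to 0 and the
    highest to 100. The source page does not specify its normalization.
    """
    if not raw_scores:
        return {}
    lowest = min(raw_scores.values())
    highest = max(raw_scores.values())
    if highest == lowest:
        # All models tie; place them all at the top of the scale.
        return {name: 100.0 for name in raw_scores}
    span = highest - lowest
    return {name: 100.0 * (value - lowest) / span for name, value in raw_scores.items()}


# Hypothetical raw index values, for illustration only.
normalized = normalize_to_set({"model-a": 40.0, "model-b": 50.0, "model-c": 60.0})
```

Under this scheme a model's normalized value changes whenever the set's minimum or maximum changes, so normalized figures are only comparable within the same set.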
Finance · 26 results

Corporate Finance

Vals AI CorpFin v2 — expert-built questions over long-context corporate credit agreements; an independent, in-house-run finance benchmark.

% · higher is better
Law · 24 results

LegalBench

Legal-reasoning task suite (originated by Stanford CodeX), run independently by Vals AI and reported as overall accuracy across tasks.

% · higher is better
Tax & Accounting · 25 results

TaxEval

Vals AI TaxEval v2 — 1,500+ expert-written tax questions, scored on overall accuracy. Independent, in-house-run.

% · higher is better
Medicine · 21 results

Medical Coding

Vals AI MedCode — accuracy of ICD-10-CM diagnosis coding for the medical billing process. Independent, expert-built dataset.

% · higher is better

World Models

emerging