Industry · 3 benchmarks
Scientific Research
Frontier problem solving — graduate-level science, the hardest multi-domain exams, and broad expert knowledge.
The Scientific Research score is the mean of a model’s normalized 0–100 scores across the 3 benchmarks below. Normalization is direction-aware, so lower-is-better metrics are inverted. This is the same figure the leaderboard’s industry view ranks by.
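As a rough illustration of that calculation, here is a minimal Python sketch. The page does not say how raw scores are normalized, so the min-max scaling, the `normalize` and `industry_score` helper names, the 0 to 100 bounds, and the sample values are all assumptions made for the example, not the leaderboard's actual implementation or data.

```python
from statistics import mean


def normalize(value, lo, hi, higher_is_better=True):
    """Map a raw value onto 0-100 (assumed min-max scaling).

    Lower-is-better metrics are inverted so that 100 is always best.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    scaled = (value - lo) / (hi - lo) * 100.0
    scaled = max(0.0, min(100.0, scaled))  # clamp to the 0-100 range
    return scaled if higher_is_better else 100.0 - scaled


def industry_score(raw_scores, specs):
    """Average the normalized scores over every benchmark listed in specs.

    specs maps benchmark name -> (lo, hi, higher_is_better).
    """
    return mean(
        normalize(raw_scores[name], lo, hi, better)
        for name, (lo, hi, better) in specs.items()
    )


# Hypothetical bounds: all three benchmarks report a percentage,
# where higher is better.
SPECS = {
    "GPQA Diamond": (0.0, 100.0, True),
    "Humanity's Last Exam": (0.0, 100.0, True),
    "MMLU-Pro": (0.0, 100.0, True),
}

# Made-up raw scores for an imaginary model, for illustration only.
example_model = {
    "GPQA Diamond": 80.0,
    "Humanity's Last Exam": 60.0,
    "MMLU-Pro": 100.0,
}

print(industry_score(example_model, SPECS))
```

All three benchmarks here are percentages where higher is better, so the inversion branch only matters for a lower-is-better metric such as an error rate or latency.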
Leaders
Nova 2 Pro leads this industry with a score of 81.5.
Benchmarks in this score
Each model’s scores on these benchmarks are normalized and averaged to produce the industry score above.
Reasoning (%)
GPQA Diamond
Graduate-level, Google-proof science questions (physics, chemistry, biology) written by domain experts to resist web lookup.
Frontier (%)
Humanity's Last Exam
A broad, extremely difficult exam across math, humanities, and science designed to remain unsaturated by frontier models. Reported here without external tools.
Knowledge (%)
MMLU-Pro
A harder, cleaned-up successor to MMLU spanning 14 subject areas, with 10-way multiple choice and reasoning-heavy items.
