Industry · 2 benchmarks
Software Engineering
Shipping working code against real repositories: bug fixes, feature patches, and competitive programming under tests.
The Software Engineering score is the mean of a model’s normalized 0–100 scores across the 2 benchmarks below. Normalization is direction-aware, so lower-is-better metrics are inverted before averaging. This is the same figure the leaderboard’s industry view ranks by.
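The calculation described above can be sketched roughly as follows. This is only an illustration: the page does not say how scores are normalized, so min-max scaling against per-benchmark bounds is an assumption here, and the names `normalize` and `industry_score`, the bounds, and the input values are all hypothetical.

```python
from statistics import mean


def normalize(value, lo, hi, higher_is_better=True):
    """Scale a raw score to 0-100 (ASSUMED min-max scaling).

    For lower-is-better metrics the scaled value is inverted,
    so that 100 always means "best".
    """
    if hi == lo:
        raise ValueError("hi and lo must differ")
    scaled = 100.0 * (value - lo) / (hi - lo)
    return scaled if higher_is_better else 100.0 - scaled


def industry_score(results):
    """Average the normalized scores of one model.

    `results` is a list of (value, lo, hi, higher_is_better) tuples,
    one per benchmark in the industry.
    """
    return mean(normalize(*r) for r in results)


# Hypothetical raw scores for one model on two benchmarks;
# these are placeholders, not real leaderboard values.
example = [
    (80.0, 0.0, 100.0, True),
    (60.0, 0.0, 100.0, True),
]
score = industry_score(example)
print(score)
```

Both benchmarks in this industry report percentages where higher is better, so the inversion branch would only come into play for a metric such as error rate or latency.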
Leaders
Claude Opus 4.8 leads this industry with a score of 88.6.
Benchmarks in this score
Each model’s scores on these are normalized and averaged to produce the industry score above.
Agentic Coding (%)
SWE-bench Verified
A human-validated subset of real GitHub issues; the model must produce a patch that passes the repository's tests. Figures are vendor-reported unless noted.
Coding (%)
LiveCodeBench
Contamination-resistant competitive-programming problems collected over time to avoid training-set overlap.
