All benchmarks
Leaderboard view · CHI = mean of the benchmarks below (normalized 0–100, direction-aware).
Reasoning, coding, knowledge, and multimodal understanding.
| # | Model | CHI | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 77.3 | 94.2% | 46.9% | 87.6% | — | — | 1,493 | 57 | 66.1% | 85.3% | 75.3% | 54.9% |
| 2 | Claude Opus 4.8 | 76.4 | 93.6% | 49.8% | 88.6% | — | — | — | 61 | 66.7% | 83.6% | 75.6% | 53.2% |
| 3 | Gemini 3.1 Pro | 75.9 | 94.3% | 44.4% | 80.6% | — | — | 1,488 | 57 | 64.5% | 87.4% | 72.9% | 59.1% |
| 4 | Claude Opus 4.6 | 74.7 | 91.3% | 40% | 80.8% | — | — | 1,498 | 53 | 67% | 85.3% | 76% | 49.1% |
| 5 | Qwen3.7 Max | 73.2 | 92.4% | 41.4% | 80.4% | 91.6% | — | 1,474 | 57 | 63.7% | 84.9% | 75.3% | 38.8% |
| 6 | GPT-5.5 | 73.0 | 93.6% | 41.4% | — | — | — | 1,474 | 60 | 68.4% | 86.5% | 75% | 49.1% |
| 7 | Claude Sonnet 4.6 | 72.8 | — | — | 79.6% | — | — | 1,471 | 44 | 65.3% | 82.1% | 77.1% | — |
| 8 | Gemini 3 Pro | 71.1 | 91.9% | 37.5% | 76.2% | — | — | 1,486 | 48 | 63.7% | 87% | 72.6% | 52.2% |
| 9 | DeepSeek V4-Pro | 70.7 | 90.1% | 37.7% | 80.6% | 93.5% | 87.5% | 1,457 | 52 | 61.4% | 80.3% | 72.1% | 40.5% |
| 10 | Kimi K2.6 | 70.5 | 90.5% | 34.7% | 80.2% | 89.6% | — | 1,462 | 54 | 66.7% | 84.7% | 74.7% | 40.1% |
| 11 | GPT-5.4 | 70.4 | — | — | — | — | — | 1,467 | 57 | 65.3% | 86% | 74% | 41.3% |
| 12 | Qwen3.6-27B | 70.1 | 87.8% | 24% | 77.2% | 83.9% | 86.2% | — | 46 | 62.3% | — | 71.3% | — |
| 13 | Gemini 3.5 Flash | 68.8 | — | 40.2% | — | — | — | 1,477 | 55 | 64.7% | 83.6% | 74.4% | 55.8% |
| 14 | Qwen3.6-35B-A3B | 68.0 | 86% | 21.4% | 73.4% | 80.4% | 85.2% | — | 43 | — | — | — | — |
| 15 | DeepSeek V4-Flash | 67.9 | 88.1% | 34.8% | 79% | 91.6% | 86.2% | 1,433 | 47 | — | — | — | — |
| 16 | Gemini 3 Flash | 66.8 | 90.4% | 33.7% | 78% | — | — | 1,473 | 35 | 66.4% | 86.9% | 73.9% | 55.9% |
| 17 | Muse Spark | 66.5 | — | 39.9% | — | — | — | — | 52 | 65.1% | 84.2% | 77.7% | 51.3% |
| 18 | GLM-5.1 | 66.4 | 86.2% | 31% | — | — | — | 1,475 | 51 | 64.5% | 84.4% | 71.2% | 41.6% |
| 19 | Grok 4.20 | 65.6 | — | — | — | — | — | 1,473 | 49 | 63.7% | 77.7% | 74.1% | 32.2% |
| 20 | Kimi K2 Thinking | 65.5 | 84.5% | 23.9% | 71.3% | 83.1% | 84.6% | 1,444 | 41 | 60.6% | 80.2% | 71.7% | — |
| 21 | GPT-5.2 | 65.3 | 92.4% | 34.5% | 80% | — | — | 1,435 | 51 | 65.9% | 82.8% | 75.8% | 49.7% |
| 22 | Grok 4.3 | 64.2 | — | — | — | — | — | 1,446 | 53 | 68.5% | 84.5% | 70.8% | 38.1% |
| 23 | Nemotron 3 Ultra | 63.8 | 87% | 26.7% | 71.9% | 89% | 86.8% | 1,422 | 48 | — | — | — | — |
| 24 | Nova 2 Pro | 63.6 | 81.4% | — | 61.5% | 74.6% | 81.6% | — | 23 | — | — | — | — |
| 25 | DeepSeek V3.2 | 61.2 | 82.4% | 25.1% | 73.1% | 83.3% | 85% | 1,437 | 32 | 51% | 76.1% | 68.2% | — |
| 26 | Gemini 2.5 Pro | 54.1 | 86.4% | 21.6% | 59.6% | 69% | — | 1,446 | 35 | 60.8% | — | — | 50.6% |
| 27 | Llama 4 Maverick | 54.1 | 69.8% | — | — | 43.4% | 80.5% | — | 18 | 49.7% | 77.8% | 66.6% | 36.5% |
| 28 | Claude Haiku 4.5 | 50.2 | — | — | 73.3% | — | — | 1,411 | 31 | 60.6% | 81.2% | 67.5% | 32.7% |
| 29 | Mistral Large 3 | 47.4 | 43.9% | — | — | — | — | 1,418 | 23 | 61% | 79.1% | 73.1% | — |
| 30 | Llama 4 Scout | 45.2 | 57.2% | — | — | 32.8% | 74.3% | — | 14 | 46.8% | 72% | 55.2% | 23.3% |
| 31 | Doubao Seed 2.0 Pro (Proprietary) | — | — | 54.2% | — | — | — | 1,455 | — | — | — | — | — |
| 32 | MiniMax M3 (Proprietary) | — | — | — | — | — | — | 1,449 | 55 | — | — | — | — |
All figures are sourced but unverified.
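The page defines CHI only as a mean of the benchmarks, normalized to 0–100 in a direction-aware way. It does not give the scaling bounds or say how missing values (the "—" cells) are treated. The sketch below shows one plausible reading, assuming min-max scaling against fixed per-benchmark bounds, flipping benchmarks where lower is better, and skipping benchmarks with no reported value. The function names, benchmark names, bounds, and scores are all hypothetical and are not taken from the leaderboard.

```python
from statistics import mean
from typing import Optional


def normalize(value: float, lo: float, hi: float,
              higher_is_better: bool = True) -> float:
    """Min-max scale `value` onto 0-100; flip when lower raw values are better.

    ASSUMPTION: the leaderboard does not document its bounds, so `lo` and
    `hi` are illustrative inputs supplied by the caller.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    scaled = (value - lo) / (hi - lo) * 100.0
    scaled = max(0.0, min(100.0, scaled))  # clamp out-of-range values
    return scaled if higher_is_better else 100.0 - scaled


def composite(scores: dict[str, Optional[float]],
              specs: dict[str, tuple[float, float, bool]]) -> Optional[float]:
    """Mean of normalized scores over the benchmarks that have a value.

    ASSUMPTION: missing scores (None) are skipped rather than counted as 0.
    Returns None when no benchmark has a value.
    """
    normalized = [
        normalize(value, *specs[name])
        for name, value in scores.items()
        if value is not None
    ]
    return mean(normalized) if normalized else None


# Hypothetical benchmarks: (lower bound, upper bound, higher_is_better)
specs = {
    "accuracy_pct": (0.0, 100.0, True),
    "arena_elo": (1000.0, 1600.0, True),
    "error_rate_pct": (0.0, 100.0, False),  # lower is better, so it is flipped
}

# Hypothetical model with one missing score
scores = {"accuracy_pct": 90.0, "arena_elo": 1450.0, "error_rate_pct": None}

print(composite(scores, specs))  # mean of 90.0 and 75.0
```

Skipping missing values keeps sparsely evaluated models comparable, but it also means two composites may average different benchmark sets, which is worth keeping in mind when comparing rows with many "—" cells.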