
SWE-bench Verified

A human-validated subset of real GitHub issues; the model must produce a patch that passes the repository's tests. Figures are vendor-reported unless noted.
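
As a rough illustration of how a percentage score of this kind could be derived, the sketch below counts the share of issues whose generated patch passed the repository's tests. It is not the benchmark's official harness: the function name `resolve_rate`, the issue identifiers, and the outcomes are all hypothetical.

```python
from typing import Mapping


def resolve_rate(results: Mapping[str, bool]) -> float:
    """Return the percentage of issues whose patch passed the repository's tests."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for passed in results.values() if passed)
    return 100.0 * resolved / len(results)


# Hypothetical per-issue outcomes: True means the generated patch passed the tests.
example = {
    "issue-a": True,
    "issue-b": False,
    "issue-c": True,
    "issue-d": True,
}

print(f"{resolve_rate(example):.1f}%")  # prints "75.0%"
```

Because the metric is a simple pass fraction, higher is better, and small gaps between models can correspond to only a few issues.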

Domain: Agentic Coding
Metric: %
Orientation: Higher is better
Results: 20

Ranking

# | Model | Developer | Score | Source | Source type | Status
1 | Claude Opus 4.8 | Anthropic | 88.6% | Anthropic - Claude Opus 4.8 | vendor | unverified
2 | Claude Opus 4.7 | Anthropic | 87.6% | Anthropic - Claude Opus 4.7 | vendor | unverified
3 | Claude Opus 4.6 | Anthropic | 80.8% | Anthropic - Claude Opus 4.6 | vendor | unverified
4 | Gemini 3.1 Pro | Google DeepMind | 80.6% | Google DeepMind - Gemini 3.1 Pro model card | vendor | unverified
5 | DeepSeek V4-Pro | DeepSeek | 80.6% | DeepSeek - V4-Pro model card | vendor | unverified
6 | Qwen3.7 Max | Alibaba Qwen | 80.4% | Qwen - Qwen3.7 Max | vendor | unverified
7 | Kimi K2.6 | Moonshot AI | 80.2% | Moonshot - Kimi K2.6 model card | vendor | unverified
8 | GPT-5.2 | OpenAI | 80% | llm-stats - GPT-5.2 (vendor-reported) | 3rd-party | unverified
9 | Claude Sonnet 4.6 | Anthropic | 79.6% | Anthropic - Claude Sonnet 4.6 | vendor | unverified
10 | DeepSeek V4-Flash | DeepSeek | 79% | DeepSeek - V4-Flash model card | vendor | unverified
11 | Gemini 3 Flash | Google DeepMind | 78% | Google - Gemini 3 Flash | vendor | unverified
12 | Qwen3.6-27B | Alibaba Qwen | 77.2% | Alibaba - Qwen3.6-27B model card | vendor | unverified
13 | Gemini 3 Pro | Google DeepMind | 76.2% | Google - Gemini 3 Pro | vendor | unverified
14 | Qwen3.6-35B-A3B | Alibaba Qwen | 73.4% | Alibaba - Qwen3.6-35B-A3B model card | vendor | unverified
15 | Claude Haiku 4.5 | Anthropic | 73.3% | Anthropic - Claude Haiku 4.5 | vendor | unverified
16 | DeepSeek V3.2 | DeepSeek | 73.1% | DeepSeek - V3.2 technical report | vendor | unverified
17 | Nemotron 3 Ultra | NVIDIA | 71.9% | NVIDIA - Nemotron 3 Ultra model card | vendor | unverified
18 | Kimi K2 Thinking | Moonshot AI | 71.3% | Moonshot - Kimi K2 Thinking model card | vendor | unverified
19 | Nova 2 Pro | Amazon | 61.5% | Amazon - Nova 2 technical report | vendor | unverified
20 | Gemini 2.5 Pro | Google DeepMind | 59.6% | Google DeepMind - Gemini 2.5 Pro model card | vendor | unverified