About & methodology

Crosshair Intelligence is an open leaderboard built on one principle: a benchmark number is only as good as the source behind it. Here is how the board works and where it is going.

How scoring works

Models are compared on a shared set of benchmarks per category. Raw scores live in different units — accuracies, pass rates, Elo, Fréchet distances — so each is normalized to a 0–100 scale:

Bounded metrics (anything with a known maximum, like a percentage) are scaled against that maximum.
Unbounded metrics (Elo, FVD, the Artificial Analysis Intelligence Index) are scaled by their min–max within the column.
Lower-is-better metrics (e.g. FVD) are inverted, so a higher normalized value always means “better.”
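The three rules above can be sketched in a few lines of TypeScript. This is an illustration, not the board's actual code: the MetricSpec shape, the normalizeColumn name, and the choice to map a column with identical values to 50 are all assumptions made for the example.

```typescript
// Hypothetical metric description; the real benchmarks.ts may use other fields.
type Direction = "higher" | "lower";

interface MetricSpec {
  direction: Direction;
  max?: number; // known maximum for bounded metrics; omitted for unbounded ones
}

// Normalize one benchmark column to a 0-100 scale where higher is always better.
function normalizeColumn(raw: number[], spec: MetricSpec): number[] {
  const lo = Math.min(...raw);
  const hi = Math.max(...raw);
  return raw.map((value) => {
    let scaled: number;
    if (spec.max !== undefined) {
      // Bounded metric: scale against the known maximum.
      scaled = (value * 100) / spec.max;
    } else if (hi === lo) {
      // Assumption: a column with no spread carries no signal, so use the midpoint.
      scaled = 50;
    } else {
      // Unbounded metric: min-max position within the column.
      scaled = ((value - lo) / (hi - lo)) * 100;
    }
    // Lower-is-better metrics are inverted.
    return spec.direction === "lower" ? 100 - scaled : scaled;
  });
}
```

Under these assumptions, an Elo column of 1200, 1300, 1400 maps to 0, 50, 100, and a lower-is-better column of 100, 300 maps to 100, 0.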

The Crosshair Index averages a model’s standing relative to the field on each benchmark (its min–max position within that column) rather than its raw normalized score. This keeps any one benchmark from dominating by its absolute difficulty: an easy benchmark where everyone clusters near 90% can’t outweigh a brutal one like Humanity’s Last Exam where everyone clusters near 40%, and a model can’t climb just by reporting more of the easy ones. Benchmarks a model hasn’t disclosed aren’t dropped from the average either: they’re imputed at a mild, slightly-below-median floor, so omitting a benchmark can only ever cost a model, never lift it. The index is shown only when a model covers at least 40% of its category’s benchmarks; otherwise a single cherry-picked result could carry it. The Artificial Analysis Intelligence Index is itself a composite, included here as one input alongside the individual benchmarks.
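A minimal sketch of that composite follows. The 40% coverage threshold comes from the text; everything else is an assumption for illustration, including the ScoreTable shape, the crosshairIndex name, the imputed position of 45 (the page says only "slightly below median"), and the expectation that scores are already oriented so that higher is better.

```typescript
// Hypothetical layout: model id -> benchmark id -> higher-is-better score.
type ScoreTable = Record<string, Record<string, number | undefined>>;

const MIN_COVERAGE = 0.4; // from the methodology text
const IMPUTED_POSITION = 45; // assumption: the exact floor is not stated

function crosshairIndex(
  table: ScoreTable,
  benchmarks: string[],
): Record<string, number | null> {
  // Min and max of the disclosed scores in each benchmark column.
  const range = new Map<string, { lo: number; hi: number }>();
  for (const b of benchmarks) {
    const values = Object.values(table)
      .map((row) => row[b])
      .filter((v): v is number => v !== undefined);
    if (values.length > 0) {
      range.set(b, { lo: Math.min(...values), hi: Math.max(...values) });
    }
  }

  const result: Record<string, number | null> = {};
  for (const [model, row] of Object.entries(table)) {
    let disclosed = 0;
    let total = 0;
    for (const b of benchmarks) {
      const value = row[b];
      const r = range.get(b);
      if (value === undefined || r === undefined) {
        total += IMPUTED_POSITION; // undisclosed: imputed, never dropped
        continue;
      }
      disclosed += 1;
      // Position within the field; a column with no spread maps to the midpoint.
      total += r.hi === r.lo ? 50 : ((value - r.lo) / (r.hi - r.lo)) * 100;
    }
    const coverage = disclosed / benchmarks.length;
    // Below the coverage threshold the index is withheld (null).
    result[model] = coverage >= MIN_COVERAGE ? total / benchmarks.length : null;
  }
  return result;
}
```

Because each term is a position within the field rather than a raw score, a model at the top of a hard column earns the same 100 as a model at the top of an easy one.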

Data policy & honesty

Sourced & cited, not independently verified. Figures are sourced from vendor model cards, Artificial Analysis, and the LMArena (arena.ai) & SWE-bench leaderboards as of June 2026, and are cited per cell. They are vendor- or third-party-reported and have NOT been independently reproduced by Crosshair (every score is marked unverified). Benchmarks and harnesses differ between vendors, so treat cross-model comparisons as directional. World-model figures are sparser still and come from non-overlapping benchmark suites, so that category shows per-benchmark cells only and is not combined into a composite index. Treat every number as a starting point, not a verdict. Last updated 2026-06-04.

Every score records where it came from. We distinguish four kinds of source, and only results we have reproduced ourselves are ever marked verified:

vendor

Self-reported by the model's creator. Useful, but unaudited and often run under favorable conditions.

paper

Published in a paper or technical report (arXiv, model card with methodology).

third-party

Measured by an independent evaluator — an arena, a lab, or a standardized harness.

crosshair-eval

Run by Crosshair's own evaluation harness (phase 2). These are the only figures we mark verified.
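The four source kinds and the verification rule map naturally onto a TypeScript union type. The sketch below is illustrative only: the kind names and the verified flag appear on this page, while the SourcedScore interface, its other field names, and the verifiedFlagIsValid helper are hypothetical.

```typescript
// The four source kinds named above.
type SourceKind = "vendor" | "paper" | "third-party" | "crosshair-eval";

// Hypothetical record shape; only `verified` and the kind names come from the page.
interface SourcedScore {
  value: number;
  sourceKind: SourceKind;
  sourceUrl: string;
  verified: boolean;
}

// Only crosshair-eval results may carry verified: true.
function verifiedFlagIsValid(score: SourcedScore): boolean {
  return !score.verified || score.sourceKind === "crosshair-eval";
}
```

A check like this could run in a test suite so that a vendor-reported figure can never be marked verified by accident.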

Roadmap

Phase 1 — Curated leaderboard

now

Static, transparent dataset of models, benchmarks, and scores.
Every score carries a cited source and a verified flag (current data is sourced but not independently reproduced).
Direction-aware normalization and the composite Crosshair Index.

Phase 2 — Live evaluations

Run benchmarks ourselves through the Vercel AI Gateway (one key, every provider).
Promote reproduced figures to source kind crosshair-eval, verified: true.
Re-run on a schedule so the board tracks new releases automatically.

Phase 3 — World models

later

Track world models (V-JEPA 2, Cosmos, Veo 3, Wan, HunyuanVideo, Sora, Genie 3, Marble, Ray 3) as a cited roster, attaching per-benchmark scores wherever models actually publish them.
Surface fragmented suites — video understanding & anticipation (Something-Something v2, EPIC-Kitchens-100, Perception Test), generative physics (Physics-IQ), and physical-AI generation (PAI-Bench) — without faking a composite the data can't support.
Co-develop benchmark definitions with the community as the field consolidates.

Contribute data

The dataset is plain TypeScript. To correct a number or add a model, edit the files under src/data/ and open a pull request:

models.ts — add the model with its provider, modalities, and license.
benchmarks.ts — define the benchmark, its metric, and direction.
scores.ts — add scores with a real source link; leave verified false unless you reproduced it.

Prefer primary sources (papers, model cards, standardized harnesses) over screenshots, and note the evaluation conditions where they matter.
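As a rough guide to what a contribution might look like, here is one placeholder entry per file plus a small pre-submission check. The real schemas under src/data/ may differ: every interface, field name, id, value, and URL below is a made-up placeholder, and lintScores is a hypothetical helper rather than part of the repository.

```typescript
// Hypothetical entry shapes for models.ts, benchmarks.ts, and scores.ts.
interface ModelEntry { id: string; provider: string; modalities: string[]; license: string }
interface BenchmarkEntry { id: string; metric: string; direction: "higher" | "lower" }
interface ScoreEntry { modelId: string; benchmarkId: string; value: number; sourceUrl: string; verified: boolean }

// Placeholder data only; none of these names or numbers are real.
const exampleModels: ModelEntry[] = [
  { id: "example-model", provider: "Example Lab", modalities: ["text"], license: "Apache-2.0" },
];
const exampleBenchmarks: BenchmarkEntry[] = [
  { id: "example-bench", metric: "accuracy (%)", direction: "higher" },
];
const exampleScores: ScoreEntry[] = [
  { modelId: "example-model", benchmarkId: "example-bench", value: 50,
    sourceUrl: "https://example.com/model-card", verified: false },
];

// Return a list of human-readable problems; an empty list means the entries look sane.
function lintScores(models: ModelEntry[], benchmarks: BenchmarkEntry[], scores: ScoreEntry[]): string[] {
  const modelIds = new Set(models.map((m) => m.id));
  const benchmarkIds = new Set(benchmarks.map((b) => b.id));
  const problems: string[] = [];
  for (const s of scores) {
    const label = `${s.modelId}/${s.benchmarkId}`;
    if (!modelIds.has(s.modelId)) problems.push(`${label}: unknown model`);
    if (!benchmarkIds.has(s.benchmarkId)) problems.push(`${label}: unknown benchmark`);
    if (!/^https?:\/\//.test(s.sourceUrl)) problems.push(`${label}: source must be a real link`);
    if (s.verified) problems.push(`${label}: verified should stay false unless reproduced`);
  }
  return problems;
}
```

Running lintScores(exampleModels, exampleBenchmarks, exampleScores) on the placeholder entries returns an empty list.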
