Crosshair

About & methodology

Crosshair Intelligence is an open leaderboard built on one principle: a benchmark number is only as good as the source behind it. Here is how the board works and where it is going.

How scoring works

Models are compared on a shared set of benchmarks per category. Raw scores live in different units — accuracies, pass rates, Elo, Fréchet distances — so each is normalized to a 0–100 scale:

  • Bounded metrics (anything with a known maximum, like a percentage) are scaled against that maximum.
  • Unbounded metrics (Elo, FVD, the AA Intelligence Index) are min-max scaled within the column, so the lowest score in the column maps to 0 and the highest to 100.
  • Lower-is-better metrics (e.g. FVD) are inverted, so a higher normalized value always means “better.”
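The three rules above can be sketched as a single function. This is a minimal illustration, not the board's actual code: the names `Direction`, `MetricSpec`, and `normalizeColumn` are hypothetical, and the handling of a column where every score is identical is an assumption the text does not cover.

```typescript
// Hypothetical shapes; the real benchmark definitions may differ.
type Direction = "higher" | "lower";

interface MetricSpec {
  direction: Direction;
  // Known maximum for bounded metrics (e.g. 100 for a percentage).
  // Leave undefined for unbounded metrics such as Elo or FVD.
  max?: number;
}

// Normalize one benchmark column to a 0-100 scale where higher is always better.
function normalizeColumn(raw: number[], spec: MetricSpec): number[] {
  const invert = (scaled: number): number =>
    spec.direction === "lower" ? 100 - scaled : scaled;

  if (spec.max !== undefined) {
    // Bounded metric: scale against the known maximum.
    const max = spec.max;
    return raw.map((v) => invert((v * 100) / max));
  }

  // Unbounded metric: min-max scale within the column.
  const lo = Math.min(...raw);
  const hi = Math.max(...raw);
  const span = hi - lo;
  if (span === 0) {
    // Assumption: a column with no spread gives every model full marks.
    return raw.map(() => 100);
  }
  return raw.map((v) => invert(((v - lo) * 100) / span));
}
```

Under these assumptions, an Elo column of `[1000, 1200, 1400]` normalizes to `[0, 50, 100]`, and a lower-is-better column of `[100, 300]` normalizes to `[100, 0]`.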

The Crosshair Index is the mean of a model’s normalized scores. It is shown only when the model covers at least 40% of its category’s benchmarks; otherwise a model with one cherry-picked result could top the chart. The Artificial Analysis Intelligence Index is itself a composite, and it is included here as one normalized input alongside the individual benchmarks.
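As a rough sketch of the coverage rule, the index can be computed as a plain mean that returns nothing below the 40% threshold. The function name `crosshairIndex`, the constant `MIN_COVERAGE`, and the input shapes are hypothetical; only the threshold and the use of the mean come from the text.

```typescript
// From the text: a model must cover at least 40% of its category's benchmarks.
const MIN_COVERAGE = 0.4;

// normalized: benchmark id -> normalized 0-100 score (missing if the model has no result).
// categoryBenchmarks: every benchmark id in the model's category.
// Returns the mean normalized score, or null when coverage is too low to show an index.
function crosshairIndex(
  normalized: Record<string, number | undefined>,
  categoryBenchmarks: string[],
): number | null {
  if (categoryBenchmarks.length === 0) return null;

  const present = categoryBenchmarks
    .map((id) => normalized[id])
    .filter((s): s is number => s !== undefined);

  const coverage = present.length / categoryBenchmarks.length;
  if (coverage < MIN_COVERAGE) return null;

  const sum = present.reduce((a, b) => a + b, 0);
  return sum / present.length;
}
```

With five benchmarks in a category, a model with two results (40% coverage) gets an index, while a model with a single result does not, however high that one score is.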

Data policy & honesty

Sourced & cited, not independently verified. Figures are sourced from vendor model cards, Artificial Analysis, and the LMArena (arena.ai) & SWE-bench leaderboards as of June 2026, and are cited per cell. They are vendor- or third-party-reported and have NOT been independently reproduced by Crosshair (every score is marked unverified). Benchmarks and harnesses differ between vendors, so treat cross-model comparisons as directional and every number as a starting point, not a verdict. Last updated 2026-06-04.

Every score records where it came from. We distinguish four kinds of source, and only results we reproduce ourselves are ever marked verified:

vendor

Self-reported by the model's creator. Useful, but unaudited and often run under favorable conditions.

paper

Published in a paper or technical report (arXiv, model card with methodology).

third-party

Measured by an independent evaluator — an arena, a lab, or a standardized harness.

crosshair-eval

Run by Crosshair's own evaluation harness (Phase 2). These are the only figures we mark verified.
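One way to express the four source kinds and the verification rule in TypeScript is a string union plus a small consistency check. The four kind names come from the definitions above; the type and function names (`SourceKind`, `SourcedScore`, `canBeVerified`, `flagIsConsistent`) are hypothetical and may not match the real dataset.

```typescript
// The four source kinds described above.
type SourceKind = "vendor" | "paper" | "third-party" | "crosshair-eval";

// Hypothetical minimal shape: just the two fields the rule depends on.
interface SourcedScore {
  kind: SourceKind;
  verified: boolean;
}

// Only figures run through Crosshair's own harness may be marked verified.
function canBeVerified(kind: SourceKind): boolean {
  return kind === "crosshair-eval";
}

// True when the verified flag does not claim more than the source kind allows.
function flagIsConsistent(score: SourcedScore): boolean {
  return !score.verified || canBeVerified(score.kind);
}
```

A check like this could run over the whole dataset to catch a vendor-reported score that was accidentally flagged as verified.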

Roadmap

Phase 1 — Curated leaderboard

now
  • Static, transparent dataset of models, benchmarks, and scores.
  • Every score carries a cited source and a verified flag (current data is sourced but not independently reproduced).
  • Direction-aware normalization and the composite Crosshair Index.

Phase 2 — Live evaluations

next
  • Run benchmarks ourselves through the Vercel AI Gateway (one key, every provider).
  • Promote reproduced figures to source kind crosshair-eval, verified: true.
  • Re-run on a schedule so the board tracks new releases automatically.

Phase 3 — World models

later
  • Stand up harnesses for physical prediction, planning, and video coherence.
  • Track V-JEPA, Genie, Cosmos, and newcomers like Kona as results mature.
  • Co-develop benchmark definitions with the community as the field consolidates.

Contribute data

The dataset is plain TypeScript. To correct a number or add a model, edit the files under src/data/ and open a pull request:

  • models.ts — add the model with its provider, modalities, and license.
  • benchmarks.ts — define the benchmark, its metric, and direction.
  • scores.ts — add scores with a real source link; leave verified false unless you reproduced it.

Prefer primary sources (papers, model cards, standardized harnesses) over screenshots, and note the evaluation conditions where they matter.
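To give a feel for what a contribution might look like, here is a sketch of one entry per file. Every field name, id, and value below is a hypothetical placeholder chosen for illustration; check the actual files under src/data/ for the real shapes before opening a pull request.

```typescript
// Hypothetical shapes for the three data files; the real ones may differ.
interface ModelEntry {
  id: string;
  name: string;
  provider: string;
  modalities: string[];
  license: string;
}

interface BenchmarkEntry {
  id: string;
  name: string;
  metric: string;
  direction: "higher" | "lower";
}

interface ScoreEntry {
  modelId: string;
  benchmarkId: string;
  value: number;
  sourceUrl: string;
  verified: boolean;
}

// Placeholder entries, not real models or results.
const exampleModel: ModelEntry = {
  id: "example-model",
  name: "Example Model",
  provider: "Example Lab",
  modalities: ["text"],
  license: "Apache-2.0",
};

const exampleBenchmark: BenchmarkEntry = {
  id: "example-bench",
  name: "Example Bench",
  metric: "accuracy (%)",
  direction: "higher",
};

const exampleScore: ScoreEntry = {
  modelId: "example-model",
  benchmarkId: "example-bench",
  value: 71.5, // placeholder, not a real result
  sourceUrl: "https://example.com/model-card",
  verified: false, // stays false unless you reproduced the result yourself
};

// Collects basic problems with a score entry: dangling ids or a missing source link.
function findProblems(
  score: ScoreEntry,
  models: ModelEntry[],
  benchmarks: BenchmarkEntry[],
): string[] {
  const problems: string[] = [];
  if (!models.some((m) => m.id === score.modelId)) {
    problems.push(`unknown model: ${score.modelId}`);
  }
  if (!benchmarks.some((b) => b.id === score.benchmarkId)) {
    problems.push(`unknown benchmark: ${score.benchmarkId}`);
  }
  if (!/^https?:\/\//.test(score.sourceUrl)) {
    problems.push("source link is missing or not a URL");
  }
  return problems;
}
```

Running a check like `findProblems` locally before submitting is one way to catch a score that points at a model or benchmark that has not been added yet.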
