About & methodology
Crosshair Intelligence is an open leaderboard built on one principle: a benchmark number is only as good as the source behind it. Here is how the board works and where it is going.
How scoring works
Models are compared on a shared set of benchmarks per category. Raw scores live in different units — accuracies, pass rates, Elo, Fréchet distances — so each is normalized to a 0–100 scale:
- Bounded metrics (anything with a known maximum, like a percentage) are scaled against that maximum.
- Unbounded metrics (Elo, FVD, the AA Intelligence Index) are scaled by their min–max within the column.
- Lower-is-better metrics (e.g. FVD) are inverted, so a higher normalized value always means “better.”
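The three rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Benchmark` shape, the `normalizeColumn` name, inversion as `100 - x`, and the handling of a column where every value is equal are all assumptions made for the example.

```typescript
// Hypothetical shape; the real fields in src/data/benchmarks.ts may differ.
type Direction = "higher" | "lower";

interface Benchmark {
  id: string;
  direction: Direction;
  max?: number; // known maximum for bounded metrics, e.g. 100 for a percentage
}

// Normalize one benchmark column (one raw score per model) to a 0-100 scale.
function normalizeColumn(benchmark: Benchmark, raw: number[]): number[] {
  if (raw.length === 0) return [];

  let scaled: number[];
  if (benchmark.max !== undefined) {
    // Bounded metric: scale against the known maximum.
    const max = benchmark.max;
    scaled = raw.map((v) => (v * 100) / max);
  } else {
    // Unbounded metric: min-max within the column.
    const lo = Math.min(...raw);
    const hi = Math.max(...raw);
    const span = hi - lo;
    // Assumption: if every model has the same raw value, treat them all as 100.
    if (span === 0) return raw.map(() => 100);
    scaled = raw.map((v) => ((v - lo) * 100) / span);
  }

  // Lower-is-better metrics are flipped so that higher always means better.
  return benchmark.direction === "lower" ? scaled.map((s) => 100 - s) : scaled;
}
```

For instance, an unbounded lower-is-better column with raw values 100 and 300 would come out as 100 and 0 under this sketch, so the model with the lower raw value ranks first.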
The Crosshair Index is the mean of a model’s normalized scores. It is shown only when the model covers at least 40% of its category’s benchmarks; otherwise a model with one cherry-picked result could top the chart. The Artificial Analysis Intelligence Index is itself a composite, included here as one normalized input alongside the individual benchmarks.
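As a rough sketch of the coverage rule, the index could be computed like this. The function name, the `MIN_COVERAGE` constant, and the use of `null` for "not shown" are assumptions for illustration; only the 40% threshold and the plain mean come from the description above.

```typescript
// Minimum share of a category's benchmarks a model must cover (40%, per the text).
const MIN_COVERAGE = 0.4;

// normalizedScores: benchmark id -> normalized 0-100 score for one model.
// categoryBenchmarkIds: every benchmark id in the model's category.
// Returns the mean normalized score, or null when coverage is too low to show an index.
function crosshairIndex(
  normalizedScores: Record<string, number>,
  categoryBenchmarkIds: string[],
): number | null {
  if (categoryBenchmarkIds.length === 0) return null;

  // Keep only the benchmarks this model actually has a score for.
  const present = categoryBenchmarkIds
    .map((id) => normalizedScores[id])
    .filter((s): s is number => typeof s === "number");

  const coverage = present.length / categoryBenchmarkIds.length;
  if (coverage < MIN_COVERAGE) return null;

  const sum = present.reduce((a, b) => a + b, 0);
  return sum / present.length;
}
```

With five benchmarks in a category, a model with a single score gets no index under this sketch, while a model with two scores of 80 and 60 gets an index of 70.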
Data policy & honesty
Every score records where it came from. We distinguish four kinds of source, and only reproduce-it-ourselves results are ever marked verified:
- Self-reported by the model's creator. Useful, but unaudited and often run under favorable conditions.
- Published in a paper or technical report (arXiv, model card with methodology).
- Measured by an independent evaluator: an arena, a lab, or a standardized harness.
- Run by Crosshair's own evaluation harness (phase 2). These are the only figures we mark verified.
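One way to encode this policy in the dataset's types is sketched below. Only the `crosshair-eval` kind and the `verified` flag are named on this page; the other three kind names, the `Score` shape, and the `verifiedFlagIsConsistent` helper are hypothetical and may not match the real files under src/data/.

```typescript
// "crosshair-eval" is the kind named in the roadmap; the other three names are placeholders.
type SourceKind = "self-reported" | "paper" | "independent" | "crosshair-eval";

// Hypothetical score record; the real shape in src/data/scores.ts may differ.
interface Score {
  modelId: string;
  benchmarkId: string;
  value: number;
  source: { kind: SourceKind; url: string };
  verified: boolean;
}

// A score may be marked verified only when Crosshair's own harness produced it.
function verifiedFlagIsConsistent(score: Score): boolean {
  return !score.verified || score.source.kind === "crosshair-eval";
}

// Placeholder entry with made-up identifiers and an example.com URL.
const example: Score = {
  modelId: "example-model",
  benchmarkId: "example-benchmark",
  value: 71.5,
  source: { kind: "paper", url: "https://example.com/technical-report" },
  verified: false,
};

console.log(verifiedFlagIsConsistent(example));
```

A check like this could run over the whole dataset in CI so that a contributed score cannot claim verified status from a self-reported or third-party source.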
Roadmap
Phase 1 — Curated leaderboard
Status: now
- Static, transparent dataset of models, benchmarks, and scores.
- Every score carries a cited source and a verified flag (current data is sourced but not independently reproduced).
- Direction-aware normalization and the composite Crosshair Index.
Phase 2 — Live evaluations
Status: next
- Run benchmarks ourselves through the Vercel AI Gateway (one key, every provider).
- Promote reproduced figures to source kind crosshair-eval, verified: true.
- Re-run on a schedule so the board tracks new releases automatically.
Phase 3 — World models
Status: later
- Stand up harnesses for physical prediction, planning, and video coherence.
- Track V-JEPA, Genie, Cosmos, and newcomers like Kona as results mature.
- Co-develop benchmark definitions with the community as the field consolidates.
Contribute data
The dataset is plain TypeScript. To correct a number or add a model, edit the files under src/data/ and open a pull request:
- models.ts: add the model with its provider, modalities, and license.
- benchmarks.ts: define the benchmark, its metric, and direction.
- scores.ts: add scores with a real source link; leave verified set to false unless you reproduced the result yourself.
Prefer primary sources (papers, model cards, standardized harnesses) over screenshots, and note the evaluation conditions where they matter.
