Updated Weekly

AI Benchmark Index

Performance benchmarks across reasoning, coding, math, and multimodal tasks for frontier and open-source models. Automated weekly updates from public leaderboards.

View Dataset on GitHub →

What This Index Covers

🧠 Reasoning Benchmarks

MMLU, ARC, HellaSwag, and other reasoning evaluations tracking logical inference and knowledge retrieval across model families.

💻 Coding Benchmarks

HumanEval, MBPP, SWE-bench, and other code generation evaluations measuring programming capability across languages and complexity levels; most report pass@k (see the sketch after these cards).

📊 Math Benchmarks

GSM8K, MATH, and competition-level mathematics evaluations tracking numerical reasoning and problem-solving performance.

🖼 Multimodal Benchmarks

Vision-language evaluations, image understanding, and cross-modal reasoning benchmarks for multimodal AI systems.
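
Most coding leaderboards report pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. Below is a minimal sketch of the standard unbiased estimator from Chen et al. (2021), assuming n completions per problem of which c pass; the function name and example numbers are illustrative, not part of this index's tooling.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled for a problem
    c: completions that pass the unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples, 42 passing, scored at k = 10.
print(pass_at_k(200, 42, 10))
```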

Methodology

Data is collected weekly via automated pipelines from public leaderboards, model cards, academic papers, and reproducible evaluation frameworks. All collection scripts are open source and auditable in the dataset repository.
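
As an illustration of what one of those collectors looks like, here is a minimal sketch of the fetch-normalize-snapshot loop. The endpoint URL, field names, and output path are placeholders (real sources and schemas vary per leaderboard), and it assumes the third-party requests library is installed.

```python
import csv
import requests

# Hypothetical endpoint and schema, for illustration only.
LEADERBOARD_URL = "https://example.com/leaderboard.json"

def fetch_scores(url: str) -> list[dict]:
    """Download a public leaderboard and keep only the tracked fields."""
    rows = requests.get(url, timeout=30).json()
    return [
        {
            "model": r["model"],
            "benchmark": r["benchmark"],
            "score": float(r["score"]),
        }
        for r in rows
    ]

def write_snapshot(rows: list[dict], path: str) -> None:
    """Write this week's snapshot as a CSV the dataset can diff."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "benchmark", "score"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    write_snapshot(fetch_scores(LEADERBOARD_URL), "snapshot.csv")
```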

Models Tracked: 40+
Update Frequency: Weekly
Open Source: 100%