Performance benchmarks across reasoning, coding, math, and multimodal tasks for frontier and open-source models, with automated weekly updates from public leaderboards.

Reasoning Benchmarks
MMLU, ARC, HellaSwag, and other reasoning evaluations tracking logical inference and knowledge retrieval across model families.
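MMLU, ARC, and HellaSwag are all framed as multiple-choice tasks, so their headline numbers reduce to plain accuracy over the selected options. A minimal scoring sketch, assuming hypothetical records with question, choices, and answer fields and a predict callable supplied by the caller (not part of this dataset's tooling):

from typing import Callable

def multiple_choice_accuracy(records: list[dict],
                             predict: Callable[[str, list[str]], int]) -> float:
    # Each record is assumed to look like:
    #   {"question": str, "choices": [str, ...], "answer": int}
    # `predict` returns the index of the option the model selects.
    correct = 0
    for rec in records:
        chosen = predict(rec["question"], rec["choices"])
        correct += int(chosen == rec["answer"])
    return correct / len(records) if records else 0.0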
Coding Benchmarks
HumanEval, MBPP, SWE-bench, and other code generation evaluations measuring programming capability across languages and complexity levels.
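HumanEval and MBPP results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the benchmark's unit tests. A sketch of the standard unbiased estimator for a single problem, where n samples were generated and c of them passed:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    # If fewer than k samples failed, at least one success is guaranteed.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

Averaging pass_at_k over all problems gives the reported score; SWE-bench instead reports the fraction of repository issues resolved end-to-end.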
Math Benchmarks
GSM8K, MATH, and competition-level mathematics evaluations tracking numerical reasoning and problem-solving performance.
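GSM8K and MATH are typically scored by exact match on the final answer rather than on the worked solution. A rough sketch for the GSM8K-style case where the answer is a single number (MATH answers generally need LaTeX normalisation, omitted here); the extraction heuristic below is an illustrative assumption, not the dataset's scoring code:

import re

def extract_final_number(text: str) -> str | None:
    # GSM8K references place the result after '####'; model outputs vary,
    # so this heuristic simply takes the last number in the text.
    cleaned = text.replace(",", "")
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return matches[-1] if matches else None

def exact_match(prediction: str, reference: str) -> bool:
    # Score 1 if the final numeric answers agree, 0 otherwise.
    return extract_final_number(prediction) == extract_final_number(reference)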
Multimodal Benchmarks
Vision-language evaluations, image understanding, and cross-modal reasoning benchmarks for multimodal AI systems.
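Scoring conventions vary more here than in the text-only categories; as one common example, open-ended visual question answering is often scored against several human reference answers. A sketch of that VQA-style soft accuracy, with deliberately simplified answer normalisation:

def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    # An answer counts as fully correct if at least three annotators gave it;
    # fewer matches earn proportional partial credit.
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)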
Data Collection
Data is collected weekly via automated pipelines from public leaderboards, model cards, academic papers, and reproducible evaluation frameworks. All collection scripts are transparent and auditable.
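As a rough illustration of what one weekly collection step might look like (the endpoint URL, field names, and output path below are placeholders, not the project's actual scripts):

import csv
import json
import urllib.request
from datetime import date

# Placeholder endpoint; the real sources and schemas live in the repository's collection scripts.
LEADERBOARD_URL = "https://example.com/leaderboard.json"

def fetch_scores(url: str = LEADERBOARD_URL) -> list[dict]:
    # Download a public leaderboard snapshot and keep only the fields we track.
    with urllib.request.urlopen(url) as resp:
        rows = json.load(resp)
    return [
        {
            "model": row["model"],
            "benchmark": row["benchmark"],
            "score": row["score"],
            "retrieved": date.today().isoformat(),
        }
        for row in rows
    ]

def write_snapshot(rows: list[dict], path: str = "snapshots.csv") -> None:
    # Append this week's snapshot so every historical run stays auditable.
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "benchmark", "score", "retrieved"])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    write_snapshot(fetch_scores())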