Understanding AI Benchmarks
Standardized tests that measure how well AI models perform across different cognitive domains. These benchmarks help compare models objectively.
Massive Multitask Language Understanding. Tests broad academic knowledge across 57 subjects including STEM, humanities, and social sciences. Variants include MMMLU (multilingual) and MMLU-Pro (harder).
Graduate-Level Google-Proof Q&A. Expert-level science questions written so that answers cannot simply be looked up online, testing genuine reasoning ability.
OpenAI’s code generation benchmark. Tests the ability to write correct Python functions from docstrings and function signatures.
American Invitational Mathematics Examination. Competition-level math problems requiring multi-step numerical reasoning.
Massive Multi-discipline Multimodal Understanding. Tests vision-language reasoning across college-level subjects with images.
Instruction Following Evaluation. Measures how precisely models follow specific formatting, length, and structural constraints.
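To make the HumanEval task format above concrete, here is a sketch of what one such problem looks like: the model is given only the signature and docstring and must produce a body that passes hidden unit tests. (The function shown is an illustrative example in the benchmark's style; the completion below is one correct solution, not the only one.)

```python
# HumanEval-style task: the model receives the signature and docstring
# and must generate the function body.
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # One correct completion that the benchmark's unit tests would accept:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```

A model's HumanEval score is the fraction of such problems for which its generated body passes all of the problem's tests.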
Overall Leaderboard
Flagship models from each provider, ranked by average score across all available benchmark categories.
Flagship Model Comparison
Visual comparison of the top flagship models across all six benchmark dimensions.
Overall Average Score
Average performance across all six benchmark dimensions for each provider’s flagship model.
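The averaging described above can be sketched as follows. This is a minimal illustration with made-up scores; benchmarks a model has not been tested on (shown as N/A on the page) are assumed to be excluded from that model's average rather than counted as zero.

```python
# Sketch of the overall-average computation with hypothetical scores.
# Untested benchmarks are marked None (N/A) and skipped.
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "AIME", "MMMU", "IFEval"]

def overall_average(scores):
    """Average a model's scores across the six benchmarks, skipping N/A."""
    tested = [scores[b] for b in BENCHMARKS if scores[b] is not None]
    return sum(tested) / len(tested)

example = {"MMLU": 88.0, "GPQA": 60.0, "HumanEval": 92.0,
           "AIME": 45.0, "MMMU": 70.0, "IFEval": None}
print(overall_average(example))  # averages only the five tested scores
```

Skipping N/A entries keeps untested benchmarks from dragging a model's average down, though it also means models are not always averaged over the same set of tests.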
Radar Comparison: Top 4 Models
Knowledge Scores (MMLU)
Reasoning Scores (GPQA)
Coding Scores (HumanEval)
Explore by Provider
Dive deep into each AI company’s model history, evolution, and benchmark performance over time.
About This Data
Benchmark scores are compiled from official provider reports, technical papers, and independent evaluations. Scores may vary between evaluation runs. Some models have not been tested on all benchmarks (shown as N/A). This page is updated regularly as new models and benchmark results become available.
Benchmark scores measure specific capabilities, not overall model quality for your use case. The best model depends on your specific needs.
Learn How to Prompt These Models
Knowing benchmarks is just the start. Master the art of communicating with AI through 177+ proven prompting techniques.