GMP Bench

Leaderboard

Compare model performance across the GMP Knowledge QA and Task Completion benchmarks. Click a model name to view detailed results.

| # | Model | Overall | Knowledge QA | Task Completion | Avg Latency | Total Tokens | # Evals |
|---|-------|---------|--------------|-----------------|-------------|--------------|---------|
| 1 | Claude Opus 4.6 | 95.5% | 100.0% | 90.9% | 32.6s | 69k | 39 |
| 2 | GPT-5.4 | 94.2% | 100.0% | 88.4% | 8.3s | 27k | 39 |
| 3 | Claude Sonnet 4.6 | 94.2% | 100.0% | 88.3% | 27.6s | 73k | 39 |
| 4 | Claude Haiku 4.5 | 93.0% | 100.0% | 86.1% | 10.3s | 56k | 39 |
| 5 | DeepSeek-R1 | 92.7% | 97.1% | 88.2% | 19.8s | 33k | 39 |
| 6 | Gemini 3.1 Pro | 91.2% | 97.1% | 85.2% | 22.8s | 54k | 39 |
| 7 | GPT-5.4 mini | 90.5% | 100.0% | 81.0% | 2.6s | 21k | 39 |
| 8 | DeepSeek-V3.2 | 90.4% | 100.0% | 80.8% | 20.1s | 23k | 39 |
| 9 | Mistral Large 3 675B | 90.1% | 100.0% | 80.2% | 10.2s | 30k | 39 |
| 10 | Gemini 3 Flash | 87.5% | 100.0% | 75.0% | 3.9s | 21k | 39 |
| 11 | Mistral Small 2603 | 87.1% | 100.0% | 74.2% | 3.1s | 26k | 39 |
| 12 | GPT-5.4 nano | 86.1% | 94.3% | 77.8% | 2.8s | 23k | 39 |
| 13 | Llama 4 Maverick | 84.1% | 100.0% | 68.3% | 15.2s | 25k | 39 |
| 14 | Gemini 3.1 Flash-Lite | 78.7% | 97.1% | 60.3% | 2.4s | 17k | 39 |
| 15 | DeepSeek-R1-Distill-Qwen-32B | 77.8% | 100.0% | 55.6% | 45.6s | 33k | 39 |
| 16 | Llama 3.3 70B Instruct | 77.7% | 94.3% | 61.2% | 60.6s | 23k | 39 |
| 17 | Qwen3.5-397B-A17B | 77.4% | 65.7% | 89.2% | 30.8s | 81k | 39 |
| 18 | Llama 4 Scout | 75.0% | 97.1% | 52.8% | 4.5s | 23k | 39 |
| 19 | Qwen3.5-35B-A3B | 70.9% | 54.3% | 87.5% | 30.1s | 174k | 39 |
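
A note on the Overall column: the page does not define it, but it matches the unweighted mean of Knowledge QA and Task Completion for every row above, to within ±0.1% of display rounding. Below is a minimal sketch of that aggregation under that assumption; the function name `overall_score` is illustrative, not part of the bench.

```python
def overall_score(knowledge_qa: float, task_completion: float) -> float:
    """Unweighted mean of the two sub-benchmark scores, in percent.

    Inferred from the leaderboard data: every row's Overall matches
    this average to within +/-0.1% rounding.
    """
    return (knowledge_qa + task_completion) / 2

# Spot-check against row 2 (GPT-5.4): (100.0 + 88.4) / 2 = 94.2, as listed.
print(f"{overall_score(100.0, 88.4):.1f}%")  # -> 94.2%
```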