# Leaderboard
Compare model performance across GMP knowledge and task completion benchmarks. Click a model name to view detailed results.
| # | Model | Overall | Knowledge QA | Task Completion | Avg Latency | Total Tokens | # Evals |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 95.5% | 100.0% | 90.9% | 32.6s | 69k | 39 |
| 2 | GPT-5.4 | 94.2% | 100.0% | 88.4% | 8.3s | 27k | 39 |
| 3 | Claude Sonnet 4.6 | 94.2% | 100.0% | 88.3% | 27.6s | 73k | 39 |
| 4 | Claude Haiku 4.5 | 93.0% | 100.0% | 86.1% | 10.3s | 56k | 39 |
| 5 | DeepSeek-R1 | 92.7% | 97.1% | 88.2% | 19.8s | 33k | 39 |
| 6 | Gemini 3.1 Pro | 91.2% | 97.1% | 85.2% | 22.8s | 54k | 39 |
| 7 | GPT-5.4 mini | 90.5% | 100.0% | 81.0% | 2.6s | 21k | 39 |
| 8 | DeepSeek-V3.2 | 90.4% | 100.0% | 80.8% | 20.1s | 23k | 39 |
| 9 | Mistral Large 3 675B | 90.1% | 100.0% | 80.2% | 10.2s | 30k | 39 |
| 10 | Gemini 3 Flash | 87.5% | 100.0% | 75.0% | 3.9s | 21k | 39 |
| 11 | Mistral Small 2603 | 87.1% | 100.0% | 74.2% | 3.1s | 26k | 39 |
| 12 | GPT-5.4 nano | 86.1% | 94.3% | 77.8% | 2.8s | 23k | 39 |
| 13 | Llama 4 Maverick | 84.1% | 100.0% | 68.3% | 15.2s | 25k | 39 |
| 14 | Gemini 3.1 Flash-Lite | 78.7% | 97.1% | 60.3% | 2.4s | 17k | 39 |
| 15 | DeepSeek-R1-Distill-Qwen-32B | 77.8% | 100.0% | 55.6% | 45.6s | 33k | 39 |
| 16 | Llama 3.3 70B Instruct | 77.7% | 94.3% | 61.2% | 60.6s | 23k | 39 |
| 17 | Qwen3.5-397B-A17B | 77.4% | 65.7% | 89.2% | 30.8s | 81k | 39 |
| 18 | Llama 4 Scout | 75.0% | 97.1% | 52.8% | 4.5s | 23k | 39 |
| 19 | Qwen3.5-35B-A3B | 70.9% | 54.3% | 87.5% | 30.1s | 174k | 39 |
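Across the table, the Overall score appears to be the unweighted mean of the Knowledge QA and Task Completion scores (an observed pattern in the published numbers, not a documented scoring rule). A minimal sketch that checks this for a few representative rows:

```python
# Observed pattern: Overall ≈ (Knowledge QA + Task Completion) / 2.
# This is inferred from the leaderboard values, not a stated formula;
# a 0.05-point tolerance absorbs rounding to one decimal place.
rows = [
    # (model, knowledge_qa, task_completion, reported_overall)
    ("Claude Opus 4.6", 100.0, 90.9, 95.5),
    ("DeepSeek-R1", 97.1, 88.2, 92.7),
    ("Qwen3.5-35B-A3B", 54.3, 87.5, 70.9),
]

for name, qa, task, overall in rows:
    computed = (qa + task) / 2
    assert abs(computed - overall) <= 0.05, f"{name}: {computed} vs {overall}"

print("Overall column matches the unweighted mean for all checked rows")
```

Under this reading, latency and token usage do not factor into the ranking; they are reported for context only.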