EM Excursion Investigation Report
Task Completion · EM Report · medium
Cross-Model Comparison
| Model | Score | Latency | Tokens In | Tokens Out |
|---|---|---|---|---|
| Claude Opus 4.6 | 90.9% | 186.4s | 333 | 8,303 |
| Claude Sonnet 4.6 | 89.6% | 152.4s | 333 | 8,388 |
| Claude Haiku 4.5 | 89.5% | 57.2s | 332 | 6,965 |
| Qwen3.5-397B-A17B | 88.6% | 59.2s | 340 | 4,168 |
| Gemini 3.1 Pro | 87.3% | 46.8s | 322 | 3,903 |
| DeepSeek-V3.2 | 85.9% | 80.0s | 282 | 1,945 |
| DeepSeek-R1 | 83.4% | 54.2s | 282 | 1,965 |
| Qwen3.5-35B-A3B | 83.4% | 30.6s | 342 | 4,056 |
| Gemini 3 Flash | 81.5% | 8.5s | 320 | 1,286 |
| GPT-5.4 | 80.5% | 48.4s | 290 | 3,345 |
| GPT-5.4 mini | 80.5% | 12.3s | 290 | 2,194 |
| Mistral Large 3 675B | 78.1% | 48.9s | 325 | 3,107 |
| GPT-5.4 nano | 75.1% | 18.6s | 290 | 3,105 |
| Mistral Small 2603 | 70.5% | 8.1s | 337 | 1,376 |
| Gemini 3.1 Flash-Lite | 68.9% | 7.0s | 322 | 969 |
| Llama 4 Maverick | 65.8% | 43.2s | 289 | 1,006 |
| DeepSeek-R1-Distill-Qwen-32B | 62.3% | 93.7s | 320 | 1,688 |
| Llama 3.3 70B Instruct | 62.3% | 7.8s | 295 | 881 |
| Llama 4 Scout | 57.0% | 11.2s | 289 | 855 |
Tags
environmental_monitoring, deviation, investigation