Write Equipment Cleaning SOP
Task Completion · SOP · medium
Cross-Model Comparison
| Model | Score | Latency | Tokens In | Tokens Out |
|---|---|---|---|---|
| Claude Opus 4.6 | 93.7% | 232.0s | 194 | 12,123 |
| Claude Haiku 4.5 | 90.6% | 102.9s | 193 | 13,010 |
| GPT-5.4 | 90.3% | 64.4s | 175 | 3,697 |
| Claude Sonnet 4.6 | 90.2% | 269.8s | 194 | 16,000 |
| Qwen3.5-397B-A17B | 90.0% | 55.2s | 190 | 4,491 |
| Qwen3.5-35B-A3B | 88.8% | 30.9s | 190 | 3,884 |
| DeepSeek-R1 | 86.6% | 75.7s | 168 | 2,821 |
| Mistral Large 3 675B | 86.4% | 33.1s | 182 | 1,828 |
| Mistral Small 2603 | 83.1% | 10.9s | 194 | 1,806 |
| Gemini 3.1 Pro | 83.1% | 44.2s | 181 | 3,430 |
| GPT-5.4 nano | 82.5% | 14.3s | 175 | 2,911 |
| DeepSeek-V3.2 | 80.8% | 24.6s | 168 | 2,321 |
| GPT-5.4 mini | 77.5% | 14.4s | 175 | 2,423 |
| Gemini 3.1 Flash-Lite | 70.2% | 6.7s | 181 | 986 |
| Llama 4 Maverick | 68.8% | 33.5s | 180 | 849 |
| Gemini 3 Flash | 68.2% | 9.2s | 179 | 1,328 |
| Llama 3.3 70B Instruct | 65.8% | 28.8s | 183 | 865 |
| Llama 4 Scout | 61.8% | 7.8s | 180 | 697 |
| DeepSeek-R1-Distill-Qwen-32B | 50.2% | 57.8s | 176 | 1,193 |
Tags
sop, cleaning, bioreactor