Model Leaderboard
Scores are avg@3: each task is run 3 independent times, and the results are averaged across 100 ClawMark tasks.
| # | Model | Provider | License | Score (Avg@3) |
|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | Closed Source | 56.7% |
| 2 | Claude Sonnet 4.6 | Anthropic | Closed Source | 53.1% |
| 3 | Qwen3.6 Plus | Alibaba | Open Source | 48.7% |
| 4 | Gemini 3.1 Pro Preview | Google | Closed Source | 38.9% |
| 5 | MiniMax M2.7 | MiniMax | Open Source | 33.0% |
Per-Domain Scores
Mean task score in each role domain.
| Model | Clinical | Content Ops | E-Commerce | EDA | Exec Assistant | HR | Insurance | Investment | Journalist | Legal | PM | Real Estate | Research |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 70.0% | 57.8% | 51.2% | 39.1% | 50.0% | 58.0% | 74.4% | 57.7% | 56.2% | 32.3% | 35.0% | 79.6% | 63.0% |
| Claude Sonnet 4.6 | 65.4% | 50.5% | 49.3% | 43.5% | 39.1% | 49.4% | 77.2% | 33.4% | 52.3% | 47.9% | 45.3% | 49.7% | 68.6% |
| Qwen3.6 Plus | 61.9% | 66.0% | 40.0% | 87.0% | 50.8% | 51.3% | 55.7% | 6.9% | 34.0% | 31.9% | 34.4% | 64.0% | 61.0% |
| Gemini 3.1 Pro Preview | 33.1% | 31.5% | 30.4% | 91.3% | 44.9% | 26.4% | 74.2% | 52.8% | 41.9% | 44.6% | 14.9% | 67.9% | 29.5% |
| MiniMax M2.7 | 48.1% | 27.9% | 13.8% | 8.7% | 29.6% | 35.1% | 63.3% | 60.5% | 20.4% | 23.5% | 25.5% | 49.2% | 29.3% |
Each task is run 3 independent times per model. The score is avg@3: the mean of the 3 per-run scores, averaged across all 100 tasks. Scores are weighted by the rubric check weights.
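The avg@3 arithmetic can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: it assumes a per-run score is the weighted fraction of rubric checks passed, and the function names (`run_score`, `avg_at_k`) and the sample data are hypothetical.

```python
from statistics import mean


def run_score(checks):
    """Score one run as the weighted fraction of rubric checks passed.

    `checks` is a list of (weight, passed) pairs. This weighting rule is
    an assumption; the page only says scores are weighted by check weights.
    """
    total = sum(weight for weight, _ in checks)
    if total == 0:
        return 0.0
    return sum(weight for weight, passed in checks if passed) / total


def avg_at_k(runs_per_task):
    """Average the per-run scores within each task, then across tasks."""
    return mean(
        mean(run_score(run) for run in runs) for runs in runs_per_task
    )


# Hypothetical data: 2 tasks, 3 independent runs each.
tasks = [
    [  # task A
        [(2, True), (1, False), (1, True)],   # 3/4 of the weight passed
        [(2, True), (1, True), (1, True)],    # all checks passed
        [(2, False), (1, False), (1, True)],  # 1/4 of the weight passed
    ],
    [  # task B
        [(1, True), (3, False)],   # 1/4 of the weight passed
        [(1, True), (3, False)],   # 1/4 of the weight passed
        [(1, False), (3, False)],  # nothing passed
    ],
]

score = avg_at_k(tasks)
print(f"{score:.1%}")  # prints 41.7%
```

Because every task has the same number of runs, averaging within tasks first and then across tasks gives the same result as averaging all per-run scores at once.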
