Model Leaderboard

Scores are avg@3: each task is run 3 independent times, and the per-run scores are averaged across all 100 ClawMark tasks.

| # | Model | Developer | License | Score (Avg@3) |
|---|-------|-----------|---------|---------------|
| 1 | GPT-5.4 | OpenAI | Closed source | 56.7% |
| 2 | Claude Sonnet 4.6 | Anthropic | Closed source | 53.1% |
| 3 | Qwen3.6 Plus | Alibaba | Open source | 48.7% |
| 4 | Gemini 3.1 Pro Preview | Google | Closed source | 38.9% |
| 5 | MiniMax M2.7 | MiniMax | Open source | 33.0% |

Per-Domain Scores

Mean per-task score (avg@3) in each role domain; a small aggregation sketch follows the table.

| Model | Clinical | Content Ops | E-Commerce | EDA | Exec Assistant | HR | Insurance | Investment | Journalist | Legal | PM | Real Estate | Research |
|-------|----------|-------------|------------|-----|----------------|----|-----------|------------|------------|-------|----|-------------|----------|
| GPT-5.4 | 70.0% | 57.8% | 51.2% | 39.1% | 50.0% | 58.0% | 74.4% | 57.7% | 56.2% | 32.3% | 35.0% | 79.6% | 63.0% |
| Claude Sonnet 4.6 | 65.4% | 50.5% | 49.3% | 43.5% | 39.1% | 49.4% | 77.2% | 33.4% | 52.3% | 47.9% | 45.3% | 49.7% | 68.6% |
| Qwen3.6 Plus | 61.9% | 66.0% | 40.0% | 87.0% | 50.8% | 51.3% | 55.7% | 6.9% | 34.0% | 31.9% | 34.4% | 64.0% | 61.0% |
| Gemini 3.1 Pro Preview | 33.1% | 31.5% | 30.4% | 91.3% | 44.9% | 26.4% | 74.2% | 52.8% | 41.9% | 44.6% | 14.9% | 67.9% | 29.5% |
| MiniMax M2.7 | 48.1% | 27.9% | 13.8% | 8.7% | 29.6% | 35.1% | 63.3% | 60.5% | 20.4% | 23.5% | 25.5% | 49.2% | 29.3% |
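
The per-domain numbers are means of the per-task avg@3 scores within each role domain. Here is a minimal sketch of that grouping, assuming each task result is a (domain, score) pair; this layout is an illustration, not ClawMark's actual data schema.

```python
from collections import defaultdict

def per_domain_means(results: list[tuple[str, float]]) -> dict[str, float]:
    """Mean avg@3 task score per role domain.

    `results` is a hypothetical layout: one (domain, avg@3 score) pair
    per task.
    """
    by_domain: dict[str, list[float]] = defaultdict(list)
    for domain, score in results:
        by_domain[domain].append(score)
    return {d: sum(s) / len(s) for d, s in by_domain.items()}
```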

Each task is run 3 independent times per model. The task score is avg@3: the mean of the 3 per-run scores. The leaderboard score averages this across all 100 tasks, and per-run scores are weighted by rubric check weights.
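
As a concrete illustration of this scoring scheme, here is a minimal sketch assuming each run is graded as a list of (weight, passed) rubric checks. The check representation and function names are assumptions for illustration, not ClawMark's published implementation.

```python
# Sketch of rubric-weighted avg@3 scoring. The (weight, passed) check
# layout is an assumed schema, used only to make the arithmetic concrete.

def run_score(checks: list[tuple[float, bool]]) -> float:
    """Score one run: weighted fraction of rubric checks passed."""
    total = sum(weight for weight, _ in checks)
    earned = sum(weight for weight, passed in checks if passed)
    return earned / total if total else 0.0

def task_avg_at_3(runs: list[list[tuple[float, bool]]]) -> float:
    """avg@3 for one task: mean of the 3 independent per-run scores."""
    assert len(runs) == 3, "each task is run 3 independent times"
    return sum(run_score(run) for run in runs) / 3

def leaderboard_score(tasks: list[list[list[tuple[float, bool]]]]) -> float:
    """Overall score: per-task avg@3 averaged across all tasks (100 here)."""
    return sum(task_avg_at_3(runs) for runs in tasks) / len(tasks)

# Example: one task with two weighted checks, run 3 times.
runs = [
    [(2.0, True), (1.0, False)],   # run 1 -> 2/3
    [(2.0, True), (1.0, True)],    # run 2 -> 3/3
    [(2.0, False), (1.0, True)],   # run 3 -> 1/3
]
print(task_avg_at_3(runs))  # 0.666...
```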