Model Leaderboard

Scores are avg@3: each task is run 3 independent times, and the per-run scores are averaged across all 100 ClawMark tasks.

| # | Model | Developer | License | Score (Avg@3) |
|---|-------|-----------|---------|---------------|
| 1 | GPT-5.4 | OpenAI | Closed source | 56.7% |
| 2 | Claude Sonnet 4.6 | Anthropic | Closed source | 53.1% |
| 3 | Qwen3.6 Plus | Alibaba | Open source | 48.7% |
| 4 | Gemini 3.1 Pro Preview | Google | Closed source | 38.9% |
| 5 | MiniMax M2.7 | MiniMax | Open source | 33.0% |

Per-Domain Scores

Mean per-task score (avg@3) in each role domain; a small aggregation sketch follows the table.

| Model | Clinical | Content Ops | E-Commerce | EDA | Exec Assistant | HR | Insurance | Investment | Journalist | Legal | PM | Real Estate | Research |
|-------|----------|-------------|------------|-----|----------------|----|-----------|------------|------------|-------|----|-------------|----------|
| GPT-5.4 | 70.0% | 57.8% | 51.2% | 39.1% | 50.0% | 58.0% | 74.4% | 57.7% | 56.2% | 32.3% | 35.0% | 79.6% | 63.0% |
| Claude Sonnet 4.6 | 65.4% | 50.5% | 49.3% | 43.5% | 39.1% | 49.4% | 77.2% | 33.4% | 52.3% | 47.9% | 45.3% | 49.7% | 68.6% |
| Qwen3.6 Plus | 61.9% | 66.0% | 40.0% | 87.0% | 50.8% | 51.3% | 55.7% | 6.9% | 34.0% | 31.9% | 34.4% | 64.0% | 61.0% |
| Gemini 3.1 Pro Preview | 33.1% | 31.5% | 30.4% | 91.3% | 44.9% | 26.4% | 74.2% | 52.8% | 41.9% | 44.6% | 14.9% | 67.9% | 29.5% |
| MiniMax M2.7 | 48.1% | 27.9% | 13.8% | 8.7% | 29.6% | 35.1% | 63.3% | 60.5% | 20.4% | 23.5% | 25.5% | 49.2% | 29.3% |
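
The per-domain numbers are means of the per-task avg@3 scores within each role domain. Here is a minimal sketch of that grouping, assuming each task result is a (domain, score) pair; this layout is an illustration, not ClawMark's actual data schema.

```python
from collections import defaultdict

def per_domain_means(results: list[tuple[str, float]]) -> dict[str, float]:
    """Mean avg@3 task score per role domain.

    `results` is a hypothetical layout: one (domain, avg@3 score) pair
    per task.
    """
    by_domain: dict[str, list[float]] = defaultdict(list)
    for domain, score in results:
        by_domain[domain].append(score)
    return {d: sum(s) / len(s) for d, s in by_domain.items()}
```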

Each task is run 3 independent times per model. The task score is avg@3: the mean of the 3 per-run scores. The leaderboard score averages this across all 100 tasks, and per-run scores are weighted by rubric check weights.
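
As a concrete illustration of this scoring scheme, here is a minimal sketch assuming each run is graded as a list of (weight, passed) rubric checks. The check representation and function names are assumptions for illustration, not ClawMark's published implementation.

```python
# Sketch of rubric-weighted avg@3 scoring. The (weight, passed) check
# layout is an assumed schema, used only to make the arithmetic concrete.

def run_score(checks: list[tuple[float, bool]]) -> float:
    """Score one run: weighted fraction of rubric checks passed."""
    total = sum(weight for weight, _ in checks)
    earned = sum(weight for weight, passed in checks if passed)
    return earned / total if total else 0.0

def task_avg_at_3(runs: list[list[tuple[float, bool]]]) -> float:
    """avg@3 for one task: mean of the 3 independent per-run scores."""
    assert len(runs) == 3, "each task is run 3 independent times"
    return sum(run_score(run) for run in runs) / 3

def leaderboard_score(tasks: list[list[list[tuple[float, bool]]]]) -> float:
    """Overall score: per-task avg@3 averaged across all tasks (100 here)."""
    return sum(task_avg_at_3(runs) for runs in tasks) / len(tasks)

# Example: one task with two weighted checks, run 3 times.
runs = [
    [(2.0, True), (1.0, False)],   # run 1 -> 2/3
    [(2.0, True), (1.0, True)],    # run 2 -> 3/3
    [(2.0, False), (1.0, True)],   # run 3 -> 1/3
]
print(task_avg_at_3(runs))  # 0.666...
```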