ClawMark
A Living-World Benchmark for Multi-Day, Multimodal Coworker Agents
by ClawMark Team and Evolvent AI
A benchmark for coworker agents that work alongside humans across multiple working days, multiple services, and raw multimodal evidence. It comprises 100 tasks across 13 professional domains, and scoring is fully rule-based, with no LLM-as-judge.
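To make "fully rule-based scoring" concrete, here is a minimal sketch of what a deterministic checker could look like: each task ships a list of machine-checkable rules, and the score is the fraction of rules an agent's output satisfies. The function name, rule schema, and field names are hypothetical illustrations, not ClawMark's actual format.

```python
import re

def score_submission(submission: dict, rules: list[dict]) -> float:
    """Return the fraction of rules the submission satisfies.

    Hypothetical rule schema: each rule names a field and is either an
    exact-match check or a regex check. No model is consulted; the same
    input always yields the same score.
    """
    if not rules:
        return 0.0
    passed = 0
    for rule in rules:
        value = submission.get(rule["field"], "")
        if rule["kind"] == "exact":
            ok = value == rule["expected"]
        elif rule["kind"] == "regex":
            ok = re.search(rule["pattern"], value) is not None
        else:
            ok = False  # unknown rule kinds never pass
        passed += ok
    return passed / len(rules)

# Example: two checks, one passes (exact title matches, regex does not)
rules = [
    {"field": "report_title", "kind": "exact", "expected": "Q3 Summary"},
    {"field": "body", "kind": "regex", "pattern": r"\btotal:\s*\d+"},
]
submission = {"report_title": "Q3 Summary", "body": "no numbers here"}
print(score_submission(submission, rules))  # 0.5
```

The appeal of this style of scoring is reproducibility: a rule either passes or it does not, so leaderboard numbers can be re-derived exactly without a judge model in the loop.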
- 100 benchmark tasks
- 13 task domains
- 5 models evaluated
Models Ranking

| # | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 55.0 |
| 2 | Claude 4.6 Sonnet | 54.9 |
| 3 | Qwen 3.6 Plus | 49.8 |
| 4 | Gemini 3.1 Pro Preview | 39.3 |
| 5 | MiniMax M2.7 | 34.4 |
