ClawMark
A Living-World Benchmark for Multi-Day, Multimodal Coworker Agents
by ClawMark Team and Evolvent AI
A benchmark for coworker agents that work alongside humans across multiple working days, multiple services, and raw multimodal evidence. It comprises 100 tasks across 13 professional domains, and scoring is fully rule-based, with no LLM-as-judge.
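To make "fully rule-based scoring" concrete, here is a minimal sketch of what a deterministic checker could look like: each task ships a list of machine-checkable rules, and the score is the fraction of rules an agent's output satisfies. The function name, rule schema, and field names are hypothetical illustrations, not ClawMark's actual format.

```python
import re

def score_submission(submission: dict, rules: list[dict]) -> float:
    """Return the fraction of rules the submission satisfies.

    Hypothetical rule schema: each rule names a field and is either an
    exact-match check or a regex check. No model is consulted; the same
    input always yields the same score.
    """
    if not rules:
        return 0.0
    passed = 0
    for rule in rules:
        value = submission.get(rule["field"], "")
        if rule["kind"] == "exact":
            ok = value == rule["expected"]
        elif rule["kind"] == "regex":
            ok = re.search(rule["pattern"], value) is not None
        else:
            ok = False  # unknown rule kinds never pass
        passed += ok
    return passed / len(rules)

# Example: two checks, one passes (exact title matches, regex does not)
rules = [
    {"field": "report_title", "kind": "exact", "expected": "Q3 Summary"},
    {"field": "body", "kind": "regex", "pattern": r"\btotal:\s*\d+"},
]
submission = {"report_title": "Q3 Summary", "body": "no numbers here"}
print(score_submission(submission, rules))  # 0.5
```

The appeal of this style of scoring is reproducibility: a rule either passes or it does not, so leaderboard numbers can be re-derived exactly without a judge model in the loop.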
- 100 benchmark tasks
- 13 task domains
- 5 models evaluated
Models Ranking

| # | Model | Score |
|---|---|---|
| 1 | GPT-5.4 | 55.0 |
| 2 | Claude 4.6 Sonnet | 54.9 |
| 3 | Qwen 3.6 Plus | 49.8 |
| 4 | Gemini 3.1 Pro Preview | 39.3 |
| 5 | MiniMax M2.7 | 34.4 |
