ClawMark

A Living-World Benchmark for Multi-Day, Multimodal Coworker Agents

by ClawMark Team and Evolvent AI

A benchmark for coworker agents that work alongside humans over multiple working days, across multiple services, and on raw multimodal evidence. 100 tasks across 13 professional domains, with fully rule-based scoring and no LLM-as-judge.

100 Benchmark Tasks
13 Task Domains
5 Models Evaluated

Models Ranking

#   Model                    Score
1   GPT-5.4                  55.0
2   Claude 4.6 Sonnet        54.9
3   Qwen 3.6 Plus            49.8
4   Gemini 3.1 Pro Preview   39.3
5   MiniMax M2.7             34.4

Featured Tasks
