Documentation
Everything you need to run, evaluate, and contribute to ClawMark.
Overview
ClawMark is a multimodal · multi-stage · multi-environment daily-work benchmark for coworker agents. 100 tasks span 13 professional domains (clinical, HR, legal, PM, real estate, research assistant, journalist, insurance, investment analyst, executive assistant, content operation, ecommerce, EDA). Each task simulates 1–3 working days of a real job and stress-tests the model's ability to make continuous decisions across tools, multimodal evidence, and timelines.
Features
- Timeline-driven multi-stage tasks — Each task is built from 1–3 stages, where each stage corresponds to one working day. The agent receives that day's instructions, carries out the work against real tool backends, and only then does the framework advance to the next day.
- Cross-environment tool coordination — Tasks mix filesystem, email (GreenMail), Notion (mock), Google Sheets (mock), and Calendar (Radicale CalDAV) backends, forcing the model to cross-reference and reconcile information across multiple systems.
- Multimodal raw evidence — assets/input/ contains screenshots, photos, PDFs, CSVs, audio, and video. The model has to extract key information directly from raw evidence — there are no pre-digested text summaries.
- Implicit state changes — Environment data mutates between stages (a new email arrives, database rows get updated, files are appended, calendar events shift). The model has to proactively refresh external state rather than just react to the latest instruction.
- Strict rule-based scoring — Every task ships with 10–25 deterministic Python checker functions. Zero LLM-as-judge. Results are 100% reproducible.
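The timeline-driven stage loop described above can be sketched as follows. This is an illustrative driver, not the framework's actual orchestrator: run_task, agent, and the stub stage bodies are all hypothetical; only the stage0/stage1 naming and the notification/time return shape come from the documentation.

```python
import asyncio

async def run_task(stages, agent):
    """Illustrative driver: run one in-universe day per stage, letting the
    agent work before the framework advances to the next day."""
    for stage_fn in stages:
        day = await stage_fn(ctx=None)                 # seeds backends, returns the day's brief
        await agent(day["notification"], day["time"])  # agent works against tools until done

# Minimal stand-ins so the sketch runs end-to-end.
async def stage0(ctx):
    return {"notification": "[Mon 09:00] Triage inbox", "time": "2026-03-16T09:00:00+08:00"}

async def stage1(ctx):
    return {"notification": "[Tue 09:00] Follow up", "time": "2026-03-17T09:00:00+08:00"}

log = []
async def agent(notification, time):
    log.append((time, notification))

asyncio.run(run_task([stage0, stage1], agent))
```

The key property the sketch preserves is strict sequencing: stage1's instructions are never visible while stage0's work is still in progress.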
Quick Start
The API key, base URL, and model name are all read from .env; they never appear on the command line.
Environment Setup
Key fields in your .env file:
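Based on the fields named earlier (API key, base URL, model name), a .env might look like the following. The variable names here are illustrative assumptions; check the repository's own .env template for the authoritative names.

```
OPENAI_API_KEY=sk-...        # illustrative name: your provider API key
OPENAI_BASE_URL=https://...  # illustrative name: the API endpoint
MODEL_NAME=...               # illustrative name: the model identifier
```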
Both the Notion and Google Sheets credential files are git-ignored and have to be bootstrapped locally once. Full step-by-step instructions are in docs/credentials-setup.md.
Project Structure
ClawMark/
├── src/clawmark/   # framework core (orchestrator, task_loader, state managers)
├── docker/         # Dockerfile + docker-compose.yaml
├── configs/        # openclaw.yaml, Google OAuth credentials, etc.
├── skills/         # tool docs injected into the agent container (email / notion / sheets / calendar)
├── tasks/          # 100 benchmark tasks (see Task Layout)
└── tests/          # backend smoke-test scripts
Task Layout
The tasks directory follows a strict two-level structure:
tasks/
└── {domain}/ # one of 13 professional domains
└── task{N}/ # sequentially numbered starting at task1
├── task.py # ★ the only file the runtime loads
├── task_summary.txt # ~50-word human summary (goal + timeline)
├── assets/ # uploaded to /workspace/ at stage0
│ ├── IDENTITY.md / SOUL.md / AGENTS.md / TOOLS.md / USER.md
│ └── input/ # raw multimodal evidence (PDF / image / audio / CSV)
└── inject/ # optional: mid-task file drops
├── stage1/...
    └── stage2/...

The runtime loads task.py and nothing else. task_summary.txt is a display-only helper for browsing and review — it has zero effect on evaluation behavior.
Adding a Task
Create tasks/{domain}/task{N}/task.py:
METADATA = {
"id": "{domain}_task{N}", # must match the folder path
"name": "...",
"category": "{domain}",
"environments": ["filesystem", "email", "notion", "google_sheets"],
"role": "...",
"env_config": {
"email": {"users": {...}},
"google_sheets": {"task_id": "{domain}_task{N}"},
},
}
PROMPT = "one-sentence task framing sent to the model at stage0"
async def stage0(ctx):
# 1) Upload assets/ into /workspace/
await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")
# 2) Seed external backends via ctx.notion / ctx.email / ctx.google_sheets / ctx.calendar
await ctx.notion.create_database(...)
await ctx.email.send_email(...)
# 3) Return the day's instructions plus an in-universe timestamp
return {
"notification": "[Mon 3/16 09:00] Today's priority: ...",
"time": "2026-03-16T09:00:00+08:00",
}
async def stage1(ctx): ...
async def stage2(ctx): ...
# ── Checker Functions ─────────────────────────────────────────────
async def _s0_xxx(ctx): ...
async def _s1_xxx(ctx): ...
RUBRIC = {
"stage0": [{"id": "S0_xxx", "checker": _s0_xxx, "weight": 2.0}, ...],
"stage1": [...],
"final": [...],
}

Tasks are discovered automatically by walking the filesystem — there is no registry to update. The framework uses METADATA["id"] from each task.py as the canonical task ID.
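A RUBRIC in the shape shown above lends itself to a weighted pass rate. The sketch below is illustrative arithmetic under that assumption, not the framework's actual scoring code; score_rubric and the stub checkers are hypothetical.

```python
import asyncio

async def score_rubric(rubric, ctx):
    """Weighted pass rate over all checkers, as a value in [0, 1]."""
    total = earned = 0.0
    for stage_checks in rubric.values():
        for check in stage_checks:
            w = check.get("weight", 1.0)
            total += w
            if await check["checker"](ctx):
                earned += w
    return earned / total if total else 0.0

# Tiny demo rubric with stub checkers.
async def passing(ctx):
    return True

async def failing(ctx):
    return False

demo = {
    "stage0": [{"id": "S0_a", "checker": passing, "weight": 2.0}],
    "final":  [{"id": "F_a",  "checker": failing, "weight": 2.0}],
}
print(asyncio.run(score_rubric(demo, ctx=None)))  # → 0.5
```

Because every checker is deterministic and queries only final environment state, the same rubric evaluated against the same end state always produces the same score.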
Scoring Methodology
Scoring is entirely rule-based — no LLM judge. Each task ships with 10–25 deterministic Python checker functions. Each checker is an async def(ctx) → bool that queries environment state (calendar events, email contents, Notion rows, filesystem files) — never the agent's intermediate steps, only final outcomes.
Results are 100% reproducible.
Metric Definitions
| Metric | Definition |
|---|---|
| avg@3 | Each task is run 3 times independently; the 3 score values are averaged, then averaged again across the 100 tasks. |
| turns / task | Number of assistant messages (model–tool interaction rounds) per task, averaged over 3 runs and then across all 100 tasks. |
| input tokens / task | Sum of prompt tokens across every turn. cacheRead and cacheWrite are merged into input, so the number reflects the total context the model actually had to process. |
| output tokens / task | Sum of completion tokens across every turn. |
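The avg@3 definition above amounts to a mean of per-task means. A quick sketch with made-up scores (two tasks instead of the real 100):

```python
# Hypothetical scores: 2 tasks x 3 independent runs each.
runs = {
    "clinical_task1": [1.0, 0.75, 0.5],
    "hr_task1":       [0.5, 0.25, 0.0],
}

per_task = {t: sum(s) / len(s) for t, s in runs.items()}  # average the 3 runs of each task
avg_at_3 = sum(per_task.values()) / len(per_task)          # then average across tasks

print(per_task)   # {'clinical_task1': 0.75, 'hr_task1': 0.25}
print(avg_at_3)   # 0.5
```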
Results Format
Every run writes its output into results/<task_id>/:
results/content_operation_task1/
├── result.json      # weighted score + per-checker pass/fail + execution time
├── messages.jsonl   # full agent conversation + tool-call trace, one message per line
└── workspace/       # final snapshot of the agent's working directory
Top-level fields in result.json: task_id, score (0–1), execution_time, stages, rubric.
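Given those fields and the per-task directory layout, aggregating scores across a results/ directory could look like this sketch. The task_id and score fields are as documented; mean_score and the synthetic demo files are illustrative.

```python
import json
import pathlib
import tempfile

def mean_score(results_dir):
    """Average the top-level `score` field across every <task_id>/result.json."""
    scores = [json.loads(p.read_text())["score"]
              for p in pathlib.Path(results_dir).glob("*/result.json")]
    return sum(scores) / len(scores) if scores else 0.0

# Demo with two synthetic result files in a throwaway directory.
root = pathlib.Path(tempfile.mkdtemp())
for task_id, score in [("content_operation_task1", 1.0), ("hr_task2", 0.5)]:
    d = root / task_id
    d.mkdir()
    (d / "result.json").write_text(json.dumps({"task_id": task_id, "score": score}))

print(mean_score(root))  # → 0.75
```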
Backend Smoke Tests
Before running real tasks you can verify each mock backend in isolation:
# Start local services (the runtime launches its own isolated copies later)
docker compose -f docker/docker-compose.yaml up -d
uv run python tests/test_email_lifecycle.py
uv run python tests/test_calendar_lifecycle.py
uv run python tests/test_notion_lifecycle.py
uv run python tests/test_google_sheets_lifecycle.py
uv run python tests/test_google_sheets_full.py  # full round-trip including a real model call
docker compose -f docker/docker-compose.yaml down
