Documentation

Everything you need to run, evaluate, and contribute to ClawMark.

Overview

ClawMark is a multimodal · multi-stage · multi-environment daily-work benchmark for coworker agents. 100 tasks span 13 professional domains (clinical, HR, legal, PM, real estate, research assistant, journalist, insurance, investment analyst, executive assistant, content operation, ecommerce, EDA). Each task simulates 1–3 working days of a real job and stress-tests the model's ability to make continuous decisions across tools, multimodal evidence, and timelines.

Features

  • Timeline-driven multi-stage tasks — Each task is built from 1–3 stages, where each stage corresponds to one working day. The agent receives that day's instructions, carries out the work against real tool backends, and only then does the framework advance to the next day.
  • Cross-environment tool coordination — Tasks mix filesystem, email (GreenMail), Notion (mock), Google Sheets (mock), and Calendar (Radicale CalDAV) backends, forcing the model to cross-reference and reconcile information across multiple systems.
  • Multimodal raw evidence — assets/input/ contains screenshots, photos, PDFs, CSVs, audio, and video. The model has to extract key information directly from raw evidence — there are no pre-digested text summaries.
  • Implicit state changes — Environment data mutates between stages (new email arrives, database rows get updated, files are appended, calendar events shift). The model has to proactively refresh external state rather than just react to the latest instruction.
  • Strict rule-based scoring — Every task ships with 10–25 deterministic Python checker functions. Zero LLM-as-judge. Results are 100% reproducible.
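The timeline mechanics described above can be pictured as a minimal driver loop. This is a simplified illustration only, not the framework's actual orchestrator — the stage and agent interfaces here are assumptions:

```python
import asyncio

async def run_task(stages, agent):
    """Advance through stages (working days) one at a time.

    Each stage function seeds the environment and returns that day's
    notification; the agent then works until it yields control, and
    only afterwards does the loop move to the next day.
    """
    transcript = []
    for day, stage in enumerate(stages):
        briefing = await stage()        # seed backends, get the day's brief
        transcript.append((day, briefing["notification"]))
        await agent(briefing)           # agent acts against live tools
    return transcript

# Stub stages and a no-op agent, purely to show the control flow.
async def stage0():
    return {"notification": "[Mon 09:00] Kick off the report."}

async def stage1():
    return {"notification": "[Tue 09:00] Incorporate the new email."}

async def noop_agent(briefing):
    pass

transcript = asyncio.run(run_task([stage0, stage1], noop_agent))
```

The key property the sketch preserves: a later day's instructions are never visible to the agent until the earlier day's work has completed.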

Quick Start

# 1. Clone and install
git clone https://github.com/evolvent-ai/ClawMark
cd ClawMark && uv sync
cp .env.example .env # then fill in your API keys
# 2. Build the Docker image
docker build -t clawmark-main:latest -f docker/Dockerfile docker/
# 3. Run a single task
uv run clawmark --task tasks/content_operation/task1
# Dry-run — no Docker, just exercise task.py and the framework plumbing
uv run clawmark --task tasks/content_operation/task1 --dry-run
# Run every task in one domain
uv run clawmark --tasks-dir tasks/content_operation
# Run the full 100-task suite
for domain in tasks/*/; do uv run clawmark --tasks-dir "$domain"; done

The API key, base URL, and model name are all read from .env; they never appear on the command line.

Environment Setup

Key fields in your .env file:

# Model API
ANTHROPIC_API_KEY=sk-...
ANTHROPIC_API_BASE=https://api.anthropic.com
MODEL=claude-sonnet-4-5-20250929
API_FORMAT=anthropic # or "openrouter" for OpenRouter-compatible endpoints
# Notion (required for tasks that use the notion environment)
NOTION_ADMIN_KEY=ntn_...
NOTION_AGENT_KEY=ntn_...
NOTION_SOURCE_PAGE=ClawMark Source Hub
NOTION_EVAL_PAGE=ClawMark Eval Hub
# Google Sheets (required for tasks that use the google_sheets environment)
GOOGLE_CREDENTIALS_PATH=configs/google_credentials.json

Both the Notion and Google Sheets credential files are git-ignored and have to be bootstrapped locally once. Full step-by-step instructions are in docs/credentials-setup.md.
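To fail fast on missing configuration before a run, a small pre-flight check can help. The variable names follow the .env listing above; `validate_env` itself is a hypothetical helper, not part of the framework:

```python
import os

REQUIRED = ["ANTHROPIC_API_KEY", "ANTHROPIC_API_BASE", "MODEL", "API_FORMAT"]

def validate_env(env=os.environ):
    """Return the required keys that are missing or empty, in order."""
    return [key for key in REQUIRED if not env.get(key)]

# Example: a partially filled environment reports what is still missing.
missing = validate_env({"ANTHROPIC_API_KEY": "sk-test", "MODEL": "m"})
```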

Project Structure

ClawMark/
├── src/clawmark/        # framework core (orchestrator, task_loader, state managers)
├── docker/              # Dockerfile + docker-compose.yaml
├── configs/             # openclaw.yaml, Google OAuth credentials, etc.
├── skills/              # tool docs injected into the agent container (email / notion / sheets / calendar)
├── tasks/               # 100 benchmark tasks (see Task Layout)
└── tests/               # backend smoke-test scripts

Task Layout

The tasks directory follows a strict two-level structure:

tasks/
└── {domain}/                   # one of 13 professional domains
    └── task{N}/                # sequentially numbered starting at task1
        ├── task.py                  # ★ the only file the runtime loads
        ├── task_summary.txt         # ~50-word human summary (goal + timeline)
        ├── assets/                  # uploaded to /workspace/ at stage0
        │   ├── IDENTITY.md / SOUL.md / AGENTS.md / TOOLS.md / USER.md
        │   └── input/               # raw multimodal evidence (PDF / image / audio / CSV)
        └── inject/                  # optional: mid-task file drops
            ├── stage1/...
            └── stage2/...

The runtime loads task.py and nothing else. task_summary.txt is a display-only helper for browsing and review — it has zero effect on evaluation behavior.

Adding a Task

Create tasks/{domain}/task{N}/task.py:

METADATA = {
    "id": "{domain}_task{N}",                # must match the folder path
    "name": "...",
    "category": "{domain}",
    "environments": ["filesystem", "email", "notion", "google_sheets"],
    "role": "...",
    "env_config": {
        "email": {"users": {...}},
        "google_sheets": {"task_id": "{domain}_task{N}"},
    },
}

PROMPT = "one-sentence task framing sent to the model at stage0"


async def stage0(ctx):
    # 1) Upload assets/ into /workspace/
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")
    # 2) Seed external backends via ctx.notion / ctx.email / ctx.google_sheets / ctx.calendar
    await ctx.notion.create_database(...)
    await ctx.email.send_email(...)
    # 3) Return the day's instructions plus an in-universe timestamp
    return {
        "notification": "[Mon 3/16 09:00] Today's priority: ...",
        "time": "2026-03-16T09:00:00+08:00",
    }


async def stage1(ctx): ...
async def stage2(ctx): ...


# ── Checker Functions ─────────────────────────────────────────────

async def _s0_xxx(ctx): ...
async def _s1_xxx(ctx): ...


RUBRIC = {
    "stage0": [{"id": "S0_xxx", "checker": _s0_xxx, "weight": 2.0}, ...],
    "stage1": [...],
    "final":  [...],
}
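As a concrete illustration of the checker shape, here is a hedged sketch: an async checker that passes if a confirmation email was sent. The `ctx.email.list_sent()` call and the message fields are assumptions for illustration; real checkers use whatever the framework's ctx actually exposes:

```python
import asyncio

async def _s0_confirmation_sent(ctx):
    """Pass if any sent email mentions 'confirmed' in its subject.

    Checkers inspect final environment state only — never the agent's
    intermediate steps — and must return a plain bool.
    """
    sent = await ctx.email.list_sent()      # assumed API shape
    return any("confirmed" in msg["subject"].lower() for msg in sent)

# Minimal stub ctx, only to demonstrate the call pattern.
class _StubEmail:
    async def list_sent(self):
        return [{"subject": "Booking Confirmed", "to": "hr@example.com"}]

class _StubCtx:
    email = _StubEmail()

result = asyncio.run(_s0_confirmation_sent(_StubCtx()))
```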

Tasks are discovered automatically by walking the filesystem — there is no registry to update. The framework uses METADATA["id"] from each task.py as the canonical task ID.
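Discovery can be pictured as a plain filesystem walk. A self-contained sketch — the runpy-based loading is an assumption, and the real loader may import task.py differently:

```python
import runpy
import tempfile
from pathlib import Path

def discover_tasks(root: Path):
    """Yield (task_id, task_dir) for every tasks/{domain}/task{N}/task.py,
    taking the canonical ID from METADATA['id']."""
    for task_py in sorted(root.glob("*/task*/task.py")):
        meta = runpy.run_path(str(task_py))["METADATA"]
        yield meta["id"], task_py.parent

# Demo against a throwaway tasks tree.
root = Path(tempfile.mkdtemp())
task_dir = root / "content_operation" / "task1"
task_dir.mkdir(parents=True)
(task_dir / "task.py").write_text(
    'METADATA = {"id": "content_operation_task1"}\n'
)
found = list(discover_tasks(root))
```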

Scoring Methodology

Scoring is entirely rule-based — no LLM judge. Each task ships with 10–25 deterministic Python checker functions. Each checker is an async def(ctx) → bool that queries environment state (calendar events, email contents, Notion rows, filesystem files) — never the agent's intermediate steps, only final outcomes.

score = Σ(passed_weight) / Σ(total_weight)

Results are 100% reproducible.
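The formula is a straightforward weighted ratio; as a sketch (the checker-result shape here is an assumption):

```python
def weighted_score(results):
    """results: list of (passed: bool, weight: float), one per checker."""
    total = sum(w for _, w in results)
    passed = sum(w for ok, w in results if ok)
    return passed / total if total else 0.0

# Three checkers: the weight-2.0 and one weight-1.0 checker pass → 3.0 / 4.0.
score = weighted_score([(True, 2.0), (True, 1.0), (False, 1.0)])
```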

Metric Definitions

  • avg@3 — Each task is run 3 times independently; the 3 score values are averaged, then averaged again across the 100 tasks.
  • turns / task — Number of assistant messages (model–tool interaction rounds) per task, averaged over 3 runs and then across all 100 tasks.
  • input tokens / task — Sum of prompt tokens across every turn. cacheRead and cacheWrite are merged into input, so the number reflects the total context the model actually had to process.
  • output tokens / task — Sum of completion tokens across every turn.
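The avg@3 aggregation can be written down directly (the per-task run scores below are illustrative):

```python
from statistics import mean

def avg_at_3(runs_per_task):
    """runs_per_task: {task_id: [score_run1, score_run2, score_run3]}.
    Average within each task first, then across tasks."""
    return mean(mean(scores) for scores in runs_per_task.values())

benchmark_score = avg_at_3({
    "content_operation_task1": [1.0, 0.8, 0.9],   # task mean 0.9
    "clinical_task1": [0.5, 0.5, 0.5],            # task mean 0.5
})
```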

Results Format

Every run writes its output into results/<task_id>/:

results/content_operation_task1/
  result.json        # weighted score + per-checker pass/fail + execution time
  messages.jsonl     # full agent conversation + tool-call trace, one message per line
  workspace/         # final snapshot of the agent's working directory

Top-level fields in result.json: task_id, score (0–1), execution_time, stages, rubric.
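Given a parsed messages.jsonl trace, the per-task usage metrics can be tallied along these lines. The usage field names here are assumptions for illustration; match them to whatever your provider actually emits:

```python
def tally_usage(messages):
    """Count assistant turns and merge cacheRead/cacheWrite into input tokens."""
    turns = sum(1 for m in messages if m.get("role") == "assistant")
    usage = [m.get("usage", {}) for m in messages]
    input_tokens = sum(
        u.get("input", 0) + u.get("cacheRead", 0) + u.get("cacheWrite", 0)
        for u in usage
    )
    output_tokens = sum(u.get("output", 0) for u in usage)
    return turns, input_tokens, output_tokens

turns, inp, out = tally_usage([
    {"role": "assistant", "usage": {"input": 100, "cacheRead": 40, "output": 20}},
    {"role": "tool"},
    {"role": "assistant", "usage": {"input": 60, "cacheWrite": 10, "output": 5}},
])
```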

Backend Smoke Tests

Before running real tasks you can verify each mock backend in isolation:

# Start local services (the runtime launches its own isolated copies later)
docker compose -f docker/docker-compose.yaml up -d

uv run python tests/test_email_lifecycle.py
uv run python tests/test_calendar_lifecycle.py
uv run python tests/test_notion_lifecycle.py
uv run python tests/test_google_sheets_lifecycle.py
uv run python tests/test_google_sheets_full.py   # full round-trip including a real model call

docker compose -f docker/docker-compose.yaml down