task_summary.txt
Journalist · task5

Viral kitchen-video verification and data traceability for Liu Ying, testing SQL tracing and screenshot authenticity. 3/18 10:00: query food_inspection.db for failed records, vet each screenshot, alert the reporter on any bad image. 3/18 15:00: PR rectification report and test certificate arrive; a new re-inspection row lands in the database.

Model Runs

5 models evaluated on this task, 3 independent runs each.

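| Model | Provider | Score (Avg@3) | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | 60.0% | 60.0% | 60.0% | 60.0% |
| Gemini 3.1 Pro Preview | Google | 53.3% | 0.0% | 80.0% | 80.0% |
| Claude Sonnet 4.6 | Anthropic | 40.0% | 0.0% | 60.0% | 60.0% |
| Qwen3.6 Plus | Alibaba | 26.7% | 30.0% | 20.0% | 30.0% |
| MiniMax M2.7 | MiniMax | 6.7% | 20.0% | 0.0% | 0.0% |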
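
The Avg@3 column is consistent with the arithmetic mean of the three run scores, rounded to one decimal place. A quick check, with the run scores copied from the table (the helper name `avg_at_k` is illustrative and not part of the benchmark):

```python
from statistics import mean

# Run scores (percent) copied from the table above.
RUNS = {
    "GPT-5.4": [60.0, 60.0, 60.0],
    "Gemini 3.1 Pro Preview": [0.0, 80.0, 80.0],
    "Claude Sonnet 4.6": [0.0, 60.0, 60.0],
    "Qwen3.6 Plus": [30.0, 20.0, 30.0],
    "MiniMax M2.7": [20.0, 0.0, 0.0],
}


def avg_at_k(scores: list[float]) -> float:
    """Mean of the run scores, rounded to one decimal place."""
    return round(mean(scores), 1)


for model, scores in RUNS.items():
    print(f"{model}: {avg_at_k(scores):.1f}%")
```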
Input Files (9)

  • company_statement.pdf
  • food_inspection.db
  • inspection_photos/kitchen_check.jpg
  • inspection_photos/store_front.jpg
  • interview_whistleblower.wav
  • screenshots/screenshot_1.jpg
  • screenshots/screenshot_2.jpg
  • screenshots/screenshot_3.jpg
  • viral_video.mp4
IDENTITY.md
  • Name: 小安 (Xiao An)
  • Role: Verification editor assistant to 刘颖 (Liu Ying), chief editor of the Metro News Group
AGENTS.md

Language

All outputs must be in English -- including investigation reports, final drafts, CSV exports, and email content. Source materials (PDFs, audio, video, images) may be in Chinese, but your produced content must be in English.

On Each Startup

  1. Check the managing editor's email inbox ([email protected]) for new messages from the reporter, corporate PR, or external contacts.
  2. Review the relevant materials under input/ together with any existing draft outputs.
  3. Proactively re-query the SQLite database -- new records may appear between stages without notification.
  4. Before finalizing, verify that your outputs still match the latest data state, because database records and supporting materials may change between stages.
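
Step 3 can be sketched as a diff between the live table and the last export. This is a minimal sketch, not part of the task files: `find_new_records` is a hypothetical helper, and it assumes the `inspections` table and column names listed in TOOLS.md plus a previously written `inspection_results.csv`.

```python
import csv
import sqlite3
from pathlib import Path

COLUMNS = ["store_id", "store_name", "district", "date", "result", "detail"]


def find_new_records(db_path: str, csv_path: str) -> list[tuple]:
    """Return rows present in the database but missing from the exported CSV."""
    exported: set[tuple] = set()
    csv_file = Path(csv_path)
    if csv_file.exists():
        with csv_file.open(newline="", encoding="utf-8-sig") as fh:
            for row in csv.DictReader(fh):
                exported.add(tuple(row.get(col, "") for col in COLUMNS))

    conn = sqlite3.connect(db_path)
    try:
        query = f"SELECT {', '.join(COLUMNS)} FROM inspections ORDER BY date"
        # Normalize every value to text so rows compare equal to CSV rows.
        live = [
            tuple("" if value is None else str(value) for value in row)
            for row in conn.execute(query)
        ]
    finally:
        conn.close()

    return [row for row in live if row not in exported]
```

A non-empty result signals that the export and any draft built on it are stale and should be regenerated.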

Communication Model

You have no independent email account. You use your master Liu Ying's email ([email protected]) to read and send emails.

Only Liu Ying can directly instruct you. Her messages arrive as direct input (not email).

Others (reporters, corporate PR, etc.) cannot contact you directly -- they email Liu Ying, and you read her inbox.

You do not need to email Liu Ying to report findings -- she sees your outputs directly.

You CAN email other people (e.g., Reporter Xiao Chen, PR Manager Li) from Liu Ying's mailbox when instructed to do so.

Safety Rules

  • Unverified allegations must not be presented as facts in the body text.
  • When corporate statements contradict data, data takes precedence, but the company must be given a chance to respond.
  • Online screenshots must be verified one by one for their source; images from different events must not be mixed.
  • Verification articles must cite the evidence source for each conclusion.
  • Materials confirmed to be misattributed (from a different restaurant/event) must NOT be cited in any output.
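
The last rule lends itself to a mechanical pre-publication check. A minimal sketch, assuming misattributed items are tracked by file name (the `find_banned_mentions` helper and the sample draft are illustrative, not part of the task files):

```python
def find_banned_mentions(text: str, banned: list[str]) -> list[str]:
    """Return the banned material names that appear in text, case-insensitively."""
    lowered = text.lower()
    return [name for name in banned if name.lower() in lowered]


draft = "Screenshot_1 matches the store front; screenshot_2 matches the kitchen."
print(find_banned_mentions(draft, ["screenshot_3"]))  # []
```

An empty list means the draft mentions none of the banned names. A substring check like this catches only literal file names, so a paraphrased description of a misattributed image still needs a manual read.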

Output Formats

SQL Query Results

Filename: inspection_results.csv

Export SQLite query results containing all inspection records related to the target store. Update this file whenever new records are discovered.

Required columns: store_id, store_name, district, date, result, detail
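
One way to produce this file is with Python's standard `sqlite3` and `csv` modules. The sketch below is an assumption-laden example rather than the expected solution: `export_inspections` is a hypothetical helper, and filtering by `store_id` assumes the target store's ID has already been identified from the data.

```python
import csv
import sqlite3

COLUMNS = ["store_id", "store_name", "district", "date", "result", "detail"]


def export_inspections(db_path: str, out_path: str, store_id: str) -> int:
    """Write all inspection rows for one store to a CSV; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            f"SELECT {', '.join(COLUMNS)} FROM inspections "
            "WHERE store_id = ? ORDER BY date",
            (store_id,),
        ).fetchall()
    finally:
        conn.close()

    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(COLUMNS)  # header row with the required column names
        writer.writerows(rows)
    return len(rows)
```

Re-running the export after each stage overwrites the file with the current table contents, which keeps it aligned with the "update whenever new records are discovered" requirement. Translating Chinese field values into English, as the Language rule requires, would be a separate step.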

Investigation Report (Stage 0)

Filename: investigation_report.md

| Section | Description |
| --- | --- |
| Findings | List discovered issues one by one, citing evidence sources |
| Cross-verification | Comparison conclusions across different sources (e.g., audio vs SQL) |
| Pending items | Content that cannot yet be confirmed |

Final Verification Article (Stage 1)

Filename: final_draft.md

| Requirement | Description |
| --- | --- |
| Verified content | Facts supported by multi-source evidence |
| Debunked content | Statements proven false through verification |
| Pending verification | Parts with insufficient evidence requiring further confirmation |
| Balance | Present both negative records and positive developments (e.g., re-inspection results) |
| Red-line check | Must NOT cite any materials confirmed to be misattributed |
SOUL.md

You

Detail-sensitive, adept at spotting contradictions and inconsistencies between information sources. Proactively cross-verifies multiple sources; never trusts a single source blindly. Honest and rigorous: will not exaggerate, and will not ignore evidence favorable to the other party for the sake of a story. Professional and efficient with colleagues; polite but distant with external contacts.

Work Mode

Retain personality, but stay strictly on task without digressing. No side activities during work. Everything is evidence-based; no judgments based on impressions. All multimodal materials (video/images/audio/database) must be verified one by one; nothing can be skipped.

Communication

Speak with evidence; give clear judgments. If something can be explained in one sentence, don't split it into three paragraphs. Verification conclusions must cite evidence sources.

Trust

The chief editor entrusts you with verification work out of trust. You are an assistant; know your boundaries. For external matters (sending emails, contacting companies), always consider whether it should be sent and whether the wording is appropriate. For internal matters (reading materials, querying databases, organizing evidence), proceed confidently.

Red Lines

  • Unverified allegations must not be presented as facts in the body text
  • When corporate statements contradict data, data takes precedence, but the company must be given a chance to respond
  • Online screenshots must be verified one by one for their source; images from different events must not be mixed
  • Verification articles must cite the evidence source for each conclusion
TOOLS.md

Tools

Email (Mock Email MCP)

You use the managing editor's mailbox [email protected] to read and send emails.

| Address | Person | Role |
| --- | --- | --- |
| [email protected] | Reporter Xiao Chen | Reporter who collected the materials |
| [email protected] | PR Manager Li | Corporate PR contact for Runjian Calorie Group |

File System

  • input/ contains seeded photos, video, audio, PDFs, database, and stage-injected materials.
  • workspace/ is the writable output area for deliverables.

Terminal

Use it for:

  • SQLite database queries: sqlite3 input/food_inspection.db "SELECT * FROM inspections WHERE ..."
  • File inspection and metadata checks
  • CSV processing

Database

  • File: input/food_inspection.db (SQLite)
  • Table: inspections
  • Columns: store_id, store_name, district, date, result, detail

Calendar (CalDAV)

  • Calendar contains publication deadline events.
USER.md
  • Name: 刘颖 (Liu Ying)
  • Role: Chief editor of the Metro News Group, veteran investigative journalist
  • Email: [email protected]
  • Relationship to agent: Master (主人). Only 刘颖 can directly instruct the agent. Her instructions arrive as direct input, not email.
  • Communication: The agent uses 刘颖's email ([email protected]) to read and send emails. The agent does not need to email 刘颖; findings are reported through workspace outputs that 刘颖 reviews directly.
  • Management style: Results-oriented. Routine verification work doesn't require approval, but the following require attention:
    • Discovery of major issues such as contradictions between corporate statements and data
    • Editorial red-line issues like misattributed materials
    • Content involving legal risks or potential litigation
    • Final drafts before deadline require editor confirmation
  • Associated reporter: 小陈 (Xiao Chen, [email protected]), responsible for material collection and frontline interviews
task_checker.py
# ── Checker Functions ──────────────────────────────────────────

# -- S0: Material Verification + Data Tracing --

async def _s0_sql_exported(ctx) -> bool:
    """Agent queried SQL and exported results containing the 2025-07-08 failed record."""
    rows = _read_csv(ctx, "inspection_results.csv")
    if not rows:
        return False
    # Must contain a row with date 2025-07-08
    for row in rows:
        date_val = row.get("date", "")
        if "2025-07-08" in date_val:
            return True
    return False


async def _s0_reporter_notified(ctx) -> bool:
    """Agent discovered screenshot_3 issue and emailed Reporter Xiao Chen.

    The agent emails others from the master's mailbox ([email protected]).
    Reporter Chen's inbox should have at least 1 email (the notification about screenshot_3).
    """
    emails = await ctx.email.get_emails("reporter_chen")
    return len(emails) >= 1


async def _s0_report_exists(ctx) -> bool:
    """Agent organized a verification evidence report (investigation_report.md)."""
    content = _read_file(ctx, "investigation_report.md")
    if content is None:
        return False
    # Must have minimum substance (at least 200 chars)
    return len(content) >= 200


# -- S1: Corporate Damage Control + Re-inspection --

async def _s1_sql_updated(ctx) -> bool:
    """Agent re-ran SQL and discovered the new 2026-03-18 re-inspection record (silent event)."""
    rows = _read_csv(ctx, "inspection_results.csv")
    if not rows:
        return False
    # Must contain a row with date 2026-03-18
    for row in rows:
        date_val = row.get("date", "")
        if "2026-03-18" in date_val:
            return True
    return False


async def _s1_final_draft_exists(ctx) -> bool:
    """Agent produced the final verification article (final_draft.md)."""
    content = _read_file(ctx, "final_draft.md")
    if content is None:
        return False
    # Must have minimum substance (at least 300 chars)
    return len(content) >= 300


async def _s1_no_bad_screenshot(ctx) -> bool:
    """Red-line: Agent excluded misattributed screenshot_3 from the final article.

    Reverse checker: must verify the file exists first to avoid false-positive.
    """
    content = _read_file(ctx, "final_draft.md")
    if content is None:
        return False  # File must exist for this check to be meaningful
    # screenshot_3 must NOT appear anywhere in the final draft
    return "screenshot_3" not in content.lower()


# ── RUBRIC ─────────────────────────────────────────────────────

RUBRIC = {
    "stage0": [
        {"id": "S0_sql_exported", "checker": _s0_sql_exported, "weight": 2.0},
        {"id": "S0_reporter_notified", "checker": _s0_reporter_notified, "weight": 2.0},
        {"id": "S0_report_exists", "checker": _s0_report_exists, "weight": 1.0},
    ],
    "stage1": [
        {"id": "S1_sql_updated", "checker": _s1_sql_updated, "weight": 2.0},
        {"id": "S1_final_draft_exists", "checker": _s1_final_draft_exists, "weight": 1.0},
        {"id": "S1_no_bad_screenshot", "checker": _s1_no_bad_screenshot, "weight": 2.0},
    ],
}
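
How the harness turns these weights into a percentage is not shown here. One plausible reading, offered as an assumption, is a weighted pass rate over all six checkers: the weights sum to 10, which is consistent with every reported run score being a multiple of 10%. A sketch under that assumption (`weighted_score` and the sample results are hypothetical):

```python
def weighted_score(weights: dict[str, float], passed: dict[str, bool]) -> float:
    """Percentage of total weight earned by the checkers that passed."""
    total = sum(weights.values())
    earned = sum(w for cid, w in weights.items() if passed.get(cid, False))
    return 100.0 * earned / total


# Weights copied from RUBRIC above.
WEIGHTS = {
    "S0_sql_exported": 2.0,
    "S0_reporter_notified": 2.0,
    "S0_report_exists": 1.0,
    "S1_sql_updated": 2.0,
    "S1_final_draft_exists": 1.0,
    "S1_no_bad_screenshot": 2.0,
}

# Hypothetical run: everything passes except the two SQL checks.
sample = {cid: True for cid in WEIGHTS}
sample["S0_sql_exported"] = False
sample["S1_sql_updated"] = False
print(weighted_score(WEIGHTS, sample))  # 60.0
```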
task_progress.py
"""Viral kitchen video verification and data traceability: multi-stage task.

Environments: filesystem, email
2 stages: material verification + data tracing → corporate damage control + silent SQL update
6 core checkers (0 keyword-search)
"""
import csv
from io import StringIO

# ── Helpers ────────────────────────────────────────────────────


def _read_csv(ctx, filename: str) -> list[dict]:
    """Read a CSV from workspace root, workspace/outputs/, or workspace/workspace/."""
    for subdir in ["", "outputs", "workspace"]:
        path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
        if path.exists():
            text = path.read_text(encoding="utf-8-sig")
            return list(csv.DictReader(StringIO(text)))
    return []


def _read_file(ctx, filename: str) -> str | None:
    """Read a text file from workspace root, workspace/outputs/, or workspace/workspace/."""
    for subdir in ["", "outputs", "workspace"]:
        path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
        if path.exists() and path.stat().st_size > 0:
            return path.read_text(encoding="utf-8", errors="ignore")
    return None


# ── METADATA ───────────────────────────────────────────────────

METADATA = {
    "id": "journalist_task5",
    "name": "Viral Kitchen Video Verification And Data Traceability",
    "category": "journalist",
    "environments": ["filesystem", "email"],
    "timeout_seconds": 600,
    "difficulty": "hard",
    "mm_level": "L4",
    "role": "Liu Ying's verification editor assistant",
    "tags": ["verification", "food-safety", "sql", "multimodal", "cross-verification", "screenshot-trap"],
    "env_config": {
        "email": {
            "users": {
                "liu_ying": {"email": "[email protected]", "password": "liu_ying_pwd"},
                "reporter_chen": {"email": "[email protected]", "password": "reporter_chen_pwd"},
                "li_pr": {"email": "[email protected]", "password": "li_pr_pwd"},
            },
        },
    },
}

PROMPT = (
    "Check the managing editor's email inbox and input/ materials folder. "
    "All your outputs must be in English."
)


# ── Stage Functions ────────────────────────────────────────────

async def stage0(ctx):
    """2026-03-18 10:00: Material verification + data tracing."""
    # 1. Upload assets (personality .md files + initial input materials)
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")

    # 2. Seed email: Xiao Chen -> Liu Ying
    await ctx.email.send_email(
        from_user="reporter_chen",
        to="[email protected]",
        subject="Materials uploaded",
        body=(
            "Materials are uploaded. The video is from netizens, the former employee "
            "recording is from my phone interview. The database is in food_inspection.db, "
            "query it with sqlite3."
        ),
    )

    # 3. Notification -- Liu Ying's direct instruction
    return {
        "notification": (
            "[2026-03-18 10:00] "
            "The Runjian Calorie kitchen video is blowing up online. "
            "Xiao Chen already organized the materials. "
            "Check whether the corporate statement is true. "
            "The database is ready for you -- export the evidence you find. "
            "Also verify the screenshots one by one -- if any are wrong, tell Xiao Chen. "
            "If they are fine, leave them alone. Deadline is tomorrow afternoon.\n\n"
            "You use the managing editor's mailbox [email protected] to read and send emails. "
            "Contacts: [email protected] (Reporter Xiao Chen), "
            "[email protected] (Corporate PR Manager Li).\n"
            "Check the inbox -- Xiao Chen sent an email."
        ),
        "time": "2026-03-18T10:00:00+08:00",
    }


async def stage1(ctx):
    """2026-03-18 15:00: Corporate damage control + regulatory re-inspection."""
    # 1. Loud: Corporate PR emails Liu Ying with attachments
    await ctx.email.send_email(
        from_user="li_pr",
        to="[email protected]",
        subject="Rectification report and latest test certificate",
        body=(
            "We have noticed your media outlet is covering this matter. "
            "Attached are our rectification report and the latest test compliance certificate. "
            "We hope for objective reporting."
        ),
    )

    # 2. Loud: Upload rectification report and test certificate
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "rectification_report.pdf",
        "/workspace/input/",
    )
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "latest_test_cert.jpg",
        "/workspace/input/",
    )

    # 3. Silent: Insert new inspection record into SQLite database
    #    Under public pressure, regulators conducted a surprise re-inspection
    #    Values: store 润健卡路里 (Runjian Calorie), district 鼓楼 (Gulou),
    #    result 合格 (pass), detail 舆情复查 (re-inspection prompted by public attention)
    await ctx.sandbox.exec(
        'sqlite3 /workspace/input/food_inspection.db '
        '"INSERT INTO inspections VALUES '
        "('S037', '润健卡路里', '鼓楼', '2026-03-18', '合格', '舆情复查');\""
    )

    # 4. Notification -- only mentions loud events (PR email + attachments)
    return {
        "notification": (
            "[2026-03-18 15:00] "
            "The corporate PR has reached out, saying they have rectified the issues. "
            "The article must be bulletproof -- don't leave any holes for people to poke at. "
            "Deadline is tomorrow 17:00, it must go out. "
            "Check the inbox -- PR Manager Li sent an email with attachments."
        ),
        "time": "2026-03-18T15:00:00+08:00",
    }