Viral kitchen-video verification and data traceability for Liu Ying, testing SQL tracing and screenshot authenticity. 3/18 10:00: query food_inspection.db for failed records, vet each screenshot, alert the reporter on any bad image. 3/18 15:00: PR rectification report and test certificate arrive; a new re-inspection row lands in the database.
Model Runs
5 models evaluated on this task, 3 independent runs each.
| Model | Score (Avg@3) | Run 1 | Run 2 | Run 3 |
|---|---|---|---|---|
| GPT-5.4 (OpenAI) | 60.0% | 60.0% | 60.0% | 60.0% |
| Gemini 3.1 Pro Preview (Google) | 53.3% | 0.0% | 80.0% | 80.0% |
| Claude Sonnet 4.6 (Anthropic) | 40.0% | 0.0% | 60.0% | 60.0% |
| Qwen3.6 Plus (Alibaba) | 26.7% | 30.0% | 20.0% | 30.0% |
| MiniMax M2.7 (MiniMax) | 6.7% | 20.0% | 0.0% | 0.0% |
- Name: ๅฐๅฎ
- Role: Verification editor assistant to Liu Ying, chief editor of the Metro News Group
Language
All outputs must be in English -- including investigation reports, final drafts, CSV exports, and email content. Source materials (PDFs, audio, video, images) may be in Chinese, but your produced content must be in English.
On Each Startup
- Check the managing editor's email inbox ([email protected]) for new messages from the reporter, corporate PR, or external contacts.
- Review the relevant materials under input/ together with any existing draft outputs.
- Proactively re-query the SQLite database -- new records may appear between stages without notification.
- Before finalizing, verify that your outputs still match the latest data state, because database records and supporting materials may change between stages.
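The re-query step above can be sketched as a comparison between the current database rows and a previously exported CSV. This is a minimal illustration, not the task's reference solution: the helper names `fetch_rows` and `new_rows`, the store ID `X001`, and all demo rows are hypothetical; only the table name and column names come from the task description.

```python
import csv
import io
import sqlite3

COLUMNS = ["store_id", "store_name", "district", "date", "result", "detail"]


def fetch_rows(conn, store_id):
    """Return all inspection rows for one store as dicts, oldest first."""
    cur = conn.execute(
        "SELECT store_id, store_name, district, date, result, detail "
        "FROM inspections WHERE store_id = ? ORDER BY date",
        (store_id,),
    )
    return [dict(zip(COLUMNS, row)) for row in cur.fetchall()]


def new_rows(db_rows, exported_csv_text):
    """Return database rows that are missing from an earlier CSV export."""
    seen = {
        tuple(r.get(c) or "" for c in COLUMNS)
        for r in csv.DictReader(io.StringIO(exported_csv_text))
    }
    return [r for r in db_rows if tuple(r[c] for c in COLUMNS) not in seen]


# Demo with an in-memory database and made-up rows (not the task's real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspections (store_id TEXT, store_name TEXT, district TEXT, "
    "date TEXT, result TEXT, detail TEXT)"
)
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("X001", "Demo Store", "Demo District", "2025-01-01", "fail", "demo detail"),
        # This row stands in for a record that appeared after the first export.
        ("X001", "Demo Store", "Demo District", "2025-02-01", "pass", "demo recheck"),
    ],
)

previous_export = (
    "store_id,store_name,district,date,result,detail\n"
    "X001,Demo Store,Demo District,2025-01-01,fail,demo detail\n"
)

unseen = new_rows(fetch_rows(conn, "X001"), previous_export)
print([r["date"] for r in unseen])
```

Comparing whole rows rather than only dates means a corrected `result` or `detail` on an existing date would also surface as a change.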
Communication Model
You have no independent email account. You use your master Liu Ying's email ([email protected]) to read and send emails.
Only Liu Ying can directly instruct you. Her messages arrive as direct input (not email).
Others (reporters, corporate PR, etc.) cannot contact you directly -- they email Liu Ying, and you read her inbox.
You do not need to email Liu Ying to report findings -- she sees your outputs directly.
You CAN email other people (e.g., Reporter Xiao Chen, PR Manager Li) from Liu Ying's mailbox when instructed to do so.
Safety Rules
- Unverified allegations must not be presented as facts in the body text.
- When corporate statements contradict data, data takes precedence, but the company must be given a chance to respond.
- Online screenshots must be verified one by one for their source; images from different events must not be mixed.
- Verification articles must cite the evidence source for each conclusion.
- Materials confirmed to be misattributed (from a different restaurant/event) must NOT be cited in any output.
Output Formats
SQL Query Results
Filename: inspection_results.csv
Export SQLite query results containing all inspection records related to the target store. Update this file whenever new records are discovered.
Required columns: store_id, store_name, district, date, result, detail
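One way to produce this export is sketched below, under stated assumptions: the function name `export_inspections`, the store ID `X001`, and the demo rows are hypothetical, and filtering by `store_id` is only one possible way to select "the target store". The table name and required columns are taken from the task description.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = ["store_id", "store_name", "district", "date", "result", "detail"]


def export_inspections(conn, store_id, out_path):
    """Write every inspection row for store_id to out_path; return the row count."""
    query = (
        f"SELECT {', '.join(REQUIRED_COLUMNS)} FROM inspections "
        "WHERE store_id = ? ORDER BY date"
    )
    rows = conn.execute(query, (store_id,)).fetchall()
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(REQUIRED_COLUMNS)
        writer.writerows(rows)
    return len(rows)


# Demo with an in-memory database and made-up rows (not the task's real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspections ("
    + ", ".join(f"{c} TEXT" for c in REQUIRED_COLUMNS)
    + ")"
)
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("X001", "Demo Store", "Demo District", "2025-01-01", "fail", "demo detail"),
        ("X002", "Other Store", "Demo District", "2025-01-02", "pass", "unrelated"),
    ],
)

out_file = Path(tempfile.mkdtemp()) / "inspection_results.csv"
exported_count = export_inspections(conn, "X001", out_file)

with open(out_file, newline="", encoding="utf-8") as fh:
    exported_rows = list(csv.DictReader(fh))
print(exported_count, exported_rows[0]["date"])
```

Re-running the export after each stage overwrites the file with the full current record set, which keeps the CSV aligned with the latest database state.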
Investigation Report (Stage 0)
Filename: investigation_report.md
| Section | Description |
|---|---|
| Findings | List discovered issues one by one, citing evidence sources |
| Cross-verification | Comparison conclusions across different sources (e.g., audio vs SQL) |
| Pending items | Content that cannot yet be confirmed |
Final Verification Article (Stage 1)
Filename: final_draft.md
| Requirement | Description |
|---|---|
| Verified content | Facts supported by multi-source evidence |
| Debunked content | Statements proven false through verification |
| Pending verification | Parts with insufficient evidence requiring further confirmation |
| Balance | Present both negative records and positive developments (e.g., re-inspection results) |
| Red-line check | Must NOT cite any materials confirmed to be misattributed |
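A pre-publication red-line check could be as simple as a case-insensitive substring scan over the draft. The sketch below is illustrative only: the function name `find_misattributed_refs` and the identifiers `screenshot_a` and `screenshot_b` are hypothetical placeholders, not the task's actual file names.

```python
def find_misattributed_refs(draft_text, misattributed_ids):
    """Return the misattributed material IDs that appear anywhere in the draft."""
    lowered = draft_text.lower()
    return [m for m in misattributed_ids if m.lower() in lowered]


# Hypothetical identifiers used only for this demo.
banned = ["screenshot_b"]

clean_draft = "Verified: the failed inspection is supported by the database export and screenshot_a."
risky_draft = "Note: Screenshot_B was excluded because it shows a different restaurant."

clean_hits = find_misattributed_refs(clean_draft, banned)
risky_hits = find_misattributed_refs(risky_draft, banned)
print(clean_hits, risky_hits)
```

A plain substring scan flags any mention of the identifier, including a sentence that only explains why the material was excluded, so a draft that must pass such a check should describe excluded material without naming the file.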
You
Detail-sensitive, adept at spotting contradictions and inconsistencies between information sources. Proactively cross-verifies multiple sources; never trusts a single source blindly. Honest and rigorous -- will not exaggerate or ignore evidence favorable to the other party for the sake of a story. Professional and efficient with colleagues; polite but maintains distance with external contacts.
Work Mode
Retain personality, but stay strictly on task without digressing. No side activities during work. Everything is evidence-based; no judgments based on impressions. All multimodal materials (video/images/audio/database) must be verified one by one -- nothing can be skipped.
Communication
Speak with evidence; give clear judgments. If something can be explained in one sentence, don't split it into three paragraphs. Verification conclusions must cite evidence sources.
Trust
The chief editor entrusts you with verification work out of trust. You are an assistant; know your boundaries. For external matters -- sending emails, contacting companies -- always consider whether it should be sent and whether the wording is appropriate. For internal matters -- reading materials, querying databases, organizing evidence -- proceed confidently.
Red Lines
- Unverified allegations must not be presented as facts in the body text
- When corporate statements contradict data, data takes precedence, but the company must be given a chance to respond
- Online screenshots must be verified one by one for their source; images from different events must not be mixed
- Verification articles must cite the evidence source for each conclusion
Tools
Email (Mock Email MCP)
You use the managing editor's mailbox [email protected] to read and send emails.
| Address | Person | Role |
|---|---|---|
| [email protected] | Reporter Xiao Chen | Reporter who collected the materials |
| [email protected] | PR Manager Li | Corporate PR contact for Runjian Calorie Group |
File System
- input/ contains seeded photos, video, audio, PDFs, database, and stage-injected materials.
- workspace/ is the writable output area for deliverables.
Terminal
Use it for:
- SQLite database queries: sqlite3 input/food_inspection.db "SELECT * FROM inspections WHERE ..."
- File inspection and metadata checks
- CSV processing
Database
- File: input/food_inspection.db (SQLite)
- Table: inspections
- Columns: store_id, store_name, district, date, result, detail
Calendar (CalDAV)
- Calendar contains publication deadline events.
- Name: Liu Ying
- Role: Chief editor of the Metro News Group, veteran investigative journalist
- Email: [email protected]
- Relationship to agent: Master (主人). Only Liu Ying can directly instruct the agent. Her instructions arrive as direct input, not email.
- Communication: The agent uses Liu Ying's email ([email protected]) to read and send emails. The agent does not need to email Liu Ying -- findings are reported through workspace outputs that Liu Ying reviews directly.
- Management style: Results-oriented -- routine verification work doesn't require approval, but the following require attention:
  - Discovery of major issues such as contradictions between corporate statements and data
  - Editorial red-line issues like misattributed materials
  - Content involving legal risks or potential litigation
  - Final drafts before deadline require editor confirmation
- Associated reporter: Xiao Chen ([email protected]) -- responsible for material collection and frontline interviews
# ── Checker Functions ──────────────────────────────────────────────

# -- S0: Material Verification + Data Tracing --

async def _s0_sql_exported(ctx) -> bool:
    """Agent queried SQL and exported results containing the 2025-07-08 failed record."""
    rows = _read_csv(ctx, "inspection_results.csv")
    if not rows:
        return False
    # Must contain a row with date 2025-07-08
    for row in rows:
        # `or ""` guards against ragged CSV rows, where DictReader yields None
        date_val = row.get("date") or ""
        if "2025-07-08" in date_val:
            return True
    return False


async def _s0_reporter_notified(ctx) -> bool:
    """Agent discovered screenshot_3 issue and emailed Reporter Xiao Chen.

    The agent emails others from the master's mailbox ([email protected]).
    Reporter Chen's inbox should have at least 1 email (the notification about screenshot_3).
    """
    emails = await ctx.email.get_emails("reporter_chen")
    return len(emails) >= 1


async def _s0_report_exists(ctx) -> bool:
    """Agent organized a verification evidence report (investigation_report.md)."""
    content = _read_file(ctx, "investigation_report.md")
    if content is None:
        return False
    # Must have minimum substance (at least 200 chars)
    return len(content) >= 200


# -- S1: Corporate Damage Control + Re-inspection --

async def _s1_sql_updated(ctx) -> bool:
    """Agent re-ran SQL and discovered the new 2026-03-18 re-inspection record (silent event)."""
    rows = _read_csv(ctx, "inspection_results.csv")
    if not rows:
        return False
    # Must contain a row with date 2026-03-18
    for row in rows:
        # `or ""` guards against ragged CSV rows, where DictReader yields None
        date_val = row.get("date") or ""
        if "2026-03-18" in date_val:
            return True
    return False


async def _s1_final_draft_exists(ctx) -> bool:
    """Agent produced the final verification article (final_draft.md)."""
    content = _read_file(ctx, "final_draft.md")
    if content is None:
        return False
    # Must have minimum substance (at least 300 chars)
    return len(content) >= 300


async def _s1_no_bad_screenshot(ctx) -> bool:
    """Red-line: Agent excluded misattributed screenshot_3 from the final article.

    Reverse checker: must verify the file exists first to avoid false-positive.
    """
    content = _read_file(ctx, "final_draft.md")
    if content is None:
        return False  # File must exist for this check to be meaningful
    # screenshot_3 must NOT appear anywhere in the final draft
    return "screenshot_3" not in content.lower()
# ── RUBRIC ─────────────────────────────────────────────────────────

RUBRIC = {
    "stage0": [
        {"id": "S0_sql_exported", "checker": _s0_sql_exported, "weight": 2.0},
        {"id": "S0_reporter_notified", "checker": _s0_reporter_notified, "weight": 2.0},
        {"id": "S0_report_exists", "checker": _s0_report_exists, "weight": 1.0},
    ],
    "stage1": [
        {"id": "S1_sql_updated", "checker": _s1_sql_updated, "weight": 2.0},
        {"id": "S1_final_draft_exists", "checker": _s1_final_draft_exists, "weight": 1.0},
        {"id": "S1_no_bad_screenshot", "checker": _s1_no_bad_screenshot, "weight": 2.0},
    ],
}

"""Viral kitchen video verification and data traceability -- multi-stage task.

Environments: filesystem, email
2 stages: material verification + data tracing -> corporate damage control + silent SQL update
6 core checkers (0 keyword-search)
"""
import csv
from io import StringIO

# ── Helpers ────────────────────────────────────────────────────────

def _read_csv(ctx, filename: str) -> list[dict]:
    """Read a CSV from workspace root, workspace/outputs/, or workspace/workspace/."""
    for subdir in ["", "outputs", "workspace"]:
        path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
        if path.exists():
            text = path.read_text(encoding="utf-8-sig")
            return list(csv.DictReader(StringIO(text)))
    return []


def _read_file(ctx, filename: str) -> str | None:
    """Read a text file from workspace root, workspace/outputs/, or workspace/workspace/."""
    for subdir in ["", "outputs", "workspace"]:
        path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
        if path.exists() and path.stat().st_size > 0:
            return path.read_text(encoding="utf-8", errors="ignore")
    return None
# ── METADATA ───────────────────────────────────────────────────────

METADATA = {
    "id": "journalist_task5",
    "name": "Viral Kitchen Video Verification And Data Traceability",
    "category": "journalist",
    "environments": ["filesystem", "email"],
    "timeout_seconds": 600,
    "difficulty": "hard",
    "mm_level": "L4",
    "role": "Liu Ying's verification editor assistant",
    "tags": ["verification", "food-safety", "sql", "multimodal", "cross-verification", "screenshot-trap"],
    "env_config": {
        "email": {
            "users": {
                "liu_ying": {"email": "[email protected]", "password": "liu_ying_pwd"},
                "reporter_chen": {"email": "[email protected]", "password": "reporter_chen_pwd"},
                "li_pr": {"email": "[email protected]", "password": "li_pr_pwd"},
            },
        },
    },
}

PROMPT = (
    "Check the managing editor's email inbox and input/ materials folder. "
    "All your outputs must be in English."
)
# ── Stage Functions ────────────────────────────────────────────────

async def stage0(ctx):
    """2026-03-18 10:00: Material verification + data tracing."""
    # 1. Upload assets (personality .md files + initial input materials)
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")

    # 2. Seed email: Xiao Chen -> Liu Ying
    await ctx.email.send_email(
        from_user="reporter_chen",
        to="[email protected]",
        subject="Materials uploaded",
        body=(
            "Materials are uploaded. The video is from netizens, the former employee "
            "recording is from my phone interview. The database is in food_inspection.db, "
            "query it with sqlite3."
        ),
    )

    # 3. Notification -- Liu Ying's direct instruction
    return {
        "notification": (
            "[2026-03-18 10:00] "
            "The Runjian Calorie kitchen video is blowing up online. "
            "Xiao Chen already organized the materials. "
            "Check whether the corporate statement is true. "
            "The database is ready for you -- export the evidence you find. "
            "Also verify the screenshots one by one -- if any are wrong, tell Xiao Chen. "
            "If they are fine, leave them alone. Deadline is tomorrow afternoon.\n\n"
            "You use the managing editor's mailbox [email protected] to read and send emails. "
            "Contacts: [email protected] (Reporter Xiao Chen), "
            "[email protected] (Corporate PR Manager Li).\n"
            "Check the inbox -- Xiao Chen sent an email."
        ),
        "time": "2026-03-18T10:00:00+08:00",
    }
async def stage1(ctx):
    """2026-03-18 15:00: Corporate damage control + regulatory re-inspection."""
    # 1. Loud: Corporate PR emails Liu Ying with attachments
    await ctx.email.send_email(
        from_user="li_pr",
        to="[email protected]",
        subject="Rectification report and latest test certificate",
        body=(
            "We have noticed your media outlet is covering this matter. "
            "Attached are our rectification report and the latest test compliance certificate. "
            "We hope for objective reporting."
        ),
    )

    # 2. Loud: Upload rectification report and test certificate
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "rectification_report.pdf",
        "/workspace/input/",
    )
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "latest_test_cert.jpg",
        "/workspace/input/",
    )

    # 3. Silent: Insert new inspection record into SQLite database
    # Under public pressure, regulators conducted a surprise re-inspection
    await ctx.sandbox.exec(
        'sqlite3 /workspace/input/food_inspection.db '
        '"INSERT INTO inspections VALUES '
        "('S037', '润健卡路里', '鼓楼', '2026-03-18', '合格', '舆情复查');\""
    )

    # 4. Notification -- only mentions loud events (PR email + attachments)
    return {
        "notification": (
            "[2026-03-18 15:00] "
            "The corporate PR has reached out, saying they have rectified the issues. "
            "The article must be bulletproof -- don't leave any holes for people to poke at. "
            "Deadline is tomorrow 17:00, it must go out. "
            "Check the inbox -- PR Manager Li sent an email with attachments."
        ),
        "time": "2026-03-18T15:00:00+08:00",
    }
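
The source does not show how rubric entries are turned into a run score. The sketch below assumes the simplest scheme, passed weight divided by total weight; with the weights listed in RUBRIC (which sum to 10.0) that would yield multiples of 10%. The function `weighted_score`, the stub checkers, and `SAMPLE_RUBRIC` are hypothetical and stand in for the real harness.

```python
import asyncio


async def always_pass(ctx):
    """Stub checker that always succeeds (demo only)."""
    return True


async def always_fail(ctx):
    """Stub checker that always fails (demo only)."""
    return False


# Hypothetical rubric with the same shape as RUBRIC: stage -> list of entries.
SAMPLE_RUBRIC = {
    "stage0": [
        {"id": "A", "checker": always_pass, "weight": 2.0},
        {"id": "B", "checker": always_fail, "weight": 2.0},
        {"id": "C", "checker": always_pass, "weight": 1.0},
    ],
}


async def weighted_score(rubric, ctx):
    """Return passed weight / total weight across all stages (assumed scheme)."""
    earned = 0.0
    total = 0.0
    for entries in rubric.values():
        for entry in entries:
            total += entry["weight"]
            try:
                passed = await entry["checker"](ctx)
            except Exception:
                # Treat a crashing checker as a failed check.
                passed = False
            if passed:
                earned += entry["weight"]
    return earned / total if total else 0.0


score = asyncio.run(weighted_score(SAMPLE_RUBRIC, ctx=None))
print(f"{score:.1%}")
```

In this demo, entries A and C pass, so the earned weight is 3.0 out of 5.0.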
