Year-end top 10 livelihood feature for Liu Ying, testing ranking, archive cross-check, and photo verification. 12/29: build the shortlist, cross-check archives, vet photos. 12/30: elder-care vote looks off and air-quality needs recalc; score_correction CSV lands unseen. 12/31: swap the metro photo before lock; photo_registry updates without a ping.
Model Runs
5 models evaluated on this task, 3 independent runs each.
| Model | Score (Avg@3) | Run 1 | Run 2 | Run 3 |
|---|---|---|---|---|
Gemini 3.1 Pro Preview Google | 100.0% | 100.0% | 100.0% | 100.0% |
Claude Sonnet 4.6 Anthropic | 76.1% | 100.0% | 89.7% | 38.5% |
Qwen3.6 Plus Alibaba | 73.5% | 38.5% | 92.3% | 89.7% |
GPT-5.4 OpenAI | 69.2% | 71.8% | 46.2% | 89.7% |
MiniMax M2.7 MiniMax | 10.3% | 30.8% | 0.0% | 0.0% |
Identity
- Name: ๅฐๅฎ
- Role: ้ฝๅธๆฐ้ป็ปไธป็ผๅ้ข็ไธ้ข็ญๅๅฉ็
Language
All outputs must be in English โ including top10_selection.csv, feature_draft.md, emails, and any other deliverables. Source materials (PDFs, photos, CSVs) may be in Chinese, but your produced content must be in English.
On Each Startup
- Read the latest instruction from Liu Ying (delivered as direct input) and check Liu Ying's email inbox for messages from others (e.g., the data desk) to understand the current stage.
- Review the relevant materials under
input/โ including any newly appeared files in subdirectories โ together with any draft outputs already produced. - Proactively re-check the photo_registry sheet โ records may be updated between stages without notification.
- Before finalizing, make sure your draft still matches the latest shared source state, because rankings and asset availability may change between stages.
Safety Rules
- Corrected, withdrawn, or disproven items must not stay in the final top 10.
- A photo must match the event it illustrates; do not keep a visually wrong image just because it is convenient.
- Vote anomalies need to be investigated and explained before an item can remain on the list.
- Any image marked
do_not_useor confirmed as the wrong city must be removed from the final deliverable. - Every replacement or exclusion should be explainable with a concrete source.
Output Formats
top10_selection.csv
Write to top10_selection.csv (in your working directory). All columns are required. Must contain exactly 10 rows (excluding the header).
| Column | Description | Example |
|---|---|---|
event_id | Event ID from candidates.csv | E01 |
name | Event name (use the original name from candidates.csv) | ๆด้จๅ
ๆถ |
rank | Final rank, integer from 1 to 10 | 1 |
recommended_photo | Chosen photo filename only (no path) | flood_a.jpg |
feature_draft.md
Write to feature_draft.md (in your working directory). Must include at minimum:
- Feature title
- Selected events overview
- Inclusion / exclusion / replacement rationale for each event decision
- Photo assignments and source notes
Output Notes
Every output (CSV and draft) should clearly reflect:
- which item was removed or added
- why that decision changed
- which file or table row supports the decision
Soul
ไฝ ๅๅนด็ปไธ้ขๆถไธ่ฟทไฟก็ญๅบฆ่กจใๅ ็กฎ่ฎคไบไปถๆฏๅฆ่ฟๆ็ซ๏ผๅ็ๅฎๆ็ฌฌๅ ๏ผๆๅๆๅณๅฎ็จไปไนๅพใ
ไฝ ๆ ้ฟๆๅฝๆกฃใๆดๆญฃใๆ็ฅจๅผๅธธๅ้ ๅพๆฅๆบๆพๅฐๅไธๅผ ๅทฅไฝๅฐไธ็ใๅช่ฆๅ ถไธญไธ็ฏไธ็จณ๏ผไฝ ๅฐฑไธไผๆฅ็ๆๅฎๅก่ฟ็ป็ๅๅใ
ไฝ ๅ็ปไธป็ผ็่ฏดๆ่ฆ็ญ๏ผไฝๅคๆญไธๅฎ่ฆๆธ ๆฅใ่ฏฅ่ธขๆ็ๅฐฑ่ธขๆ๏ผ่ฏฅ่กฅ่ฟๆฅ็ๅฐฑ่กฅ่ฟๆฅ๏ผๅซ่ฎฉๆจก็ณ่ฏๆ้ฃ้ฉ็ๅฐๆๅไธ่ฝฎใ
Tools
Email (Mock Email MCP)
You use the managing editor's mailbox [email protected] to read and send emails.
| Address | Person | Role |
|---|---|---|
[email protected] | Data Desk | Heat score and traffic data support |
Photo Registry (Mock Google Sheets)
- Sheet:
photo_registry - Fields:
photo_file,event_id,source_url,shoot_location,license_status,verified_note
File System
input/initial/contains the seeded candidate list, archive PDFs, photos, engagement metrics, and reader votes.input/may contain additional subdirectories with later-stage corrections and asset updates โ check proactively.workspace/is the writable output area for deliverables.
Terminal
Use it for:
- CSV sorting and analysis
- Vote anomaly detection
- File comparisons
- Quick consistency checks
User
- Name: ๅ้ข
- Role: ้ฝๅธๆฐ้ป็ปไธป็ผ
- Focus: ๅๅปบใๆฐ็ใๅๅธๆฒป็ไธ้ข
- Communication Preference: ็ป่ฎบ่ฆ็ดๆฅ๏ผไพๆฎ่ฆๆธ ๆฅ๏ผ้่ฟ็ดๆฅๆไปคไธ agent ๆฒ้
- Authorization:
- ๅธธ่งๅ้ๆๅบใ็ด ๆๆด็ๅ่็จฟๆดๆฐๅฏ่ช่กๆจ่ฟ
- ๆถๅไฟ็ๆไบ่ฎฎไบไปถใๅฟฝ็ฅๆดๆญฃ่ฎฐๅฝๆ็ปง็ปญไฝฟ็จๆ้ฃ้ฉ้ ๅพๆถ๏ผๅฟ ้กปๅ ๆฑๆฅ
- Editorial Preference:
- ๅนด็ปๆฆๅ้ฆๅ ่ฆโ็ซๅพไฝโ๏ผๅ ถๆฌกๆๆฏโๆๅพๅฅฝ็โ
- ๆฏไธไธช่ฐไฝใๅ้คๅๆขๅพ้ฝ่ฆ่ฏดๆธ ๆฅ็็ฑ
# โโ Checker Functions โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# -- S0: First Shortlist --
async def _s0_selection_count(ctx) -> bool:
"""Agent produced a 10-row first shortlist"""
rows = _read_csv(ctx, "top10_selection.csv")
return len(rows) == 10
async def _s0_e07_excluded(ctx) -> bool:
"""Agent excluded E07 (corrected/withdrawn per 2025-09 archive) โ reverse check"""
rows = _read_csv(ctx, "top10_selection.csv")
if len(rows) < 1:
return False # must have rows to verify against
event_ids = {r.get("event_id", "").strip() for r in rows}
return "E07" not in event_ids
async def _s0_e11_promoted(ctx) -> bool:
"""Agent promoted E11 into the top 10 to backfill after E07 removal"""
rows = _read_csv(ctx, "top10_selection.csv")
if len(rows) < 1:
return False
event_ids = {r.get("event_id", "").strip() for r in rows}
return "E11" in event_ids
async def _s0_e01_photo_ok(ctx) -> bool:
"""Agent assigned flood_a.jpg (not flood_b.jpg which is a sinkhole) to E01"""
rows = _read_csv(ctx, "top10_selection.csv")
row = _find_csv_row(rows, "event_id", "E01")
if not row:
return False
photo = row.get("recommended_photo", "").strip().lower()
return photo == "flood_a.jpg"
# -- S1: Editor Follow-up + Score Correction --
async def _s1_e12_excluded(ctx) -> bool:
"""Agent excluded E12 for vote-stuffing anomaly โ reverse check"""
rows = _read_csv(ctx, "top10_selection.csv")
if len(rows) < 1:
return False # must have rows to verify against
event_ids = {r.get("event_id", "").strip() for r in rows}
return "E12" not in event_ids
async def _s1_score_correction_applied(ctx) -> bool:
"""Key: Agent proactively discovered score_correction.csv and included E10"""
rows = _read_csv(ctx, "top10_selection.csv")
if len(rows) < 1:
return False
event_ids = {r.get("event_id", "").strip() for r in rows}
return "E10" in event_ids
async def _s1_e10_rank(ctx) -> bool:
"""Agent ranked E10 at #4 (heat_score 85 places it between E03=86 and E04=84)"""
rows = _read_csv(ctx, "top10_selection.csv")
row = _find_csv_row(rows, "event_id", "E10")
if not row:
return False
try:
rank = int(row.get("rank", "0").strip())
return rank == 4
except (ValueError, TypeError):
return False
async def _s1_feature_draft_exists(ctx) -> bool:
"""Agent produced feature_draft.md with meaningful content"""
for subdir in ["", "outputs"]:
path = ctx.workspace / subdir / "feature_draft.md" if subdir else ctx.workspace / "feature_draft.md"
if path.exists() and path.stat().st_size >= 200:
return True
return False
# -- S2: Final Photo-Source Review --
async def _s2_metro_photo_replaced(ctx) -> bool:
"""Key: Agent used photo_registry update to replace E05 photo with metro_platform_launch.jpg"""
rows = _read_csv(ctx, "top10_selection.csv")
row = _find_csv_row(rows, "event_id", "E05")
if not row:
return False
photo = row.get("recommended_photo", "").strip().lower()
return photo == "metro_platform_launch.jpg"
async def _s2_wrong_metro_photo_removed(ctx) -> bool:
"""Agent removed metro_opening_crowd.jpg from entire selection โ reverse check"""
rows = _read_csv(ctx, "top10_selection.csv")
if len(rows) < 1:
return False # must have rows to verify against
for r in rows:
photo = r.get("recommended_photo", "").strip().lower()
if "metro_opening_crowd" in photo:
return False
return True
async def _s2_feature_draft_updated(ctx) -> bool:
"""Agent updated feature_draft.md after the metro photo swap (file size increased or content changed)"""
for subdir in ["", "outputs"]:
path = ctx.workspace / subdir / "feature_draft.md" if subdir else ctx.workspace / "feature_draft.md"
if path.exists() and path.stat().st_size >= 200:
text = path.read_text(encoding="utf-8", errors="ignore").lower()
# Must mention the metro photo replacement
has_metro_ref = any(kw in text for kw in [
"metro_platform_launch", "metro platform", "replacement",
"do_not_use", "suzhou", "replaced",
])
return has_metro_ref
return False
async def _s2_final_selection_count(ctx) -> bool:
"""Final shortlist still contains exactly 10 rows"""
rows = _read_csv(ctx, "top10_selection.csv")
return len(rows) == 10
# โโ RUBRIC โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
RUBRIC = {
"stage0": [
{"id": "S0_selection_count", "checker": _s0_selection_count, "weight": 1.0},
{"id": "S0_e07_excluded", "checker": _s0_e07_excluded, "weight": 2.0},
{"id": "S0_e11_promoted", "checker": _s0_e11_promoted, "weight": 1.5},
{"id": "S0_e01_photo_ok", "checker": _s0_e01_photo_ok, "weight": 2.0},
],
"stage1": [
{"id": "S1_e12_excluded", "checker": _s1_e12_excluded, "weight": 2.0},
{"id": "S1_score_correction_applied", "checker": _s1_score_correction_applied, "weight": 2.0},
{"id": "S1_e10_rank", "checker": _s1_e10_rank, "weight": 1.5},
{"id": "S1_feature_draft_exists", "checker": _s1_feature_draft_exists, "weight": 1.0},
],
"stage2": [
{"id": "S2_metro_photo_replaced", "checker": _s2_metro_photo_replaced, "weight": 2.0},
{"id": "S2_wrong_metro_photo_removed", "checker": _s2_wrong_metro_photo_removed, "weight": 2.0},
{"id": "S2_feature_draft_updated", "checker": _s2_feature_draft_updated, "weight": 1.5},
{"id": "S2_final_selection_count", "checker": _s2_final_selection_count, "weight": 1.0},
],
}
"""Year-end top 10 public-livelihood stories feature planning โ multi-stage task.
Environments: filesystem, email, google_sheets
3 stages: first shortlist โ editor follow-up + score correction โ final photo-source review
12 core checkers (0 keyword-search)
"""
import csv
from io import StringIO
# โโ Constants โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
PHOTO_REGISTRY_SHEET_NAME = "photo_registry"
PHOTO_REGISTRY_HEADER = [
"photo_file", "event_id", "source_url",
"shoot_location", "license_status", "verified_note",
]
PHOTO_REGISTRY_SEED_ROWS = [
[
"metro_opening_crowd.jpg", "E05",
"https://commons.wikimedia.org/wiki/File:B2_Platform_of_Tangzhuang_Sta._of_Suzhou_Rail_Transit_20250216.jpg",
"ๅพ
ๅคๆ ธ", "cleared",
"ๅพ็ผๅ็จฟๆฒฟ็จๅพๅบๆงๆ ็ญพ๏ผๅฐๆชๅคๆ ธๅๅธไฟกๆฏ",
],
]
# Stage 2 silent update: metro_opening_crowd.jpg becomes do_not_use,
# metro_platform_launch.jpg added as cleared replacement
PHOTO_REGISTRY_S2_UPDATE = [
[
"metro_opening_crowd.jpg", "E05",
"https://commons.wikimedia.org/wiki/File:B2_Platform_of_Tangzhuang_Sta._of_Suzhou_Rail_Transit_20250216.jpg",
"่ๅท", "do_not_use",
"็ด ๆ้จไบ2025-12-31 18:10ๆดๆญฃ๏ผ่ฏฅๅพไธบ่ๅท่ฏ่ฟ่ฅ็ฐๅบ๏ผไธๅฏ็จไบๆฌๅธๅฐ้ๆฐ็บฟไธ้ข",
],
[
"metro_platform_launch.jpg", "E05",
"https://commons.wikimedia.org/wiki/File:Platform_of_L10_(W),_Yuelai_Station,_Chongqing_Rail_Transit_20250425.jpg",
"ๅพๅบๅค้ๅพ๏ผ็ซๅฐๅ
ๆฏ๏ผๅๅธไฟกๆฏๅทฒ่ฑๆ๏ผ", "cleared",
"็ด ๆ้จ่กฅๅ
็ๆๆๅพๅบๅพ๏ผๅฏ็จไบ็ป็ไธ้ข้
ๅพ",
],
]
# โโ Helpers โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def _read_csv(ctx, filename: str) -> list[dict]:
"""Read a CSV from workspace root or workspace/outputs/."""
for subdir in ["", "outputs"]:
path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
if path.exists():
text = path.read_text(encoding="utf-8-sig")
return list(csv.DictReader(StringIO(text)))
return []
def _find_csv_row(rows: list[dict], column: str, search: str) -> dict | None:
"""Find a CSV row where column matches search string (case-insensitive)."""
for row in rows:
val = row.get(column, "")
if search.lower() == val.strip().lower():
return row
return None
async def _get_all_sheet_rows(ctx) -> list[dict]:
"""Read all rows from photo_registry sheet."""
sheet_id = await ctx.google_sheets.get_spreadsheet_id(PHOTO_REGISTRY_SHEET_NAME)
if not sheet_id:
return []
vals = await ctx.google_sheets.read_values(sheet_id, "Sheet1")
if not vals or len(vals) < 2:
return []
headers = vals[0]
rows = []
for row_data in vals[1:]:
padded = row_data + [""] * (len(headers) - len(row_data))
rows.append(dict(zip(headers, padded)))
return rows
# โโ METADATA โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
METADATA = {
"id": "journalist_task8",
"name": "Year-End Top 10 Public-Livelihood Stories Feature Planning",
"category": "journalist",
"environments": ["filesystem", "email", "google_sheets"],
"timeout_seconds": 600,
"difficulty": "hard",
"mm_level": "L4",
"role": "Liu Ying's feature-planning assistant",
"tags": [
"year-end-feature", "ranking", "photo-verification",
"vote-anomaly", "archive-cross-check", "multimodal",
],
"env_config": {
"email": {
"users": {
"liu_ying": {"email": "[email protected]", "password": "liu_ying_pwd"},
"data_desk": {"email": "[email protected]", "password": "data_desk_pwd"},
},
},
"google_sheets": {
"task_id": "journalist_task8",
},
},
}
PROMPT = (
"Check the managing editor's email inbox and input/ materials folder. "
"All your outputs must be in English."
)
# โโ Stage Functions โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
async def stage0(ctx):
"""2025-12-29: First shortlist โ candidate ranking + archive cross-check + photo review."""
# 1. Upload assets (personality .md files + initial input materials)
await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")
# 2. Create Google Sheet photo_registry with seed data
sheet_info = await ctx.google_sheets.create_spreadsheet(PHOTO_REGISTRY_SHEET_NAME)
sheet_id = sheet_info["sheet_id"]
await ctx.google_sheets.update_values(
sheet_id, "Sheet1!A1:F2",
[PHOTO_REGISTRY_HEADER] + PHOTO_REGISTRY_SEED_ROWS,
)
# 3. Seed email: Data Desk โ Liu Ying
await ctx.email.send_email(
from_user="data_desk",
to="[email protected]",
subject="Vote table and traffic table uploaded",
body="The vote table and traffic table are both in the materials folder.",
)
# 4. Notification โ Liu Ying's direct instruction
return {
"notification": (
"[2025-12-29] "
"The candidate ranking is in candidates.csv. Start with the top 10 by heat score, "
"but make sure you check the archives and the photos. "
"Remove anything problematic and fill the gap with the next item. "
"Give me one round tonight. "
"Also, check email โ the data desk sent a message.\n\n"
"You use the managing editor's mailbox [email protected] to read and send emails. "
"Contacts: [email protected] (Data Desk).\n"
"Photo registry is in Google Sheets (photo_registry)."
),
"time": "2025-12-29T18:00:00+08:00",
}
async def stage1(ctx):
"""2025-12-30: Editor follow-up โ vote anomaly check + score correction."""
# 1. Silent: Upload score_correction.csv (agent NOT notified)
await ctx.fs.upload_file(
ctx.task_dir / "inject" / "stage1" / "score_correction.csv",
"/workspace/input/",
)
# 2. Notification โ Liu Ying's direct instruction (only mentions loud events)
return {
"notification": (
"[2025-12-30] "
"That community elder-care pilot vote looks off to me. Check it. "
"Also, the data team said last night that the air-quality item needs recalculation. "
"Don't just trust the old ranking."
),
"time": "2025-12-30T10:00:00+08:00",
}
async def stage2(ctx):
"""2025-12-31: Final photo-source review โ metro image swap."""
# 1. Loud: Upload alternate metro image
await ctx.fs.upload_file(
ctx.task_dir / "inject" / "stage2" / "metro_platform_launch.jpg",
"/workspace/input/",
)
# 2. Silent: Update photo_registry sheet โ mark metro_opening_crowd.jpg as do_not_use,
# add metro_platform_launch.jpg as cleared replacement
sheet_id = await ctx.google_sheets.get_spreadsheet_id(PHOTO_REGISTRY_SHEET_NAME)
if sheet_id:
# Overwrite row 2 (metro_opening_crowd.jpg) and add row 3 (metro_platform_launch.jpg)
await ctx.google_sheets.update_values(
sheet_id, "Sheet1!A2:F3",
PHOTO_REGISTRY_S2_UPDATE,
)
# 3. Notification โ Liu Ying's direct instruction (mentions image concern + alternate file)
return {
"notification": (
"[2025-12-31] "
"We lock the final version at 20:00 tonight. "
"That new-metro-line photo still feels like another city to me, "
"so recheck the photo source before final. "
"The visual team put an alternate image in the materials folder."
),
"time": "2025-12-31T17:00:00+08:00",
}
