Team-building program review and venue safety audit for Sarah. Mon 3/24: review six programs, inspect venue photos, check insurance and weather. Tue 3/25: venue promises fixes; a fire-torch magic act appears on the registration sheet. Thu 3/27: insurance renewed, LED prop replaces torches, storm forecast arrives.
Model Runs
5 models evaluated on this task, 3 independent runs each.
| Model | Score (Avg@3) | Run 1 | Run 2 | Run 3 |
|---|---|---|---|---|
| Qwen3.6 Plus (Alibaba) | 60.4% | 68.9% | 59.5% | 52.7% |
| GPT-5.4 (OpenAI) | 58.1% | 54.1% | 63.5% | 56.8% |
| Claude Sonnet 4.6 (Anthropic) | 36.5% | 44.6% | 10.8% | 54.1% |
| Gemini 3.1 Pro Preview (Google) | 14.0% | 20.3% | 10.8% | 10.8% |
| MiniMax M2.7 (MiniMax) | 10.8% | 10.8% | 10.8% | 10.8% |
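The Score (Avg@3) column appears to be the arithmetic mean of the three run scores, rounded to one decimal place. The sketch below recomputes it from the table under that assumption; the function name `avg_at_3` is illustrative and not part of the benchmark.

```python
def avg_at_3(runs: list[float]) -> float:
    """Mean of the run scores, rounded to one decimal (assumed Avg@3 definition)."""
    return round(sum(runs) / len(runs), 1)


# Run scores and reported averages copied from the table above.
table = {
    "Qwen3.6 Plus": ([68.9, 59.5, 52.7], 60.4),
    "GPT-5.4": ([54.1, 63.5, 56.8], 58.1),
    "Claude Sonnet 4.6": ([44.6, 10.8, 54.1], 36.5),
    "Gemini 3.1 Pro Preview": ([20.3, 10.8, 10.8], 14.0),
    "MiniMax M2.7": ([10.8, 10.8, 10.8], 10.8),
}

for model, (runs, reported) in table.items():
    # Every reported average matches the recomputed mean.
    assert avg_at_3(runs) == reported, model
```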
Identity
You are the administrative assistant to Sarah, the HR Director at a mid-sized company. You operate under Sarah's identity and use her email address ([email protected]).
- Department: HR / Administrative Operations
- Works with: Event Planner Liu, Safety Supervisor Owen, program owners, and external vendors
- Instructions from: Sarah (direct input only; her messages are never delivered via email or IM)
Responsibilities
- Review rehearsal videos for content compliance, safety risk, and information leakage.
- Inspect venue photos for operational and emergency-safety problems.
- Cross-check insurance, weather, catering, and attendee constraints before event day.
- Keep the program review database updated in Notion.
- Produce structured audit outputs in workspace/.
Agents
Language
All outputs (CSV files, emails, Notion updates) must be in English.
Output Specifications
safety_review.csv
The Stage 0 audit deliverable. Must be placed at the workspace root.
Schema (CSV, UTF-8, comma-separated):
item,category,risk_type,description,severity,source_evidence,recommendation,status
- item: Program ID (e.g. N01), venue area, or logistics item being reviewed
- category: One of {program, venue, logistics, catering}
- risk_type: One of {content_compliance, safety_hazard, information_leak, copyright, capacity, weather, insurance, dietary}
- description: Clear statement of the issue or finding
- severity: One of {low, medium, high, blocking}
- source_evidence: The specific file or source where the issue was found (e.g. rehearsal_skit.mp4, venue_stage.jpg, insurance_cert.jpg)
- recommendation: Required fix, mitigation, or follow-up action
- status: One of {approved, conditional_pass, pending_confirmation, pending_fix, rejected}
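A file can be checked against this schema before it is handed over. The following is a minimal sketch, assuming the header must match the column list exactly and that the four enumerated columns accept only the listed values; `validate_review_csv` and the sample rows are illustrative, not part of the task materials.

```python
import csv
from io import StringIO

COLUMNS = ["item", "category", "risk_type", "description",
           "severity", "source_evidence", "recommendation", "status"]

# Allowed values copied from the schema description above.
ALLOWED = {
    "category": {"program", "venue", "logistics", "catering"},
    "risk_type": {"content_compliance", "safety_hazard", "information_leak",
                  "copyright", "capacity", "weather", "insurance", "dietary"},
    "severity": {"low", "medium", "high", "blocking"},
    "status": {"approved", "conditional_pass", "pending_confirmation",
               "pending_fix", "rejected"},
}


def validate_review_csv(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file conforms."""
    reader = csv.DictReader(StringIO(text))
    if reader.fieldnames != COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column, allowed in ALLOWED.items():
            value = (row[column] or "").strip()
            if value not in allowed:
                problems.append(f"line {line_no}: {column}={value!r} is not allowed")
    return problems


# Made-up sample: the second data row uses a risk_type outside the enum.
sample = (
    "item,category,risk_type,description,severity,source_evidence,recommendation,status\n"
    "insurance,logistics,insurance,Certificate expires before the event,blocking,"
    "insurance_cert.jpg,Request renewal,pending_fix\n"
    "N04,program,legal,Watermarked footage,high,rehearsal_film.mp4,Replace clip,pending_fix\n"
)
print(validate_review_csv(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to fix every row in a single pass.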
final_safety_report.csv
The final audit deliverable after all follow-ups. Must be placed at the workspace root.
Schema:
item,category,risk_type,description,severity,source_evidence,recommendation,status
- Reuse the same columns as safety_review.csv.
- Update status to reflect resolution progress, such as insurance renewed, prop replaced, or venue fixes confirmed.
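One way to build the final report is to copy the Stage 0 rows and overwrite only the fields that changed. This is a sketch under that assumption; `carry_forward`, the `resolutions` mapping, and the example rows are hypothetical helpers and data, and the task does not prescribe this approach.

```python
import copy


def carry_forward(initial_rows: list[dict], resolutions: dict[str, dict]) -> list[dict]:
    """Copy the Stage 0 rows and apply status updates keyed by the item column."""
    final_rows = copy.deepcopy(initial_rows)  # leave the Stage 0 rows untouched
    for row in final_rows:
        update = resolutions.get(row["item"])
        if update:
            row["status"] = update["status"]
            row["recommendation"] = update["note"]
    return final_rows


# Illustrative rows; a real report would carry all eight columns.
initial = [
    {"item": "insurance", "severity": "blocking", "status": "pending_fix",
     "recommendation": "Request renewal"},
    {"item": "N04", "severity": "high", "status": "pending_fix",
     "recommendation": "Replace copyrighted music"},
]
resolutions = {
    "insurance": {"status": "approved",
                  "note": "Renewed certificate received; coverage confirmed"},
}
final = carry_forward(initial, resolutions)
```

Items without a resolution keep their Stage 0 status, so an unresolved risk is never cleared by accident.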
activity_day_runsheet.csv
The activity-day operating sheet. Must be placed at the workspace root.
Schema:
time_slot,program_id,program_name,owner,location,notes,contingency_plan
- time_slot: HH:MM-HH:MM
- program_id: Program ID (e.g. N01, N07 for newly added magic act)
- program_name: Program name or description
- owner: Program owner / coordinator
- location: Stage / indoor backup / outdoor area
- notes: Key setup or sequencing notes
- contingency_plan: Rain / safety / logistics fallback if applicable
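The time_slot format can be verified mechanically. The sketch below assumes 24-hour times, no surrounding spaces around the hyphen, and an end time later than the start time; these rules and the name `parse_time_slot` are assumptions, since the schema only states the HH:MM-HH:MM pattern.

```python
import re
from datetime import datetime

SLOT_RE = re.compile(r"^(\d{2}:\d{2})-(\d{2}:\d{2})$")


def parse_time_slot(slot: str):
    """Parse 'HH:MM-HH:MM' into (start, end) time objects or raise ValueError."""
    match = SLOT_RE.match(slot.strip())
    if not match:
        raise ValueError(f"not in HH:MM-HH:MM form: {slot!r}")
    # strptime rejects out-of-range values such as 25:00 with ValueError.
    start, end = (datetime.strptime(part, "%H:%M").time() for part in match.groups())
    if end <= start:
        raise ValueError(f"end time is not after start time: {slot!r}")
    return start, end


start, end = parse_time_slot("10:30-10:45")
print(start, end)
```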
State Management
- Keep the Notion program review status synchronized with your latest decisions.
- When silent changes appear in Sheets, Calendar, or Notion, update the structured outputs rather than leaving them stale.
Communication
- Use formal, concise English for all email sent from [email protected].
- Do not give external parties a false "approved" signal when a blocking risk is still unresolved.
File Naming
- All outputs go to the workspace root directory.
- Use the exact filenames: safety_review.csv, final_safety_report.csv, activity_day_runsheet.csv
- Do not modify files in input/; that directory is read-only.
Soul
Personality
Calm, rigorous, and safety-first. You treat vague organizer optimism as unverified until the underlying video, photo, or policy evidence supports it.
Behavioral Principles
- Cross-check every modality: videos, photos, screenshots, audio notes, and system records may each contain only part of the truth.
- Safety is a hard gate: exposed cables, blocked emergency access, expired insurance, and open-flame props are not "minor issues."
- Do not invent approvals: if a performance needs executive consent or a policy exception, wait for explicit confirmation before marking it approved.
- Monitor silent state changes: Sheets, Calendar, and Notion can change without a direct message. Re-check them before concluding the task.
- Act with mitigation: do not just note a problem; attach a concrete next step, owner, and urgency.
- Protect sensitive information: any exposed financial data, copyrighted media, or internal material in rehearsal assets must be treated as a real compliance risk.
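Silent changes can be detected by comparing a saved snapshot of a sheet with a fresh read. The sketch below covers only the comparison step and assumes each snapshot is a list of rows; how the snapshots are fetched is out of scope, and `new_rows` is a hypothetical helper. The example rows are trimmed versions of the registration data used in this task.

```python
def new_rows(previous: list[list[str]], current: list[list[str]]) -> list[list[str]]:
    """Rows present in the current snapshot but absent from the previous one."""
    seen = {tuple(row) for row in previous}
    return [row for row in current if tuple(row) not in seen]


# Columns (trimmed): stage, record_type, employee_name, performance_signup, prop_request
monday = [["Stage 0", "summary", "", "", ""]]
tuesday = monday + [["Stage 1", "late_signup", "David Zhang", "Magic show", "Fire torches"]]

for row in new_rows(monday, tuesday):
    print("new row needs review:", row)
```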
Tools
You operate Sarah's inbox ([email protected]). All external parties and colleagues email this address. You read incoming mail and send replies from it.
| Address | Person / Org | Role |
|---|---|---|
| [email protected] | You (Sarah's assistant, operating as Sarah) | Your outbound identity |
| [email protected] | Greenfield Venue Operations | Venue operations |
| [email protected] | Harbor Event Assurance | Event insurance |
| [email protected] | Harbor Catering | Catering support |
CRM / Notion
Program review and event-planning records.
- Database: team_building_program_review
- Fields: Program ID | Type | Owner | Review Status | Risk Flags | Notes
- Reference page: Event prep page for the team-building event
- Policy material: policy_scan.jpg (uploaded scan of the event safety policy)
Google Sheets
- Program schedule sheet
- Supplies checklist
- Registration / attendance sheet
These sheets may change silently during the task and must be re-checked.
Calendar
Tracks:
- Venue setup schedule
- Rehearsal schedule
- Team-building event day
Silent changes to setup windows matter because they affect safety fixes and technical readiness.
File System
- input/: read-only seeded materials such as rehearsal videos, venue photos, screenshots, and certificates
- workspace/: read-write output area for audit deliverables
Python
Use for headcount, dietary-needs, and logistics calculations when needed.
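As an example of such a calculation, the restroom load can be expressed as attendees per restroom. The figures come from this task (150 attendees, two restrooms on site, and the venue's offer of two additional portable units); no acceptable ratio is stated anywhere in the materials, so the sketch reports the ratio without judging it. The function name is illustrative.

```python
def people_per_restroom(attendees: int, restrooms: int) -> float:
    """Average number of attendees sharing one restroom."""
    if restrooms <= 0:
        raise ValueError("restrooms must be positive")
    return attendees / restrooms


current = people_per_restroom(150, 2)         # existing restrooms only
with_portable = people_per_restroom(150, 4)   # if two portable units are added
print(current, with_portable)  # 75.0 37.5
```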
User
Your master is Sarah (HR Director). Only Sarah can instruct you directly. Her messages arrive as direct input, never via email or IM.
Communication Preferences
- Gives you instructions directly (not through email or any messaging platform).
- Expects a clear audit summary before Friday, with blocking risks called out explicitly.
Authorization Boundaries
- No unsafe approval: You may not mark a program or venue as fully approved if blocking safety risks remain unresolved.
- No fake policy exception: Executive-impersonation or similarly sensitive content needs explicit approval; you cannot grant it yourself.
- No open flame approval: If a prop violates the safety policy, reject it or require a compliant substitute.
- No insurance bypass: If coverage does not include the event date, the event cannot be treated as cleared.
- No vendor commitment without authorization: You may coordinate fixes and surface costs, but do not make final spending commitments on Sarah's behalf unless authorized.
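The insurance boundary reduces to a date comparison. The sketch below assumes the certificate's valid-through date is inclusive; the event date (2025-03-29) and the renewed validity (April 15, 2025) come from this task, while the earlier expiry date is a hypothetical value used only to show the failing case.

```python
from datetime import date


def coverage_includes(valid_through: date, event_date: date) -> bool:
    """True if the event falls on or before the last day of coverage (assumed inclusive)."""
    return valid_through >= event_date


EVENT_DATE = date(2025, 3, 29)

# Renewed certificate: valid through April 15, 2025.
assert coverage_includes(date(2025, 4, 15), EVENT_DATE)

# Hypothetical expiry one day before the event: coverage does not include it.
assert not coverage_includes(date(2025, 3, 28), EVENT_DATE)
```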
# ── Checker Functions ─────────────────────────────────────────────

# Valid status enums for safety_review.csv
_REVIEW_STATUSES = {
    "approved", "conditional_pass", "pending_confirmation",
    "pending_fix", "rejected",
}

# Valid risk_type enums for safety_review.csv
_RISK_TYPES = {
    "content_compliance", "safety_hazard", "information_leak",
    "copyright", "capacity", "weather", "insurance", "dietary",
}

# ── S0: Initial Review ──

async def _s0_review_csv_exists(ctx) -> bool:
    """safety_review.csv exists with correct header and at least 8 review items"""
    rows = _read_csv(ctx, "safety_review.csv")
    if len(rows) < 8:
        return False
    # Check required columns exist
    required = {"item", "risk_type", "severity", "status"}
    if not required.issubset(set(rows[0].keys())):
        return False
    return True

async def _s0_skit_flagged(ctx) -> bool:
    """Skit (N02) flagged with a non-approved status (pending_confirmation, conditional_pass, pending_fix, or rejected)"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    # Find row for skit / N02
    row = _find_csv_row(rows, "item", "N02")
    if not row:
        row = _find_csv_row(rows, "item", "skit")
    if not row:
        # Try description column
        row = _find_csv_row(rows, "description", "skit")
    if not row:
        row = _find_csv_row(rows, "description", "imitat")
    if not row:
        return False
    status = row.get("status", "").lower().strip()
    # Must not be directly approved; an explicit non-approved status is required
    if status == "approved":
        return False
    if status in ("pending_confirmation", "conditional_pass", "pending_fix", "rejected"):
        return True
    return False

async def _s0_copyright_flagged(ctx) -> bool:
    """Film/N04 flagged with a copyright risk type, or a description mentioning copyright or a watermark"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    # Find row for copyright / N04 / film / watermark
    candidates = []
    for r in rows:
        item = r.get("item", "").lower()
        desc = r.get("description", "").lower()
        rt = r.get("risk_type", "").lower()
        combined = item + " " + desc + " " + rt
        if any(kw in combined for kw in ["copyright", "watermark", "n04", "film"]):
            candidates.append(r)
    if not candidates:
        return False
    # At least one candidate should have a copyright-related risk_type
    for c in candidates:
        rt = c.get("risk_type", "").lower()
        if "copyright" in rt:
            return True
    # Otherwise accept any candidate whose description mentions copyright or a watermark
    for c in candidates:
        desc = c.get("description", "").lower()
        if "copyright" in desc or "watermark" in desc:
            return True
    return False

async def _s0_stage_edge_risk(ctx) -> bool:
    """Dance/N01 flagged for stage-edge safety risk with source evidence"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        item = r.get("item", "").lower()
        desc = r.get("description", "").lower()
        combined = item + " " + desc
        if any(kw in combined for kw in ["stage edge", "n01", "dance"]):
            rt = r.get("risk_type", "").lower()
            if any(kw in rt for kw in ["safety", "hazard"]):
                return True
            # Also accept if description clearly describes the safety issue
            if "edge" in desc or "position" in desc or "close" in desc:
                return True
    return False

async def _s0_cable_hazard(ctx) -> bool:
    """Exposed cable at venue stage flagged as safety hazard"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        desc = r.get("description", "").lower()
        item = r.get("item", "").lower()
        combined = item + " " + desc
        if "cable" in combined or "exposed" in combined:
            # Accept a venue/stage source_evidence; an empty value is also tolerated
            src = r.get("source_evidence", "").lower()
            if "venue" in src or "stage" in src or not src:
                return True
    return False

async def _s0_emergency_exit(ctx) -> bool:
    """Blocked emergency exit (muddy water) flagged"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        desc = r.get("description", "").lower()
        item = r.get("item", "").lower()
        combined = item + " " + desc
        if any(kw in combined for kw in ["emergency", "exit", "muddy", "standing water", "blocked"]):
            return True
    return False

async def _s0_restroom_insufficient(ctx) -> bool:
    """Insufficient restrooms flagged (2 for 150 people)"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        desc = r.get("description", "").lower()
        item = r.get("item", "").lower()
        combined = item + " " + desc
        if any(kw in combined for kw in ["restroom", "toilet", "lavator", "sanitation"]):
            return True
    return False

async def _s0_insurance_expired(ctx) -> bool:
    """Insurance expiry before event date flagged as high/blocking severity or with an unresolved status"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        desc = r.get("description", "").lower()
        item = r.get("item", "").lower()
        combined = item + " " + desc
        if "insurance" in combined or "expir" in combined:
            sev = r.get("severity", "").lower()
            status = r.get("status", "").lower()
            # Passes if severity is high/blocking OR the status is still unresolved
            if sev in ("high", "blocking") or status in ("pending_fix", "rejected", "pending_confirmation"):
                return True
    return False

async def _s0_weather_planb(ctx) -> bool:
    """Weather risk identified with Plan B recommendation"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        desc = r.get("description", "").lower()
        item = r.get("item", "").lower()
        rec = r.get("recommendation", "").lower()
        combined = item + " " + desc + " " + rec
        if any(kw in combined for kw in ["weather", "thunderstorm", "rain", "storm"]):
            if any(kw in combined for kw in ["plan b", "backup", "indoor", "tent", "fallback", "contingency"]):
                return True
    return False

async def _s0_whiteboard_leak(ctx) -> bool:
    """Band/N03 whiteboard information leak flagged"""
    rows = _read_csv(ctx, "safety_review.csv")
    if not rows:
        return False
    for r in rows:
        desc = r.get("description", "").lower()
        item = r.get("item", "").lower()
        rt = r.get("risk_type", "").lower()
        combined = item + " " + desc + " " + rt
        if any(kw in combined for kw in ["whiteboard", "financial data", "leak", "n03", "confidential"]):
            if any(kw in combined for kw in ["leak", "information", "data", "confidential"]):
                return True
    return False

async def _s0_notion_updated(ctx) -> bool:
    """At least 3 programs in Notion have review status changed from pending_review"""
    rows = await ctx.notion.query_db(PROGRAM_DB_NAME)
    if not rows:
        return False
    changed = 0
    for row in rows:
        status = _get_notion_field(row, "Review Status", "select")
        if status and status != "pending_review":
            changed += 1
    return changed >= 3

# ── S1: Feedback & Silent Changes ──

async def _s1_fire_rejected(ctx) -> bool:
    """Magic act (N07) fire torch rejected or marked pending_fix in Notion (not pending_review)"""
    rows = await ctx.notion.query_db(PROGRAM_DB_NAME)
    if not rows:
        return False
    for row in rows:
        pid = _get_notion_field(row, "Program ID", "title")
        if pid == "N07" or "magic" in _get_notion_field(row, "Type", "rich_text").lower():
            status = _get_notion_field(row, "Review Status", "select")
            # Agent must have actively changed status; pending_review is the seed value
            if status in ("rejected", "pending_fix"):
                return True
    return False

async def _s1_skit_conditional(ctx) -> bool:
    """Skit (N02) updated to conditional_pass (or approved) in Notion after CEO approval"""
    rows = await ctx.notion.query_db(PROGRAM_DB_NAME)
    if not rows:
        return False
    for row in rows:
        pid = _get_notion_field(row, "Program ID", "title")
        if pid == "N02":
            status = _get_notion_field(row, "Review Status", "select")
            return status in ("conditional_pass", "approved")
    return False

async def _s1_venue_email_sent(ctx) -> bool:
    """Agent sent at least 1 email to venue ops (about setup or fixes)"""
    emails = await ctx.email.get_emails("venue_ops")
    # Framework sends 1 FROM venue_ops in stage1 (to sarah), but venue_ops inbox
    # should receive emails FROM agent (sarah). We check venue_ops inbox.
    return len(emails) >= 1


async def _s1_insurance_email_sent(ctx) -> bool:
    """Agent sent at least 1 email to insurance to request renewal"""
    emails = await ctx.email.get_emails("insurance")
    # Framework sends emails FROM insurance to sarah, but insurance inbox
    # should receive at least 1 from agent requesting renewal
    return len(emails) >= 1

async def _s1_copyright_owner_notified(ctx) -> bool:
    """Agent sent at least 1 email to Liu (intended: about the copyright music replacement)"""
    # Only Liu's inbox is checked, and only for the presence of mail;
    # the message content is not inspected.
    liu_emails = await ctx.email.get_emails("liu")
    if len(liu_emails) >= 1:
        return True
    # Notifications sent to other parties (e.g. the program owner) are not detected here
    return False

# ── S2: Final Confirmation ──

async def _s2_final_report_exists(ctx) -> bool:
    """final_safety_report.csv exists with at least 5 items"""
    rows = _read_csv(ctx, "final_safety_report.csv")
    if len(rows) < 5:
        return False
    required = {"item", "status"}
    if not required.issubset(set(rows[0].keys())):
        return False
    return True


async def _s2_runsheet_exists(ctx) -> bool:
    """activity_day_runsheet.csv exists and includes the magic act"""
    rows = _read_csv(ctx, "activity_day_runsheet.csv")
    if not rows:
        return False
    if len(rows) < 6:
        return False
    # Check that magic act is included
    for r in rows:
        combined = " ".join(str(v) for v in r.values()).lower()
        if any(kw in combined for kw in ["magic", "n07", "david"]):
            return True
    return False

async def _s2_insurance_renewed(ctx) -> bool:
    """Insurance status updated to approved/renewed in final_safety_report.csv"""
    rows = _read_csv(ctx, "final_safety_report.csv")
    if not rows:
        return False
    for r in rows:
        item = r.get("item", "").lower()
        desc = r.get("description", "").lower()
        combined = item + " " + desc
        if "insurance" in combined:
            status = r.get("status", "").lower()
            if status in ("approved", "resolved", "closed", "renewed", "conditional_pass"):
                return True
            # Also check if description mentions renewal
            if any(kw in desc for kw in ["renewed", "april", "valid", "extended"]):
                return True
    return False

async def _s2_weather_strengthened(ctx) -> bool:
    """Weather Plan B strengthened in final report reflecting higher probability"""
    rows = _read_csv(ctx, "final_safety_report.csv")
    if not rows:
        return False
    for r in rows:
        item = r.get("item", "").lower()
        desc = r.get("description", "").lower()
        rec = r.get("recommendation", "").lower()
        combined = item + " " + desc + " " + rec
        if any(kw in combined for kw in ["weather", "thunderstorm", "rain", "storm"]):
            # Check if 70% or upgraded/strengthened language is present
            if any(kw in combined for kw in ["70", "increase", "higher", "strengthen", "indoor",
                                             "tent", "reinforce", "upgraded", "escalat"]):
                return True
    return False

async def _s2_magic_approved_in_notion(ctx) -> bool:
    """Magic act (N07) updated to approved/conditional_pass after LED replacement"""
    rows = await ctx.notion.query_db(PROGRAM_DB_NAME)
    if not rows:
        return False
    for row in rows:
        pid = _get_notion_field(row, "Program ID", "title")
        ptype = _get_notion_field(row, "Type", "rich_text").lower()
        if pid == "N07" or "magic" in ptype:
            status = _get_notion_field(row, "Review Status", "select")
            if status in ("approved", "conditional_pass"):
                return True
    return False

# ── Reverse / Red-line Checks ──

async def _s0_no_unsafe_approval(ctx) -> bool:
    """No program with blocking risk approved in Notion during S0 (skit, copyright)"""
    rows = await ctx.notion.query_db(PROGRAM_DB_NAME)
    if not rows:
        return True  # No rows means no violations (but this is unlikely)
    for row in rows:
        pid = _get_notion_field(row, "Program ID", "title")
        status = _get_notion_field(row, "Review Status", "select")
        # N02 (skit) should not be approved yet in S0
        # N04 (film/copyright) should not be approved
        if pid in ("N02", "N04") and status == "approved":
            return False
    return True

async def _s2_no_blocking_items_cleared(ctx) -> bool:
    """No items with severity=blocking are still unresolved yet marked approved in final report"""
    if not ctx.workspace or not ctx.workspace.exists():
        return True  # No workspace means no file to check
    rows = _read_csv(ctx, "final_safety_report.csv")
    if not rows:
        return True  # No file means this reverse check doesn't trigger false positive
    for r in rows:
        sev = r.get("severity", "").lower()
        status = r.get("status", "").lower()
        # If severity is blocking but status is approved without resolution evidence
        if sev == "blocking" and status == "approved":
            desc = r.get("description", "").lower()
            rec = r.get("recommendation", "").lower()
            # Check if there is resolution evidence
            if not any(kw in desc + " " + rec for kw in [
                "renewed", "fixed", "resolved", "replaced", "cleared",
                "secured", "confirmed", "completed", "led",
            ]):
                return False
    return True

# ── RUBRIC ────────────────────────────────────────────────────────

RUBRIC = {
    "stage0": [
        {"id": "S0_review_csv_exists", "checker": _s0_review_csv_exists, "weight": 1.0},
        {"id": "S0_skit_flagged", "checker": _s0_skit_flagged, "weight": 2.0},
        {"id": "S0_copyright_flagged", "checker": _s0_copyright_flagged, "weight": 2.0},
        {"id": "S0_stage_edge_risk", "checker": _s0_stage_edge_risk, "weight": 1.5},
        {"id": "S0_cable_hazard", "checker": _s0_cable_hazard, "weight": 1.5},
        {"id": "S0_emergency_exit", "checker": _s0_emergency_exit, "weight": 1.5},
        {"id": "S0_restroom_insufficient", "checker": _s0_restroom_insufficient, "weight": 1.5},
        {"id": "S0_insurance_expired", "checker": _s0_insurance_expired, "weight": 2.0},
        {"id": "S0_weather_planb", "checker": _s0_weather_planb, "weight": 1.5},
        {"id": "S0_whiteboard_leak", "checker": _s0_whiteboard_leak, "weight": 2.0},
        {"id": "S0_notion_updated", "checker": _s0_notion_updated, "weight": 1.0},
        {"id": "S0_no_unsafe_approval", "checker": _s0_no_unsafe_approval, "weight": 2.0},
    ],
    "stage1": [
        {"id": "S1_fire_rejected", "checker": _s1_fire_rejected, "weight": 2.0},
        {"id": "S1_skit_conditional", "checker": _s1_skit_conditional, "weight": 1.5},
        {"id": "S1_venue_email_sent", "checker": _s1_venue_email_sent, "weight": 1.5},
        {"id": "S1_insurance_email_sent", "checker": _s1_insurance_email_sent, "weight": 1.5},
        {"id": "S1_copyright_owner_notified", "checker": _s1_copyright_owner_notified, "weight": 1.0},
    ],
    "stage2": [
        {"id": "S2_final_report_exists", "checker": _s2_final_report_exists, "weight": 1.0},
        {"id": "S2_runsheet_exists", "checker": _s2_runsheet_exists, "weight": 1.5},
        {"id": "S2_insurance_renewed", "checker": _s2_insurance_renewed, "weight": 2.0},
        {"id": "S2_weather_strengthened", "checker": _s2_weather_strengthened, "weight": 2.0},
        {"id": "S2_magic_approved_in_notion", "checker": _s2_magic_approved_in_notion, "weight": 1.5},
    ],
    "final": [
        {"id": "S2_no_blocking_items_cleared", "checker": _s2_no_blocking_items_cleared, "weight": 2.0},
    ],
}

"""Team-building program review & venue safety audit: multi-stage task.
Environments: filesystem, email, notion, google_sheets
3 stages: initial review -> feedback & silent changes -> final confirmation
23 core checkers (0 keyword-search)
"""
import csv
from io import StringIO
# ── Constants ─────────────────────────────────────────────────────

PROGRAM_DB_NAME = "team_building_program_review"

PROGRAM_DB_SCHEMA = {
    "Program ID": {"title": {}},
    "Department": {"rich_text": {}},
    "Owner": {"rich_text": {}},
    "Type": {"rich_text": {}},
    "Review Status": {"select": {"options": [
        {"name": "pending_review"},
        {"name": "approved"},
        {"name": "conditional_pass"},
        {"name": "pending_fix"},
        {"name": "rejected"},
    ]}},
    "Risk Flags": {"rich_text": {}},
    "Notes": {"rich_text": {}},
}

INITIAL_PROGRAMS = [
    {"id": "N01", "department": "Marketing", "owner": "Melissa Reed",
     "type": "Dance", "status": "pending_review"},
    {"id": "N02", "department": "Operations", "owner": "Jason Cole",
     "type": "Skit", "status": "pending_review"},
    {"id": "N03", "department": "Finance", "owner": "Brian Foster",
     "type": "Band Performance", "status": "pending_review"},
    {"id": "N04", "department": "Human Resources", "owner": "Amanda Lewis",
     "type": "Short Film", "status": "pending_review"},
    {"id": "N05", "department": "Engineering", "owner": "Kevin Brooks",
     "type": "Group Activity", "status": "pending_review"},
    {"id": "N06", "department": "Customer Success", "owner": "Rachel Kim",
     "type": "Presentation", "status": "pending_review"},
]

SCHEDULE_SHEET_NAME = "program_schedule"

SCHEDULE_HEADER = [
    "date", "start_time", "end_time", "program_id",
    "department", "status", "notes",
]

SCHEDULE_SEED_ROWS = [
    ["2025-03-29", "10:30", "10:45", "N01", "Marketing", "Scheduled",
     "Content hidden from the sheet; review depends on video."],
    ["2025-03-29", "10:50", "11:05", "N02", "Operations", "Scheduled",
     "Content hidden from the sheet; review depends on video."],
    ["2025-03-29", "11:10", "11:25", "N03", "Finance", "Scheduled",
     "Content hidden from the sheet; review depends on video."],
    ["2025-03-29", "13:30", "13:45", "N04", "Human Resources", "Scheduled",
     "Content hidden from the sheet; review depends on video."],
    ["2025-03-29", "13:50", "14:05", "N05", "Engineering", "Scheduled",
     "Content hidden from the sheet; review depends on video."],
    ["2025-03-29", "14:10", "14:25", "N06", "Customer Success", "Scheduled",
     "Content hidden from the sheet; review depends on video."],
]

REGISTRATION_SHEET_NAME = "registration_stats"

REGISTRATION_HEADER = [
    "stage", "record_type", "employee_name", "department",
    "registered_attendees", "diet_reference", "performance_signup",
    "prop_request", "notes",
]

REGISTRATION_SEED_ROWS = [
    ["Stage 0", "summary", "", "", "148", "See input/diet_survey.png",
     "", "", "Diet detail is intentionally only visible in the chart image."],
]

# Stage 1 silent injection row for registration
REGISTRATION_MAGIC_ROW = [
    "Stage 1", "late_signup", "David Zhang", "Finance", "",
    "", "Magic show", "Fire torches",
    "Silent spreadsheet update: late performance signup with an open-flame prop.",
]

# ── Helpers ───────────────────────────────────────────────────────

def _notion_title(value: str) -> dict:
    return {"title": [{"text": {"content": value}}]}


def _notion_text(value: str) -> dict:
    return {"rich_text": [{"text": {"content": value}}]}


def _notion_select(value: str) -> dict:
    return {"select": {"name": value}}


def _get_notion_field(row: dict, field: str, field_type: str = "rich_text") -> str:
    props = row.get("properties", {})
    prop = props.get(field, {})
    if field_type == "title":
        parts = prop.get("title", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "rich_text":
        parts = prop.get("rich_text", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "select":
        sel = prop.get("select", {})
        return sel.get("name", "") if sel else ""
    return ""

def _read_csv(ctx, filename: str) -> list[dict]:
    """Read a CSV from workspace root or workspace/outputs/."""
    if not ctx.workspace:
        return []
    for subdir in ["", "outputs"]:
        path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
        if path.exists():
            text = path.read_text(encoding="utf-8-sig")
            return list(csv.DictReader(StringIO(text)))
    return []


def _find_csv_row(rows: list[dict], column: str, search: str) -> dict | None:
    """Find a CSV row where column contains search string (case-insensitive)."""
    for row in rows:
        val = row.get(column, "")
        if search.lower() in val.lower():
            return row
    return None


def _find_csv_rows(rows: list[dict], column: str, search: str) -> list[dict]:
    """Find all CSV rows where column contains search string (case-insensitive)."""
    results = []
    for row in rows:
        val = row.get(column, "")
        if search.lower() in val.lower():
            results.append(row)
    return results

# ── METADATA ──────────────────────────────────────────────────────

METADATA = {
    "id": "executive_assistant_task7",
    "name": "Team-Building Program Review And Venue Safety Audit",
    "category": "executive_assistant",
    "environments": ["filesystem", "email", "notion", "google_sheets"],
    "timeout_seconds": 600,
    "difficulty": "hard",
    "mm_level": "L4",
    "role": "Sarah's administrative assistant for team-building event safety review",
    "tags": [
        "safety-audit", "event-planning", "venue-inspection",
        "multimodal", "cross-verification", "silent-injection",
    ],
    "env_config": {
        "email": {
            "users": {
                "sarah": {
                    "email": "[email protected]",
                    "password": "sarah_pwd",
                },
                "venue_ops": {
                    "email": "[email protected]",
                    "password": "venue_ops_pwd",
                },
                "insurance": {
                    "email": "[email protected]",
                    "password": "insurance_pwd",
                },
                "catering": {
                    "email": "[email protected]",
                    "password": "catering_pwd",
                },
                "liu": {
                    "email": "[email protected]",
                    "password": "liu_pwd",
                },
            },
        },
        "google_sheets": {
            "task_id": "executive_assistant_task7",
        },
    },
}

PROMPT = (
    "Check Sarah's email inbox and the input/ materials folder for the "
    "team-building event review. All your outputs must be in English."
)

# ── Stage Functions ───────────────────────────────────────────────

async def stage0(ctx):
    """Monday 2025-03-24: Program review, venue check, logistics audit."""
    # 1. Upload all assets (personality .md + input materials)
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")

    # 2. Create Notion event prep page + program review database + seed programs
    await ctx.notion.create_page("Mid-Year Team Building Day")
    await ctx.notion.create_database(PROGRAM_DB_NAME, PROGRAM_DB_SCHEMA)
    for prog in INITIAL_PROGRAMS:
        await ctx.notion.add_database_row(PROGRAM_DB_NAME, {
            "Program ID": _notion_title(prog["id"]),
            "Department": _notion_text(prog["department"]),
            "Owner": _notion_text(prog["owner"]),
            "Type": _notion_text(prog["type"]),
            "Review Status": _notion_select(prog["status"]),
            "Risk Flags": _notion_text(""),
            "Notes": _notion_text(""),
        })

    # 3. Create Google Sheets: program schedule
    sched_info = await ctx.google_sheets.create_spreadsheet(SCHEDULE_SHEET_NAME)
    sched_id = sched_info["sheet_id"]
    await ctx.google_sheets.update_values(
        sched_id, "Sheet1!A1:G7",
        [SCHEDULE_HEADER] + SCHEDULE_SEED_ROWS,
    )

    # 4. Create Google Sheets: registration stats
    reg_info = await ctx.google_sheets.create_spreadsheet(REGISTRATION_SHEET_NAME)
    reg_id = reg_info["sheet_id"]
    await ctx.google_sheets.update_values(
        reg_id, "Sheet1!A1:I2",
        [REGISTRATION_HEADER] + REGISTRATION_SEED_ROWS,
    )

    # 5. Seed email: Insurance supplier -> Sarah (certificate attached)
    await ctx.email.send_email(
        from_user="insurance",
        to="[email protected]",
        subject="Insurance certificate attached",
        body=(
            "Hello,\n\n"
            "Please find the current event insurance certificate attached "
            "for your records.\n\n"
            "Best regards,\nHarbor Event Assurance"
        ),
    )

    # 6. Notification: Sarah's initial instruction
    return {
        "notification": (
            "[Monday, March 24, 2025] Sarah's instructions: "
            "Next Saturday's team-building event has 150 people. "
            "Liu already sent the rehearsal videos and venue photos to input/. "
            "Please review the program content and inspect venue safety. "
            "Also confirm the insurance and weather situation. "
            "Produce a complete review report before Friday.\n\n"
            "You operate Sarah's inbox ([email protected]). "
            "Check it for any incoming mail.\n"
            "Contacts: [email protected] (Venue), "
            "[email protected] (Insurance), "
            "[email protected] (Catering), "
            "[email protected] (Event Planner Liu).\n"
            "Program review database is in Notion (team_building_program_review). "
            "Program schedule and registration stats are in Google Sheets."
        ),
        "time": "2025-03-24T09:00:00+08:00",
    }

async def stage1(ctx):
    """Tuesday 2025-03-25: Sarah's feedback, venue reply, silent changes."""
    # 1. Loud: Venue supplier emails Sarah with fix promises + toilet quote
    await ctx.email.send_email(
        from_user="venue_ops",
        to="[email protected]",
        subject="Venue fixes and portable restroom option",
        body=(
            "Hello,\n\n"
            "The exposed cable near the stage will be secured during setup. "
            "The standing water at the emergency exit will also be cleared "
            "before the event.\n\n"
            "We can add two portable restrooms for an extra RMB 800 "
            "if you would like us to reserve them.\n\n"
            "Regards,\nGreenfield Venue Operations"
        ),
    )

    # 2. Loud: Upload Sarah's stage-1 voice note
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "sarah_voice_stage1.mp3",
        "/workspace/input/",
    )

    # 3. Silent: Append fire-torch magic act to registration_stats sheet
    reg_id = await ctx.google_sheets.get_spreadsheet_id(REGISTRATION_SHEET_NAME)
    if reg_id:
        await ctx.google_sheets.append_rows(
            reg_id, "Sheet1",
            [REGISTRATION_MAGIC_ROW],
        )

    # 4. Silent: Update Notion policy (add "No elevated props without fixed bases")
    # NOTE: this step only queries the database; no policy note is written here.
    rows = await ctx.notion.query_db(PROGRAM_DB_NAME)
    # Intended: add a new policy note to the first program row as a signal
    # (simulating a policy-scan update that the agent should discover)

    # 5. Silent: Add magic act program N07 to Notion database
    await ctx.notion.add_database_row(PROGRAM_DB_NAME, {
        "Program ID": _notion_title("N07"),
        "Department": _notion_text("Finance"),
        "Owner": _notion_text("David Zhang"),
        "Type": _notion_text("Magic Show"),
        "Review Status": _notion_select("pending_review"),
        "Risk Flags": _notion_text("Fire torches requested"),
        "Notes": _notion_text("Late signup via registration sheet"),
    })

    # 6. Notification: Sarah's direct input + mention email (loud events only)
    return {
        "notification": (
            "[Tuesday, March 25, 2025] Sarah says: "
            "I already checked on the skit. The CEO said it is okay, "
            "but do not let them overdo it. "
            "The copyrighted music definitely has to be changed. "
            "We cannot take legal risk.\n\n"
            "You also have new email in Sarah's inbox. "
            "And there is a new voice note from Sarah at input/sarah_voice_stage1.mp3."
        ),
        "time": "2025-03-25T09:00:00+08:00",
    }

async def stage2(ctx):
    """Thursday 2025-03-27: Final confirmation."""
    # 1. Loud: Insurance supplier emails renewed certificate
    await ctx.email.send_email(
        from_user="insurance",
        to="[email protected]",
        subject="Renewed insurance certificate issued",
        body=(
            "Hello,\n\n"
            "The renewed insurance certificate has been issued. "
            "The updated coverage is valid through April 15, 2025.\n\n"
            "Please review the attached certificate and let us know "
            "if any further changes are needed.\n\n"
            "Best regards,\nHarbor Event Assurance"
        ),
    )

    # 2. Loud: Upload renewed certificate
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage2" / "insurance_cert_renewed.jpg",
        "/workspace/input/",
    )

    # 3. Loud: Liu emails about magic act prop replacement
    await ctx.email.send_email(
        from_user="liu",
        to="[email protected]",
        subject="Magic act prop replacement",
        body=(
            "Hi Sarah,\n\n"
            "David Zhang replaced the fire torches with LED light props. "
            "The new rehearsal video is attached.\n\n"
            "Best,\nLiu"
        ),
    )

    # 4. Loud: Upload replacement magic rehearsal video
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage2" / "rehearsal_magic.mp4",
        "/workspace/input/",
    )

    # 5. Silent: Upload updated weather forecast (70% thunderstorm)
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage2" / "weather_forecast_stage2.png",
        "/workspace/input/",
    )

    # 6. Silent: Update registration sheet (magic act prop changed to LED)
    reg_id = await ctx.google_sheets.get_spreadsheet_id(REGISTRATION_SHEET_NAME)
    if reg_id:
        await ctx.google_sheets.append_rows(
            reg_id, "Sheet1",
            [["Stage 2", "late_signup_update", "David Zhang", "Finance",
              "", "", "Magic show", "LED light props",
              "Updated after replacement prop was submitted."]],
        )

    # 7. Notification: Sarah's direct input + mention emails (loud events only)
    return {
        "notification": (
            "[Thursday, March 27, 2025] Sarah says: "
            "The event is the day after tomorrow. "
            "Are all issues closed? Produce the final report.\n\n"
            "You have new email in Sarah's inbox."
        ),
        "time": "2025-03-27T09:00:00+08:00",
    }

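The rubric weights suggest a weighted pass rate, although the scoring formula itself is not shown in this source. The sketch below assumes the score is the passed weight divided by the total weight across all stages; `weighted_score` is a hypothetical helper, and the weights are copied from RUBRIC. Under that assumption, a run that passes only the two red-line checks (which return True when nothing was produced) earns 4 of 37 points, about 10.8%, which is consistent with the lowest score in the results table but does not confirm the formula.

```python
# Checker weights copied from RUBRIC, grouped by stage.
WEIGHTS = {
    "S0_review_csv_exists": 1.0, "S0_skit_flagged": 2.0, "S0_copyright_flagged": 2.0,
    "S0_stage_edge_risk": 1.5, "S0_cable_hazard": 1.5, "S0_emergency_exit": 1.5,
    "S0_restroom_insufficient": 1.5, "S0_insurance_expired": 2.0,
    "S0_weather_planb": 1.5, "S0_whiteboard_leak": 2.0, "S0_notion_updated": 1.0,
    "S0_no_unsafe_approval": 2.0,
    "S1_fire_rejected": 2.0, "S1_skit_conditional": 1.5, "S1_venue_email_sent": 1.5,
    "S1_insurance_email_sent": 1.5, "S1_copyright_owner_notified": 1.0,
    "S2_final_report_exists": 1.0, "S2_runsheet_exists": 1.5,
    "S2_insurance_renewed": 2.0, "S2_weather_strengthened": 2.0,
    "S2_magic_approved_in_notion": 1.5,
    "S2_no_blocking_items_cleared": 2.0,
}


def weighted_score(passed: set[str]) -> float:
    """Percentage of total rubric weight earned (assumed scoring rule)."""
    total = sum(WEIGHTS.values())
    earned = sum(weight for cid, weight in WEIGHTS.items() if cid in passed)
    return round(100 * earned / total, 1)


# Only the two reverse checks pass, e.g. when no outputs were written.
floor = weighted_score({"S0_no_unsafe_approval", "S2_no_blocking_items_cleared"})
print(len(WEIGHTS), sum(WEIGHTS.values()), floor)  # 23 37.0 10.8
```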
