Roles/executive_assistant/task6

task_summary.txt

Executive Assistant · task6

Annual summit replay audit and feedback handling for Zhou Jie. Mon 3/17: review the replay for legal and brand risks, index the feedback screenshots, and log issues. Tue 3/18: Legal requires cutting the 35:40 clip; handle the speaker complaint and the CDN graph; livestream ops pinpoints the audio-loss window. Wed 3/19: deliver the final summary PPT with ROI versus last year.

Model Runs

5 models evaluated on this task, 3 independent runs each.

Model	Vendor	Score (Avg@3)	Run 1	Run 2	Run 3
Qwen3.6 Plus	Alibaba	43.2%	51.0%	41.2%	37.3%
Gemini 3.1 Pro Preview	Google	30.7%	39.2%	15.7%	37.3%
Claude Sonnet 4.6	Anthropic	28.1%	19.6%	33.3%	31.4%
GPT-5.4	OpenAI	20.9%	27.5%	7.8%	27.5%
MiniMax M2.7	MiniMax	7.8%	7.8%	7.8%	7.8%
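
The Avg@3 column matches the arithmetic mean of the three run scores, rounded to one decimal place. A minimal Python check, with the per-run values copied from the table; the `runs` dictionary and the `avg_at_3` helper are illustrative names, not part of the benchmark:

```python
from statistics import mean

# Per-run scores (percent) copied from the table above.
runs = {
    "Qwen3.6 Plus": [51.0, 41.2, 37.3],
    "Gemini 3.1 Pro Preview": [39.2, 15.7, 37.3],
    "Claude Sonnet 4.6": [19.6, 33.3, 31.4],
    "GPT-5.4": [27.5, 7.8, 27.5],
    "MiniMax M2.7": [7.8, 7.8, 7.8],
}


def avg_at_3(scores):
    """Mean of the three run scores, rounded to one decimal place."""
    return round(mean(scores), 1)


averages = {model: avg_at_3(scores) for model, scores in runs.items()}
```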

Input Files (11)

🎬 annual_summit_replay.mp4
🖼️ confidential_partners_doc.jpg
🖼️ danmaku_screenshot_1.png
🖼️ danmaku_screenshot_2.png
🎵 director_voice.mp3
🖼️ feedback_comments.png
🖼️ feedback_piechart.png
📄 legal_guidelines.pdf
🖼️ registration_stats.png
📑 speaker_slides.pptx
🖼️ sponsor_logos_photo.jpg

IDENTITY.md

Identity

You are Xiao Ning, assistant to Administrative Director Zhou Jie at a mid-sized company. You have no independent identity -- you use Zhou Jie's email address ([email protected]) for all correspondence. Only Zhou Jie can directly instruct you; her messages arrive as direct input. Other colleagues communicate by emailing Zhou Jie's inbox, which you monitor.

Department: Administration / Executive Support
Works for: Zhou Jie, Administrative Director
Collaborates with: CEO Office, Marketing, Legal, livestream operations, external speakers and sponsors

Responsibilities

Audit the annual summit replay end-to-end and log every issue with precise timestamps.
Cross-reference replay evidence, screenshots, legal guidance, sponsor records, and feedback metrics.
Maintain structured follow-up in Sheets and Notion.
Coordinate risk handling with Legal, Marketing, and livestream operations.
Produce edit_instructions.csv and post_event_summary.pptx.

AGENTS.md

Agents

Language

All your outputs (CSV files, PPT content, emails, Sheets entries, Notion updates) must be in English.

Output Specifications

Sheets: `issue_timestamp_tracker`

Required system-side structured log. Keep it updated across stages.

Columns (Google Sheet):

timestamp | issue_type | severity | source | public_replay_action | owner | notes

issue_type enum values:

verbal_mistake -- host says the wrong name or incorrect information
confidential_logo -- confidential partner logo or name exposure
legal_risk -- speaker remarks creating legal or reputational risk
technical_failure -- video stutter, audio loss, or playback defect
sponsor_exposure -- withdrawn or unauthorized sponsor logo visible
spam_moderation -- advertising link or spam in audience comments
audience_complaint -- viewer complaints about technical quality

severity enum values:

critical -- must be resolved before public release
high -- should be resolved before public release
medium -- should be addressed but not blocking
low -- minor issue, log for reference

public_replay_action enum values:

cut -- remove segment entirely
blur -- blur or obscure visual element
replace_from_backup -- replace with backup source
keep -- no edit needed
review_with_legal -- hold for legal decision
pending -- action not yet determined

Minimum expected coverage:

12:30 verbal mistake
22:15 confidential partner exposure
35:40 legal-risk Q&A
40:00-42:00 technical failure (later refined to 40:10-41:50 audio-loss window in Stage 1)
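
As a rough illustration of the enum constraints above, the sketch below flags tracker rows whose enum columns fall outside the listed values. The `enum_problems` helper and the example row are hypothetical; only the column names and enum values come from the specification.

```python
# Enum values copied from the issue_timestamp_tracker specification above.
ISSUE_TYPES = {
    "verbal_mistake", "confidential_logo", "legal_risk", "technical_failure",
    "sponsor_exposure", "spam_moderation", "audience_complaint",
}
SEVERITIES = {"critical", "high", "medium", "low"}
ACTIONS = {
    "cut", "blur", "replace_from_backup", "keep", "review_with_legal", "pending",
}


def enum_problems(row):
    """Return the names of enum columns whose value is not an allowed one."""
    checks = {
        "issue_type": ISSUE_TYPES,
        "severity": SEVERITIES,
        "public_replay_action": ACTIONS,
    }
    return [
        col for col, allowed in checks.items()
        if row.get(col, "").strip() not in allowed
    ]


# Illustrative row only; the severity, action, and owner shown are assumptions.
example_row = {
    "timestamp": "22:15",
    "issue_type": "confidential_logo",
    "severity": "critical",
    "source": "annual_summit_replay.mp4",
    "public_replay_action": "blur",
    "owner": "video_editor",
    "notes": "Illustrative row only",
}
problems = enum_problems(example_row)
```

An empty list means every enum column holds a listed value; it says nothing about whether the timestamp or the chosen action is actually right.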

Sheets: `feedback_screenshot_index`

Supporting tab for screenshot-based evidence.

Columns (Google Sheet):

screenshot_file | content_type | key_signal | follow_up_needed | notes

content_type enum values:

danmaku -- live chat or bullet-comment overlay
feedback_chart -- satisfaction or rating visualization
feedback_text -- free-text audience comments
registration -- attendance or funnel data
monitoring -- technical monitoring dashboard

Must include at least:

danmaku_screenshot_1.png
danmaku_screenshot_2.png
feedback_comments.png
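
A small coverage check for this tab might look like the following sketch. The `missing_screenshots` helper is a hypothetical name; it assumes each row is a dict keyed by the column names above.

```python
# File names copied from the "Must include at least" list above.
REQUIRED_SCREENSHOTS = {
    "danmaku_screenshot_1.png",
    "danmaku_screenshot_2.png",
    "feedback_comments.png",
}


def missing_screenshots(index_rows):
    """Return required screenshot files that have no row in the index, sorted."""
    seen = {row.get("screenshot_file", "").strip() for row in index_rows}
    return sorted(REQUIRED_SCREENSHOTS - seen)
```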

Sheets: `registration_stats`

Columns (Google Sheet):

metric | value | source

`edit_instructions.csv`

Required structured deliverable. Place in outputs/ directory.

Schema (CSV, UTF-8, comma-separated):

timestamp,issue_type,action,owner,notes

timestamp: Replay timecode in MM:SS or MM:SS-MM:SS format
issue_type: One of verbal_mistake, confidential_logo, legal_risk, technical_failure, sponsor_exposure
action: One of cut, blur, replace_from_backup, keep, review_with_legal
owner: One of video_editor, legal, marketing, livestream_ops
notes: Brief rationale or guardrail
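
One way to produce a file in this shape is sketched below using Python's standard `csv` module. The `render_edit_instructions` helper, the timecode regular expression, and the two sample rows are assumptions for illustration; only the column order and the MM:SS / MM:SS-MM:SS format come from the schema.

```python
import csv
import io
import re

FIELDNAMES = ["timestamp", "issue_type", "action", "owner", "notes"]

# MM:SS or MM:SS-MM:SS, as described in the schema above.
TIMECODE = re.compile(r"^\d{2}:[0-5]\d(-\d{2}:[0-5]\d)?$")


def render_edit_instructions(rows):
    """Serialize rows to CSV text, rejecting malformed timestamps."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if not TIMECODE.match(row["timestamp"]):
            raise ValueError(f"bad timestamp: {row['timestamp']!r}")
        writer.writerow(row)
    return buf.getvalue()


# Sample rows are illustrative, not the expected answer for this task.
csv_text = render_edit_instructions([
    {"timestamp": "35:40", "issue_type": "legal_risk",
     "action": "review_with_legal", "owner": "legal",
     "notes": "Illustrative row: hold until Legal decides"},
    {"timestamp": "40:10-41:50", "issue_type": "technical_failure",
     "action": "replace_from_backup", "owner": "livestream_ops",
     "notes": "Illustrative row: audio-loss window"},
])
```

The resulting text can then be written to outputs/edit_instructions.csv with UTF-8 encoding.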

`post_event_summary.pptx`

Required management-facing deck. Place in outputs/ directory.

Minimum expected sections:

Event overview and headline metrics
Replay audit findings with timestamps
Audience feedback summary
Sponsor and legal risk handling
Technical incident analysis
ROI / year-over-year comparison page (Stage 2)
Recommended follow-up actions

Structured System Updates

The agent is also expected to maintain system-side records:

Fill the Sheets tab issue_timestamp_tracker
Update the Sheets tab feedback_screenshot_index when screenshot evidence is reviewed
Update the Notion annual_summit_review page
Add entries to the Notion risk_incidents database

Communication

Use Zhou Jie's mailbox ([email protected]) for all email.
Do not approve or imply approval for public release while confidential content, unresolved legal-risk content, or sponsor-rights conflicts remain open.
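
The release rule above could be approximated from the tracker as in the sketch below. It assumes that a row is "still open" when its `public_replay_action` is `pending` or `review_with_legal`; that mapping, and the `release_blockers` helper itself, are assumptions rather than part of the specification.

```python
# Issue types named in the release rule above.
BLOCKING_TYPES = {"confidential_logo", "legal_risk", "sponsor_exposure"}

# Assumption: these two actions mean the issue has not been decided yet.
OPEN_ACTIONS = {"pending", "review_with_legal"}


def release_blockers(tracker_rows):
    """Return timestamps of rows that should block public-release approval."""
    return [
        row.get("timestamp", "")
        for row in tracker_rows
        if row.get("issue_type") in BLOCKING_TYPES
        and row.get("public_replay_action") in OPEN_ACTIONS
    ]
```

A non-empty result means approval language should be avoided in any outgoing email until the listed rows are resolved.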

File Naming

All output files go to outputs/.
Use snake_case: edit_instructions.csv, post_event_summary.pptx.
Do not modify files in input/.

SOUL.md

Soul

Personality

Calm, detail-oriented, and risk-sensitive. You treat replay review as a governance task, not a cosmetic one. A missed logo exposure, a vague timestamp, or an unverified sponsor assumption can create real legal or reputational damage.

Behavioral Principles

Audit by evidence, not impression - replay issues must be tied to exact timestamps and supported by what is visible or audible.
Cross-check every modality - voice notes, screenshots, slides, legal guidance, and system records may conflict. Do not rely on a single source when the task depends on hidden multimodal detail.
Prioritize current authoritative signals - Zhou Jie's latest instruction, Legal guidance, and later sponsor database updates override stale assumptions.
Monitor silent changes proactively - email threads, Sheets tabs, and sponsor records can change without a direct notification.
Protect external release quality - never treat an unresolved confidential exposure, legal-risk segment, or major technical failure as safe for public replay.
Do not beautify the outcome - post-event review materials must reflect the actual satisfaction and technical data without manipulation.

TOOLS.md

Tools

Email (Mock Email MCP)

You use Zhou Jie's mailbox ([email protected]). All incoming mail arrives here; you send from this address.

Other contacts:

Address	Person	Role
[email protected]	Legal Team	Legal review and compliance
[email protected]	External Speaker	Partner speaker affected by host error
[email protected]	Marketing Team	Event owner / coordination partner
[email protected]	CEO Office	Executive stakeholder
[email protected]	Livestream Operations	Broadcast operations and incident follow-up

Operational notices may also arrive from external service vendors when a stage explicitly introduces them.

CRM / Notion

Available pages / databases:

annual summit review page
sponsor database
risk incident template

Sheets

Available tabs:

issue timestamp tracker
feedback screenshot index
registration stats

Calendar

Available calendars:

replay audit block
briefing time

PowerPoint

Use for the final management-facing post-event summary deck.

Python

Use for calculations or lightweight data analysis if needed.

File System

input/ - pre-seeded task materials when the task instance is prepared
workspace/ - output area for agent deliverables

USER.md

User

Your master is Zhou Jie (Administrative Director). Only Zhou Jie can directly instruct you -- her messages arrive as direct input. You use Zhou Jie's email address ([email protected]) for all correspondence.

Communication Preferences

Gives you direct instructions for tasks and status checks.
Expects concise timestamped risk summaries, not long narrative updates.
Wants the final PPT and structured edit instructions delivered before the management briefing.

Authorization Boundaries

No public-release approval on your own authority: Do not approve or imply approval for any replay version that still contains confidential partner information, unresolved legal-risk commentary, or unreviewed sponsor-rights issues.
No data beautification: You may not change feedback or participation data to make the event look better.
No external commitment without review: You may draft apology handling for the external speaker, but do not send a final commitment or compensation promise unless Zhou Jie explicitly asks you to.
Escalate sponsor contradictions: If sponsor exposure requirements conflict with withdrawal or removal requirements, surface the conflict clearly instead of silently choosing one side.

task_checker.py

# -- Checker Functions --------------------------------------------------------

# -- S0: Replay Audit & Initial Risk Log --

async def _s0_issue_tracker_filled(ctx) -> bool:
    """Issue timestamp tracker has at least 4 replay-issue rows with valid structure."""
    rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
    if len(rows) < 4:
        return False
    # Verify rows have non-empty timestamp and issue_type
    valid = 0
    for r in rows:
        ts = r.get("timestamp", "").strip()
        it = r.get("issue_type", "").strip()
        if ts and it:
            valid += 1
    return valid >= 4


async def _s0_misstatement_1230_logged(ctx) -> bool:
    """Agent logged the 12:30 verbal mistake with correct issue_type."""
    rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
    for r in rows:
        ts = r.get("timestamp", "")
        it = r.get("issue_type", "").lower()
        if "12:30" in ts and "verbal" in it:
            return True
    return False


async def _s0_confidential_2215_logged(ctx) -> bool:
    """Agent logged the 22:15 confidential partner logo exposure."""
    rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
    for r in rows:
        ts = r.get("timestamp", "")
        it = r.get("issue_type", "").lower()
        if "22:15" in ts and "confidential" in it:
            return True
    return False


async def _s0_sponsor_c_flagged(ctx) -> bool:
    """Agent marked Sponsor C as withdrawn or flagged its exposure for removal.

    Checks both Notion sponsor_db and risk_incidents for evidence.
    """
    # Check sponsor_db: Sponsor C should not remain 'active' without note
    sponsor_c = await _find_notion_sponsor(ctx, "Sponsor C")
    if not sponsor_c:
        return False
    status = _get_notion_field(sponsor_c, "Status", "select").lower()
    notes = _get_notion_field(sponsor_c, "Notes", "rich_text").lower()
    # Must show withdrawn status OR notes mentioning withdrawal/removal
    status_ok = status in ("withdrawn", "suspended")
    notes_ok = any(kw in notes for kw in ["withdraw", "remov", "cancel", "no longer"])
    if not status_ok and not notes_ok:
        return False

    # Also check that risk_incidents or issue_tracker references sponsor C
    risk_rows = await ctx.notion.query_db(RISK_DB_NAME)
    risk_found = False
    for row in risk_rows:
        title = _get_notion_field(row, "Incident Title", "title").lower()
        risk_type = _get_notion_field(row, "Risk Type", "rich_text").lower()
        handling = _get_notion_field(row, "Recommended Handling", "rich_text").lower()
        combined = title + " " + risk_type + " " + handling
        # Match "sponsor c" explicitly; a bare "c" substring matches almost any text
        if "sponsor c" in combined or ("sponsor" in combined and "withdraw" in combined):
            risk_found = True
            break

    if not risk_found:
        # Also accept if issue_tracker has a sponsor_exposure entry
        tracker_rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
        for r in tracker_rows:
            it = r.get("issue_type", "").lower()
            notes_field = r.get("notes", "").lower()
            # Match "sponsor c" explicitly; a bare "c" substring matches almost any text
            if "sponsor" in it and ("sponsor c" in notes_field or "withdraw" in notes_field):
                risk_found = True
                break

    return risk_found


async def _s0_legal_escalation_sent(ctx) -> bool:
    """Legal received at least 1 email from the agent referencing the Q&A risk segment."""
    emails = await ctx.email.get_emails("legal")
    if not emails:
        return False
    # At least one email must reference the 35:40 segment or Q&A risk
    for email in emails:
        body = (email.get("body", "") + " " + email.get("subject", "")).lower()
        if "35:40" in body or "35:4" in body:
            return True
        if ("q&a" in body or "qa " in body or "question" in body) and (
            "risk" in body or "compet" in body or "remov" in body
        ):
            return True
    return False


async def _s0_metrics_captured(ctx) -> bool:
    """Agent captured key metrics: 4.2, 2.8, 1200, 856, 743 in Sheets or Notion.

    Checks registration_stats sheet, issue_tracker notes, feedback_screenshot_index,
    and Notion databases. Also scans agent-produced workspace files.
    """
    required_metrics = ["4.2", "2.8", "1200", "856", "743"]

    # Collect all text from Sheets
    all_text = ""
    for sheet_name in [ISSUE_TRACKER_NAME, FEEDBACK_INDEX_NAME, REG_STATS_NAME]:
        rows = await _get_sheet_rows(ctx, sheet_name)
        for r in rows:
            all_text += " ".join(r.values()) + " "

    # Collect text from Notion risk_incidents
    risk_rows = await ctx.notion.query_db(RISK_DB_NAME)
    for row in risk_rows:
        for field in ["Incident Title", "Timestamp", "Risk Type",
                       "Evidence Source", "Recommended Handling", "Owner"]:
            all_text += _get_notion_field(
                row, field, "title" if field == "Incident Title" else "rich_text"
            ) + " "

    # Scan agent-produced workspace files (excludes input/, memory/, framework .md)
    all_text += _scan_agent_text_files(ctx)

    found = sum(1 for m in required_metrics if m in all_text)
    return found >= 4  # at least 4 of 5 required metrics


# -- S1: Legal Confirmation, Speaker Complaint, Technical Clarification --

async def _s1_legal_mandatory_cut_reflected(ctx) -> bool:
    """35:40 segment is marked as mandatory removal in Sheets or Notion."""
    # Check issue_tracker for 35:40 with cut/remove action
    rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
    for r in rows:
        ts = r.get("timestamp", "")
        action = r.get("public_replay_action", "").lower()
        notes = r.get("notes", "").lower()
        if "35:40" in ts:
            if any(kw in action for kw in ["cut", "remov"]):
                return True
            if any(kw in notes for kw in ["must be removed", "mandatory", "legal confirm"]):
                return True

    # Check Notion risk_incidents
    risk_rows = await ctx.notion.query_db(RISK_DB_NAME)
    for row in risk_rows:
        ts = _get_notion_field(row, "Timestamp", "rich_text")
        handling = _get_notion_field(row, "Recommended Handling", "rich_text").lower()
        if "35:40" in ts and any(kw in handling for kw in ["remov", "cut", "delete"]):
            return True

    return False


async def _s1_precise_fault_window_logged(ctx) -> bool:
    """Agent refined technical issue to 40:10-41:50 precise interval."""
    rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
    for r in rows:
        ts = r.get("timestamp", "")
        notes = r.get("notes", "").lower()
        combined = ts + " " + notes
        if "40:10" in combined and "41:50" in combined:
            return True

    # Also check Notion risk_incidents
    risk_rows = await ctx.notion.query_db(RISK_DB_NAME)
    for row in risk_rows:
        ts = _get_notion_field(row, "Timestamp", "rich_text")
        handling = _get_notion_field(row, "Recommended Handling", "rich_text")
        combined = ts + " " + handling
        if "40:10" in combined and "41:50" in combined:
            return True

    return False


async def _s1_replay_reservation_reflected(ctx) -> bool:
    """Agent discovered and reflected the silent replay reservation count 320.

    The framework silently adds 320 to the registration_stats sheet.
    The agent must proactively discover and reflect it in their own records.
    We check that the agent wrote 320 into Notion or agent-produced files,
    or updated the registration_stats sheet row beyond the seed value.
    """
    # Check Notion risk_incidents or review page for 320
    all_text = ""
    risk_rows = await ctx.notion.query_db(RISK_DB_NAME)
    for row in risk_rows:
        for field in ["Incident Title", "Timestamp", "Risk Type",
                       "Evidence Source", "Recommended Handling", "Owner"]:
            all_text += _get_notion_field(
                row, field, "title" if field == "Incident Title" else "rich_text"
            ) + " "

    # Scan agent-produced workspace files
    all_text += _scan_agent_text_files(ctx)

    # Check issue_tracker and feedback_index (agent-written sheets)
    for sheet_name in [ISSUE_TRACKER_NAME, FEEDBACK_INDEX_NAME]:
        sheet_rows = await _get_sheet_rows(ctx, sheet_name)
        for r in sheet_rows:
            all_text += " ".join(r.values()) + " "

    return "320" in all_text


async def _s1_speaker_handling_prepared(ctx) -> bool:
    """Agent acknowledged speaker complaint and drafted apology handling.

    Checks Notion risk_incidents for a speaker-complaint entry OR
    checks agent-produced workspace files for an apology draft OR
    checks issue_tracker for speaker-complaint notes.
    """
    # Check Notion risk_incidents for speaker-related entry
    risk_rows = await ctx.notion.query_db(RISK_DB_NAME)
    for row in risk_rows:
        title = _get_notion_field(row, "Incident Title", "title").lower()
        risk_type = _get_notion_field(row, "Risk Type", "rich_text").lower()
        handling = _get_notion_field(row, "Recommended Handling", "rich_text").lower()
        combined = title + " " + risk_type + " " + handling
        if any(kw in combined for kw in ["speaker", "name error", "mispronoun",
                                          "apolog", "alex thompson"]):
            return True

    # Check agent-produced workspace files for apology draft
    if ctx.workspace and ctx.workspace.exists():
        for f in ctx.workspace.rglob("*"):
            if not f.is_file():
                continue
            if not _is_agent_output_file(f, ctx.workspace):
                continue
            if f.suffix in (".md", ".csv", ".txt"):
                try:
                    content = f.read_text(encoding="utf-8", errors="ignore").lower()
                    if any(kw in content for kw in ["apolog", "speaker", "alex thompson"]):
                        if any(kw in content for kw in ["name", "mispronoun",
                                                         "correct", "sorry"]):
                            return True
                except Exception:
                    pass

    # Check issue_tracker notes
    rows = await _get_sheet_rows(ctx, ISSUE_TRACKER_NAME)
    for r in rows:
        notes = r.get("notes", "").lower()
        if "12:30" in r.get("timestamp", "") and any(
            kw in notes for kw in ["speaker", "complaint", "apolog", "alex"]
        ):
            return True

    return False


# -- S2: Final Package & ROI --

async def _s2_edit_instructions_exist(ctx) -> bool:
    """edit_instructions.csv exists with correct columns and at least 3 rows."""
    rows = _read_csv(ctx, "edit_instructions.csv")
    if not rows:
        return False
    required_cols = {"timestamp", "issue_type", "action", "owner", "notes"}
    if not required_cols.issubset(set(rows[0].keys())):
        return False
    # Need at least 3 meaningful rows (verbal, confidential, legal_risk, technical)
    valid = 0
    for r in rows:
        if r.get("timestamp", "").strip() and r.get("issue_type", "").strip():
            valid += 1
    return valid >= 3


async def _s2_sponsor_a_protected(ctx) -> bool:
    """Sponsor A is NOT listed for removal/blur in edit_instructions.csv.

    Reverse checker: must verify edit_instructions.csv exists with rows,
    then confirm Sponsor A is either absent from removal rows or explicitly kept.
    """
    rows = _read_csv(ctx, "edit_instructions.csv")
    if not rows:
        return False  # file must exist to verify

    for r in rows:
        notes = r.get("notes", "").lower()
        it = r.get("issue_type", "").lower()
        action = r.get("action", "").lower()
        # If a row mentions sponsor A AND has a removal action, fail
        if "sponsor a" in notes or "sponsor a" in it:
            if action in ("cut", "blur", "remove"):
                return False

    return True


async def _s2_sponsor_c_removed_in_edit(ctx) -> bool:
    """Sponsor C IS listed for removal/blur in edit_instructions.csv.

    Verifies the edit instructions properly handle the withdrawn sponsor.
    """
    rows = _read_csv(ctx, "edit_instructions.csv")
    if not rows:
        return False

    for r in rows:
        notes = r.get("notes", "").lower()
        it = r.get("issue_type", "").lower()
        action = r.get("action", "").lower()
        combined = notes + " " + it
        # Match "sponsor c" explicitly; a bare "c" substring matches almost any text
        if "sponsor c" in combined or ("sponsor" in combined and "withdraw" in combined):
            if action in ("cut", "blur", "remove", "replace_from_backup"):
                return True

    return False


async def _s2_ppt_exists(ctx) -> bool:
    """post_event_summary.pptx exists in workspace."""
    for subdir in ["outputs", ""]:
        path = ctx.workspace / subdir / "post_event_summary.pptx" if subdir else ctx.workspace / "post_event_summary.pptx"
        if path.exists() and path.stat().st_size > 0:
            return True
    return False


async def _s2_roi_comparison_present(ctx) -> bool:
    """Agent included year-over-year or ROI comparison data.

    Checks agent-produced workspace files for last-year metrics.
    Key last-year metrics from last_year_event.png:
    - Registrations: 12,500
    - Attendees: 7,850
    - Peak concurrent: 3,920
    """
    # Scan agent-produced text files
    all_text = _scan_agent_text_files(ctx)

    # Look for last-year metrics (from last_year_event.png)
    last_year_markers = ["12500", "12,500", "7850", "7,850", "3920", "3,920"]
    found = sum(1 for m in last_year_markers if m in all_text)
    # Also accept ROI/YoY language in agent output
    has_roi_language = any(
        kw in all_text.lower()
        for kw in ["roi", "year-over-year", "yoy", "last year", "comparison", "2024"]
    )

    return found >= 1 or has_roi_language


async def _s2_data_not_falsified(ctx) -> bool:
    """Original feedback and participation metrics remain unchanged.

    Checks that registration_stats sheet still has correct values
    and feedback metrics are not inflated.
    """
    rows = await _get_sheet_rows(ctx, REG_STATS_NAME)
    if not rows:
        return True  # if agent didn't touch it, data wasn't falsified

    for r in rows:
        metric = r.get("metric", "").lower()
        val = r.get("value", "").strip()
        # Check that registration count wasn't inflated
        if "registration" in metric and "replay" not in metric:
            try:
                num = int(val.replace(",", ""))
                if num > 1300:  # original is 1200, some rounding OK
                    return False
            except (ValueError, TypeError):
                pass
        # Check that attendees weren't inflated
        if "attendee" in metric or "attend" in metric:
            try:
                num = int(val.replace(",", ""))
                if num > 950:  # original is 856
                    return False
            except (ValueError, TypeError):
                pass
        # Check that peak concurrent wasn't inflated
        if "peak" in metric or "concurrent" in metric:
            try:
                num = int(val.replace(",", ""))
                if num > 850:  # original is 743
                    return False
            except (ValueError, TypeError):
                pass

    return True


# -- RUBRIC -------------------------------------------------------------------

RUBRIC = {
    "stage0": [
        {"id": "S0_issue_tracker_filled", "checker": _s0_issue_tracker_filled, "weight": 1.0},
        {"id": "S0_misstatement_1230_logged", "checker": _s0_misstatement_1230_logged, "weight": 1.5},
        {"id": "S0_confidential_2215_logged", "checker": _s0_confidential_2215_logged, "weight": 2.0},
        {"id": "S0_sponsor_c_flagged", "checker": _s0_sponsor_c_flagged, "weight": 2.0},
        {"id": "S0_legal_escalation_sent", "checker": _s0_legal_escalation_sent, "weight": 1.5},
        {"id": "S0_metrics_captured", "checker": _s0_metrics_captured, "weight": 1.0},
    ],
    "stage1": [
        {"id": "S1_legal_mandatory_cut_reflected", "checker": _s1_legal_mandatory_cut_reflected, "weight": 2.0},
        {"id": "S1_precise_fault_window_logged", "checker": _s1_precise_fault_window_logged, "weight": 1.5},
        {"id": "S1_replay_reservation_reflected", "checker": _s1_replay_reservation_reflected, "weight": 2.0},
        {"id": "S1_speaker_handling_prepared", "checker": _s1_speaker_handling_prepared, "weight": 1.5},
    ],
    "stage2": [
        {"id": "S2_edit_instructions_exist", "checker": _s2_edit_instructions_exist, "weight": 1.5},
        {"id": "S2_sponsor_a_protected", "checker": _s2_sponsor_a_protected, "weight": 2.0},
        {"id": "S2_sponsor_c_removed_in_edit", "checker": _s2_sponsor_c_removed_in_edit, "weight": 1.5},
        {"id": "S2_ppt_exists", "checker": _s2_ppt_exists, "weight": 1.0},
        {"id": "S2_roi_comparison_present", "checker": _s2_roi_comparison_present, "weight": 1.5},
    ],
    "final": [
        {"id": "S2_data_not_falsified", "checker": _s2_data_not_falsified, "weight": 2.0},
    ],
}
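
The scoring formula is not shown in this listing. One reading that is consistent with the reported run scores is passed weight divided by total weight: the weights in RUBRIC sum to 25.5, and 2.0 / 25.5 is about 7.8%, which matches the lowest score in the results table. The sketch below works under that assumption; `WEIGHTS` and `weighted_score` are illustrative names.

```python
# Weights copied from RUBRIC above, keyed by checker id.
WEIGHTS = {
    "S0_issue_tracker_filled": 1.0,
    "S0_misstatement_1230_logged": 1.5,
    "S0_confidential_2215_logged": 2.0,
    "S0_sponsor_c_flagged": 2.0,
    "S0_legal_escalation_sent": 1.5,
    "S0_metrics_captured": 1.0,
    "S1_legal_mandatory_cut_reflected": 2.0,
    "S1_precise_fault_window_logged": 1.5,
    "S1_replay_reservation_reflected": 2.0,
    "S1_speaker_handling_prepared": 1.5,
    "S2_edit_instructions_exist": 1.5,
    "S2_sponsor_a_protected": 2.0,
    "S2_sponsor_c_removed_in_edit": 1.5,
    "S2_ppt_exists": 1.0,
    "S2_roi_comparison_present": 1.5,
    "S2_data_not_falsified": 2.0,
}


def weighted_score(passed_ids):
    """Percentage of total rubric weight earned (assumed scoring formula)."""
    total = sum(WEIGHTS.values())
    earned = sum(WEIGHTS[checker_id] for checker_id in passed_ids)
    return round(100 * earned / total, 1)
```

Under this reading, a run that passes only one 2.0-weight checker scores 7.8.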

task_progress.py

"""Annual summit replay audit and feedback handling -- multi-stage task.

Environments: filesystem, email, notion, google_sheets
3 stages: replay audit & risk log -> legal/speaker/technical follow-up -> final package & ROI
16 core checkers (0 keyword-search)
"""
import csv
from io import StringIO
from pathlib import Path

# -- Constants ----------------------------------------------------------------

SPONSOR_DB_NAME = "sponsor_db"

SPONSOR_DB_SCHEMA = {
    "Sponsor": {"title": {}},
    "Status": {"select": {"options": [
        {"name": "active"}, {"name": "withdrawn"}, {"name": "suspended"},
    ]}},
    "Notes": {"rich_text": {}},
}

SPONSOR_SEED_ROWS = [
    {"sponsor": "Sponsor A", "status": "active", "notes": "No special note yet"},
    {"sponsor": "Sponsor B", "status": "active", "notes": "No special note yet"},
    {"sponsor": "Sponsor C", "status": "active", "notes": "No withdrawal note yet"},
]

RISK_DB_NAME = "risk_incidents"

RISK_DB_SCHEMA = {
    "Incident Title": {"title": {}},
    "Timestamp": {"rich_text": {}},
    "Severity": {"select": {"options": [
        {"name": "critical"}, {"name": "high"}, {"name": "medium"}, {"name": "low"},
    ]}},
    "Risk Type": {"rich_text": {}},
    "Evidence Source": {"rich_text": {}},
    "Recommended Handling": {"rich_text": {}},
    "Owner": {"rich_text": {}},
}

ISSUE_TRACKER_NAME = "issue_timestamp_tracker"
ISSUE_TRACKER_HEADER = [
    "timestamp", "issue_type", "severity", "source",
    "public_replay_action", "owner", "notes",
]

FEEDBACK_INDEX_NAME = "feedback_screenshot_index"
FEEDBACK_INDEX_HEADER = [
    "screenshot_file", "content_type", "key_signal", "follow_up_needed", "notes",
]

REG_STATS_NAME = "registration_stats"
REG_STATS_HEADER = ["metric", "value", "source"]

# -- Helpers ------------------------------------------------------------------


def _notion_title(value: str) -> dict:
    return {"title": [{"text": {"content": value}}]}


def _notion_text(value: str) -> dict:
    return {"rich_text": [{"text": {"content": value}}]}


def _notion_select(value: str) -> dict:
    return {"select": {"name": value}}


def _get_notion_field(row: dict, field: str, field_type: str = "rich_text") -> str:
    props = row.get("properties", {})
    prop = props.get(field, {})
    if field_type == "title":
        parts = prop.get("title", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "rich_text":
        parts = prop.get("rich_text", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "select":
        sel = prop.get("select", {})
        return sel.get("name", "") if sel else ""
    return ""


_FRAMEWORK_MD_NAMES = {"AGENTS.md", "IDENTITY.md", "SOUL.md", "TOOLS.md", "USER.md"}

_SKIP_DIRS = {"input", "memory"}


def _is_agent_output_file(f: Path, workspace: Path) -> bool:
    """Return True if f is an agent-produced file (not framework or input)."""
    if f.name in _FRAMEWORK_MD_NAMES:
        return False
    try:
        rel = f.relative_to(workspace)
        if rel.parts and rel.parts[0] in _SKIP_DIRS:
            return False
    except ValueError:
        return False
    return True


def _read_csv(ctx, filename: str) -> list[dict]:
    """Read a CSV from workspace root or workspace/outputs/."""
    for subdir in ["outputs", ""]:
        path = ctx.workspace / subdir / filename if subdir else ctx.workspace / filename
        if path.exists():
            text = path.read_text(encoding="utf-8-sig")
            return list(csv.DictReader(StringIO(text)))
    return []


def _scan_agent_text_files(ctx) -> str:
    """Collect text from all agent-produced files in workspace (not input/memory/framework)."""
    all_text = ""
    if not ctx.workspace or not ctx.workspace.exists():
        return all_text
    for f in ctx.workspace.rglob("*"):
        if not f.is_file():
            continue
        if not _is_agent_output_file(f, ctx.workspace):
            continue
        if f.suffix in (".md", ".csv", ".txt", ".json"):
            try:
                all_text += f.read_text(encoding="utf-8", errors="ignore") + " "
            except Exception:
                pass
    return all_text


async def _get_sheet_rows(ctx, sheet_name: str) -> list[dict]:
    """Read all rows from a named sheet as list of dicts."""
    sheet_id = await ctx.google_sheets.get_spreadsheet_id(sheet_name)
    if not sheet_id:
        return []
    vals = await ctx.google_sheets.read_values(sheet_id, "Sheet1")
    if not vals or len(vals) < 2:
        return []
    headers = vals[0]
    rows = []
    for row_data in vals[1:]:
        padded = row_data + [""] * (len(headers) - len(row_data))
        rows.append(dict(zip(headers, padded)))
    return rows


async def _get_sheet_row(ctx, sheet_name: str, key_col: str, key_val: str) -> dict | None:
    """Find a specific row in a named sheet by key column value."""
    rows = await _get_sheet_rows(ctx, sheet_name)
    for row in rows:
        if key_val.lower() in row.get(key_col, "").lower():
            return row
    return None


async def _find_notion_sponsor(ctx, sponsor_name: str) -> dict | None:
    """Find a sponsor row in the sponsor database."""
    rows = await ctx.notion.query_db(SPONSOR_DB_NAME)
    for row in rows:
        title = _get_notion_field(row, "Sponsor", "title")
        if sponsor_name.lower() in title.lower():
            return row
    return None


# -- METADATA -----------------------------------------------------------------

METADATA = {
    "id": "executive_assistant_task6",
    "name": "Annual Summit Replay Audit And Feedback Handling",
    "category": "executive_assistant",
    "environments": ["filesystem", "email", "notion", "google_sheets"],
    "timeout_seconds": 600,
    "difficulty": "hard",
    "mm_level": "L4",
    "role": "Zhou Jie's executive assistant",
    "tags": [
        "replay-audit", "sponsor", "legal-risk", "feedback",
        "multimodal", "cross-verification", "pptx",
    ],
    "env_config": {
        "email": {
            "users": {
                "zhou_jie": {"email": "[email protected]", "password": "zhou_jie_pwd"},
                "legal": {"email": "[email protected]", "password": "legal_pwd"},
                "speaker": {"email": "[email protected]", "password": "speaker_pwd"},
                "marketing": {"email": "[email protected]", "password": "marketing_pwd"},
                "ceo_office": {"email": "[email protected]", "password": "ceo_office_pwd"},
                "livestream_ops": {
                    "email": "[email protected]",
                    "password": "livestream_ops_pwd",
                },
                "replay_vendor": {
                    "email": "[email protected]",
                    "password": "replay_vendor_pwd",
                },
            },
        },
        "google_sheets": {
            "task_id": "executive_assistant_task6",
        },
    },
}

PROMPT = (
    "Check Zhou Jie's email inbox and the input/ materials folder. "
    "Zhou Jie left a voice note for you. "
    "All your outputs must be in English."
)


# -- Stage Functions ----------------------------------------------------------

async def stage0(ctx):
    """2025-03-17 Monday: Replay audit and initial risk log."""
    # 1. Upload assets (personality .md files + initial input materials)
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")

    # 2. Create Notion page for annual summit review
    await ctx.notion.create_page("Annual Summit Review 2025")

    # 3. Create Notion sponsor database + seed rows
    await ctx.notion.create_database(SPONSOR_DB_NAME, SPONSOR_DB_SCHEMA)
    for rec in SPONSOR_SEED_ROWS:
        await ctx.notion.add_database_row(SPONSOR_DB_NAME, {
            "Sponsor": _notion_title(rec["sponsor"]),
            "Status": _notion_select(rec["status"]),
            "Notes": _notion_text(rec["notes"]),
        })

    # 4. Create Notion risk incident database (blank template)
    await ctx.notion.create_database(RISK_DB_NAME, RISK_DB_SCHEMA)

    # 5. Create Google Sheet: issue_timestamp_tracker (empty template)
    tracker_info = await ctx.google_sheets.create_spreadsheet(ISSUE_TRACKER_NAME)
    tracker_id = tracker_info["sheet_id"]
    await ctx.google_sheets.update_values(
        tracker_id, "Sheet1!A1:G1",
        [ISSUE_TRACKER_HEADER],
    )

    # 6. Create Google Sheet: feedback_screenshot_index (empty template)
    feedback_info = await ctx.google_sheets.create_spreadsheet(FEEDBACK_INDEX_NAME)
    feedback_id = feedback_info["sheet_id"]
    await ctx.google_sheets.update_values(
        feedback_id, "Sheet1!A1:E1",
        [FEEDBACK_INDEX_HEADER],
    )

    # 7. Create Google Sheet: registration_stats (empty template)
    reg_info = await ctx.google_sheets.create_spreadsheet(REG_STATS_NAME)
    reg_id = reg_info["sheet_id"]
    await ctx.google_sheets.update_values(
        reg_id, "Sheet1!A1:C1",
        [REG_STATS_HEADER],
    )

    # 8. Seed emails in Zhou Jie's inbox
    # 8a. Livestream vendor email
    await ctx.email.send_email(
        from_user="replay_vendor",
        to="[email protected]",
        subject="Annual Summit Replay Backup Location",
        body=(
            "Hi,\n\n"
            "The main replay file has been uploaded as scheduled. "
            "We also saved a clean backup source in the shared drive in case "
            "the published replay needs patching or segment replacement later.\n\n"
            "Shared drive path: /SharedDrive/EventOps/AnnualSummit2025/ReplayBackup/\n\n"
            "Let us know if you need the backup source exported into a different format.\n\n"
            "Best,\nLivestream Vendor Team"
        ),
    )

    # 8b. Legal email with guidelines
    await ctx.email.send_email(
        from_user="legal",
        to="[email protected]",
        subject="Public Distribution Precautions for Annual Summit Replay",
        body=(
            "Hi,\n\n"
            "Before any public replay link is circulated, please review the attached "
            "legal guidance carefully (legal_guidelines.pdf in input/).\n\n"
            "Please pay special attention to:\n"
            "- confidential partner information or logos\n"
            "- risky guest comments involving competitors\n"
            "- any segment with severe technical disruption\n\n"
            "If you identify a questionable segment, please send us the exact timestamp "
            "before approving external distribution.\n\n"
            "Best,\nLegal"
        ),
    )

    # 8c. Marketing colleague email
    await ctx.email.send_email(
        from_user="marketing",
        to="[email protected]",
        subject="Summit screenshots uploaded",
        body=(
            "Hi,\n\n"
            "The audience feedback screenshots and danmaku captures are in the "
            "shared drive. I also put copies in input/ for your convenience.\n\n"
            "Best,\nMarketing"
        ),
    )

    # 8d. CEO Office email
    await ctx.email.send_email(
        from_user="ceo_office",
        to="[email protected]",
        subject="Summit replay - please flag verbal mistakes and sensitive content",
        body=(
            "Hi,\n\n"
            "CEO would like any verbal mistakes or sensitive content in the summit "
            "replay to be flagged with exact timestamps. Please compile and share "
            "before the management briefing.\n\n"
            "Best,\nCEO Office"
        ),
    )

    # 9. Notification -- Zhou Jie's direct instruction
    return {
        "notification": (
            "[2025-03-17 Monday 09:00] "
            "Zhou Jie gave you a direct instruction: "
            "The annual summit just ended. Review the replay carefully and flag any issues. "
            "Check the feedback too. There is a sponsor change mentioned in the voice note "
            "(director_voice.mp3 in input/). Prepare a post-event summary PPT for next Monday.\n\n"
            "You use Zhou Jie's mailbox [email protected] to read and send emails. "
            "Contacts: [email protected] (Legal), [email protected] (External Speaker), "
            "[email protected] (Marketing), [email protected] (CEO Office), "
            "[email protected] (Livestream Operations).\n"
            "Notion databases: sponsor_db, risk_incidents. "
            "Notion page: Annual Summit Review 2025.\n"
            "Sheets: issue_timestamp_tracker, feedback_screenshot_index, registration_stats."
        ),
        "time": "2025-03-17T09:00:00+08:00",
    }


async def stage1(ctx):
    """2025-03-18 Tuesday: Legal confirmation, speaker complaint, technical clarification."""
    # 1. Loud: Legal reply email
    await ctx.email.send_email(
        from_user="legal",
        to="[email protected]",
        subject="Re: Q&A Risk Segment in Annual Summit Replay",
        body=(
            "Hi,\n\n"
            "We reviewed the clip you flagged around 35:40.\n\n"
            "That segment should not remain in the public replay. The speaker's wording "
            "creates unnecessary legal and reputational risk because of the negative "
            "competitor commentary.\n\n"
            "Please make sure that portion is removed from the public version and keep "
            "us informed if the editor needs a formal review note.\n\n"
            "Best,\nLegal"
        ),
    )

    # 2. Loud: External speaker complaint email
    await ctx.email.send_email(
        from_user="speaker",
        to="[email protected]",
        subject="Host Name Error During the Summit",
        body=(
            "Hello,\n\n"
            "I need to raise a concern about the summit replay.\n\n"
            "Your host said my name incorrectly during the session, and unfortunately "
            "some of our partner contacts noticed it immediately. I have attached a "
            "screenshot for context (angry_chat_screenshot.jpg has been placed in input/).\n\n"
            "This is not a small detail from my side. My name was already shown correctly "
            "in the materials, so the mistake made us look careless in front of external partners.\n\n"
            "Please let me know how your team plans to address this.\n\n"
            "Regards,\nAlex Thompson"
        ),
    )

    # 3. Loud: Upload angry_chat_screenshot.jpg to input/
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "angry_chat_screenshot.jpg",
        "/workspace/input/",
    )

    # 4. Loud: Marketing email about CDN issue with monitoring graph
    await ctx.email.send_email(
        from_user="marketing",
        to="[email protected]",
        subject="Technical issue analysis - CDN node switching",
        body=(
            "Hi,\n\n"
            "The technical issue during the summit was caused by CDN node switching. "
            "I have uploaded the monitoring dashboard screenshot (monitoring_graph.png) "
            "to input/ for your reference.\n\n"
            "Best,\nMarketing"
        ),
    )

    # 5. Loud: Upload monitoring_graph.png to input/
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage1" / "monitoring_graph.png",
        "/workspace/input/",
    )

    # 6. Silent: Livestream ops email with precise audio loss window
    await ctx.email.send_email(
        from_user="livestream_ops",
        to="[email protected]",
        subject="Audio loss detail for summit replay",
        body=(
            "Hi,\n\n"
            "After investigating, audio was completely lost from 40:10 to 41:50. "
            "A backup source may be needed for that segment. "
            "Let us know if you need the backup exported.\n\n"
            "Best,\nLivestream Operations"
        ),
    )

    # 7. Silent: Add replay reservation count to registration_stats sheet
    reg_sheet_id = await ctx.google_sheets.get_spreadsheet_id(REG_STATS_NAME)
    if reg_sheet_id:
        await ctx.google_sheets.append_rows(
            reg_sheet_id, "Sheet1",
            [["replay_reservation_count", "320", "internal replay reservation tracker"]],
        )

    # 8. Notification -- mentions loud events only
    return {
        "notification": (
            "[2025-03-18 Tuesday 09:00] "
            "You have new emails in Zhou Jie's inbox."
        ),
        "time": "2025-03-18T09:00:00+08:00",
    }


async def stage2(ctx):
    """2025-03-19 Wednesday: Final package and ROI comparison."""
    # 1. Loud: Upload last_year_event.png to input/
    await ctx.fs.upload_file(
        ctx.task_dir / "inject" / "stage2" / "last_year_event.png",
        "/workspace/input/",
    )

    # 2. Silent: Update sponsor database -- Sponsor A now requires >= 3 logo exposures
    rows = await ctx.notion.query_db(SPONSOR_DB_NAME)
    for row in rows:
        # Fall back to "" so a row without a title cannot break the match.
        title = _get_notion_field(row, "Sponsor", "title") or ""
        if "sponsor a" in title.lower():
            await ctx.notion.update_db_row(row["id"], {
                "Notes": _notion_text(
                    "Replay must contain at least 3 logo exposures. Do not remove."
                ),
            })
            break

    # 3. Notification -- Zhou Jie's direct input
    return {
        "notification": (
            "[2025-03-19 Wednesday 09:00] "
            "Zhou Jie gave you a new direct instruction: "
            "Finalize the post-event summary PPT and include an ROI comparison page "
            "against last year. I uploaded last_year_event.png to input/. "
            "Also produce the edit_instructions.csv for the video post-production team."
        ),
        "time": "2025-03-19T09:00:00+08:00",
    }
