Roles/ecommerce/task1

task_summary.txtE-Commerce · task1

Batch defect triage and supplier escalation for Chen Jie's QC assistant. Mon 3/18: familiarize, then complaint files arrive. Tue 3/19: Supplier A sends sub-batch detail; fresh complaints and a lot-240302 return-rate uptick appear. Mon 3/25: escalate to procurement director Wang Lei and file the weekly report.

Model Runs

5 models evaluated on this task, 3 independent runs each.

Model	Score (Avg@3)	Run 1	Run 2	Run 3
MiniMax M2.7 MiniMax	17.5%	22.5%	15.0%	15.0%
Qwen3.6 Plus Alibaba	16.7%	20.0%	10.0%	20.0%
Claude Sonnet 4.6 Anthropic	15.0%	15.0%	5.0%	25.0%
Gemini 3.1 Pro Preview Google	15.0%	20.0%	5.0%	20.0%
GPT-5.4 OpenAI	11.7%	15.0%	10.0%	10.0%

Input Files5

📄brand_guidelines.pdf

Download

📁orders.db

Download

📄returns_policy_v2.pdf

Download

📊sales_by_lot.xlsx

Download

📝sku_lot_map.csv

Download

IDENTITY.md

Name: Xiao An (An)
Role: Digital assistant for the Curato QC team

AGENTS.md

Every Session

Read SOUL.md
Read USER.md
Read memory/ for recent context (if exists)

Memory

You wake up fresh each session. These files are your continuity:

memory/YYYY-MM-DD.md — raw logs of what happened
MEMORY.md — your curated long-term memory

Capture what matters. If you want to remember something, WRITE IT TO A FILE.

"Mental notes" don't survive session restarts. Files do.

Safety

Safe to do freely: read files, query databases, compute, organize within workspace, send emails to internal team and suppliers, update Notion/Sheets, classify complaints
Requires manager approval: compensation amounts > ¥500, batch recall decisions, any direct communication with consumers
Never do: count competitor or invalid complaints in QC statistics

Your Work

You assist the QC team with day-to-day quality control, return disputes, and recall management.

Your work involves multiple information sources: Feishu, Email, Notion, Sheets, filesystem, database. These systems may change at any time. The complaints folder may receive new files, the inbox may get replies.

Output Specification

All output files go in the workspace root directory under outputs/.

Complaint Analysis Report (Task 1)

Filename: complaint_analysis.csv

Column	Description
complaint_id	Complaint ID (starting from CMP-001)
source	Source (review_screenshot / return_photo / video / audio / json)
order_id	Associated order ID
sku	SKU number
lot_id	Lot number
damage_type	Defect type (scratch / crack / discolor / seal_broken / functional / other)
severity	Severity (minor / major / critical)
is_valid	Valid complaint (YES / NO / PENDING)
root_cause	Root cause attribution (manufacturing / transport / storage / packaging / user_misuse / unknown)
exclusion_reason	Exclusion reason (if any)
notes	Notes

SOUL.md

You

Detail-oriented and rigorous. You excel at spotting inconsistencies across data and evidence. Cross-modal information verification is your core competency — when different sources disagree, you notice. Honest and compliant — you never ignore anomalies to make the data look better.

Work Style

Stay in character but stay focused on the job. No distractions during work. Base every judgment on file evidence, never on memory alone.

Communication

Speak with data and give clear conclusions. When reporting issues, lead with the conclusion, then provide the evidence chain. If one sentence covers it, don't split it into three.

Trust

QC work involves supplier relationships and consumer rights — stay fair and objective. External actions — emailing suppliers, requesting warehouse freezes — pause and consider whether to send and whether the wording is appropriate. Internal actions — reading files, querying data, analyzing, organizing — go ahead confidently.

TOOLS.md

Available Environments

Email

Your address: [email protected]
Chen Jie (QC Specialist): [email protected]
Zhao Qiang (QC Manager): [email protected]
Supplier A (Mingfeng): [email protected]
Wang Lei (Procurement Director): [email protected]
Warehouse: [email protected]

Notion

Database: qc_events — Quality event case tracking

Google Sheets

Spreadsheet: return_rates — Return rate tracking per lot

Filesystem

/workspace/input/ — Reference documents and input materials
- returns_policy_v2.pdf — Return & QC policy
- brand_guidelines.pdf — Brand visual identity
- sales_by_lot.xlsx — Sales data by lot
- sku_lot_map.csv — SKU to lot mapping
- orders.db — SQLite order database (~2000 rows), query with sqlite3
/workspace/input/complaints/ — Complaint files (may receive new files over time)

Terminal

Full bash access for computation, SQLite queries, file processing

USER.md

Team: Quality Control (QC) Team
Manager: Zhao Qiang (QC Manager) — oversees quality control and recalls
Members:
- Chen Jie (QC Specialist): day-to-day QC, review monitoring, supplier communication
- Liu Fang (After-Sales Supervisor): return dispute arbitration, customer disputes
Communication: Feishu for internal chat; email for external communication with suppliers, warehouses, and legal
Management style: Delegating — routine classification, data cleanup, and notification dispatch need no approval, but the following must be reported first:
- Classifying an issue as a batch quality event (case filing)
- Compensation amounts > ¥500 or involving legal
- Recall / freeze decisions
- Any direct communication with consumers

task_checker.py

# ── Checker Functions ─────────────────────────────────────────────

# -- S1: Complaint Analysis --


async def _s1_competitor_excluded(ctx):
    """review_01.png complaint marked as invalid (competitor product)."""
    rows = _read_csv(ctx, "complaint_analysis.csv")
    # review_01.png shows a product with AETERNA (competitor) logo
    row = _find_csv_row(rows, "source", "review_01")
    if not row:
        return False
    return row.get("is_valid", "").upper() == "NO"


async def _s1_dedup_applied(ctx):
    """No duplicate order_ids among valid complaints in CSV."""
    rows = _read_csv(ctx, "complaint_analysis.csv")
    valid_order_ids = [
        r.get("order_id", "").strip()
        for r in rows
        if r.get("is_valid", "").upper() == "YES"
        and r.get("order_id", "").strip()
    ]
    return len(valid_order_ids) == len(set(valid_order_ids))


async def _s1_denominator_corrected(ctx):
    """Sheets lot-240301 shipped_units corrected from 1200 to ~1082."""
    sheet_id = await ctx.google_sheets.get_spreadsheet_id("return_rates")
    if not sheet_id:
        return False
    vals = await ctx.google_sheets.read_values(sheet_id, "Sheet1!A2:C4")
    if not vals:
        return False
    for row in vals:
        if len(row) >= 3 and row[0] == "lot-240301":
            try:
                shipped = int(str(row[2]).replace(",", ""))
                # Should be ~1082 (1200 - 118 cancelled), tolerance ±40
                return 1050 <= shipped <= 1120
            except (ValueError, IndexError):
                return False
    return False


async def _s1_crack_root_cause(ctx):
    """Lot-240301 crack attributed to transport/storage (cross-modal reasoning)."""
    rows = _read_csv(ctx, "complaint_analysis.csv")
    for row in rows:
        lot = row.get("lot_id", "").lower()
        is_valid = row.get("is_valid", "").upper()
        damage = row.get("damage_type", "").lower()
        root = row.get("root_cause", "").lower()
        if ("240301" in lot and is_valid == "YES" and damage == "crack"
                and root in ("transport", "storage")):
            return True
    return False


async def _s1_notion_created(ctx):
    """Notion qc_events has at least 1 quality event record."""
    rows = await ctx.notion.query_db(QC_DB_NAME)
    return len(rows) >= 1


# -- S2: Supplier Follow-up --


async def _s2_sub_lot_split(ctx):
    """CSV or Notion reflects sub-batch split (A1/A2) for lot-240301."""
    # Check CSV for sub-lot references
    rows = _read_csv(ctx, "complaint_analysis.csv")
    for row in rows:
        lot = row.get("lot_id", "").strip().lower()
        notes = row.get("notes", "").lower()
        if any(tag in lot for tag in ("a1", "a2", "sub")):
            return True
        if "240301" in lot and ("a1" in notes or "a2" in notes):
            return True
    # Check Notion for sub-lot records
    notion_rows = await ctx.notion.query_db(QC_DB_NAME)
    for row in notion_rows:
        lot_val = _get_notion_field(row, "lot_id").lower()
        notes_val = _get_notion_field(row, "notes").lower()
        if any(tag in lot_val for tag in ("a1", "a2", "sub")):
            return True
        if "a1" in notes_val or "a2" in notes_val:
            return True
    return False


async def _s2_new_complaints_found(ctx):
    """Agent discovered silently injected complaint files (review_04, call_03)."""
    rows = _read_csv(ctx, "complaint_analysis.csv")
    for r in rows:
        source = r.get("source", "").lower()
        lot = r.get("lot_id", "").lower()
        if "review_04" in source or "call_03" in source:
            return True
        # review_04.png is for lot-240302 — if new 302 entries appeared
        if "240302" in lot and "review" in source:
            return True
    return False


async def _s2_lot302_rate_noted(ctx):
    """Agent reacted to lot-240302 return rate increase — created QC event."""
    # Require Notion event (CSV entry alone could be from stage1 complaint processing)
    notion_rows = await ctx.notion.query_db(QC_DB_NAME)
    for row in notion_rows:
        lot = _get_notion_field(row, "lot_id")
        if "240302" in lot:
            return True
    # Also accept: CSV has lot-240302 entry with severity >= major
    rows = _read_csv(ctx, "complaint_analysis.csv")
    for r in rows:
        if "240302" in r.get("lot_id", ""):
            sev = r.get("severity", "").lower()
            if sev in ("major", "critical"):
                return True
    return False


async def _s2_supplier_report_sent(ctx):
    """Supplier A received a quality report email (body > 100 chars)."""
    emails = await ctx.email.get_emails("supplier_a")
    for e in emails:
        if len(e.get("body", "")) > 100:
            return True
    return False


async def _s2_chenjie_updated(ctx):
    """Chen Jie received at least 1 update email from agent."""
    emails = await ctx.email.get_emails("chenjie")
    return len(emails) >= 1


# -- S3: Escalation & Weekly Report --


async def _s3_escalation_cc_wanglei(ctx):
    """Wang Lei (procurement director) received CC'd escalation email."""
    emails = await ctx.email.get_emails("wanglei")
    return len(emails) >= 1


async def _s3_notion_escalated(ctx):
    """At least 1 Notion qc_events record has status 'escalated'."""
    rows = await ctx.notion.query_db(QC_DB_NAME)
    for row in rows:
        status = _get_notion_field(row, "status", "select").lower()
        if status in ("escalated", "escalating"):
            return True
    return False


async def _s3_weekly_report_sent(ctx):
    """Zhao Qiang received at least 1 email (weekly report)."""
    emails = await ctx.email.get_emails("zhaoqiang")
    return len(emails) >= 1


async def _s3_supplier_reminded(ctx):
    """Supplier A received ≥2 emails (quality report + escalation reminder)."""
    emails = await ctx.email.get_emails("supplier_a")
    return len(emails) >= 2


async def _s3_sheets_updated(ctx):
    """Sheets return_rates has been updated beyond initial seed."""
    sheet_id = await ctx.google_sheets.get_spreadsheet_id("return_rates")
    if not sheet_id:
        return False
    vals = await ctx.google_sheets.read_values(sheet_id, "Sheet1!A1:E10")
    if not vals:
        return False
    # Either more rows than original (header + 3), or lot-240301 corrected
    if len(vals) > 4:
        return True
    for row in vals[1:]:
        if len(row) >= 3 and row[0] == "lot-240301":
            try:
                shipped = int(str(row[2]).replace(",", ""))
                if shipped != 1200:
                    return True
            except ValueError:
                pass
    return False


# ── RUBRIC ────────────────────────────────────────────────────────

RUBRIC = {
    "stage1": [
        {"id": "S1_competitor_excluded",
         "checker": _s1_competitor_excluded, "weight": 2.0},
        {"id": "S1_dedup_applied",
         "checker": _s1_dedup_applied, "weight": 1.0},
        {"id": "S1_denominator_corrected",
         "checker": _s1_denominator_corrected, "weight": 2.0},
        {"id": "S1_crack_root_cause",
         "checker": _s1_crack_root_cause, "weight": 1.5},
        {"id": "S1_notion_created",
         "checker": _s1_notion_created, "weight": 1.0},
    ],
    "stage2": [
        {"id": "S2_sub_lot_split",
         "checker": _s2_sub_lot_split, "weight": 1.5},
        {"id": "S2_new_complaints_found",
         "checker": _s2_new_complaints_found, "weight": 1.5},
        {"id": "S2_lot302_rate_noted",
         "checker": _s2_lot302_rate_noted, "weight": 1.0},
        {"id": "S2_supplier_report_sent",
         "checker": _s2_supplier_report_sent, "weight": 1.0},
        {"id": "S2_chenjie_updated",
         "checker": _s2_chenjie_updated, "weight": 1.0},
    ],
    "stage3": [
        {"id": "S3_escalation_cc_wanglei",
         "checker": _s3_escalation_cc_wanglei, "weight": 2.0},
        {"id": "S3_notion_escalated",
         "checker": _s3_notion_escalated, "weight": 1.0},
        {"id": "S3_weekly_report_sent",
         "checker": _s3_weekly_report_sent, "weight": 1.0},
        {"id": "S3_supplier_reminded",
         "checker": _s3_supplier_reminded, "weight": 1.5},
        {"id": "S3_sheets_updated",
         "checker": _s3_sheets_updated, "weight": 1.0},
    ],
}

task_progress.py

"""Batch defect assessment & supplier escalation — multi-environment multi-stage task.

Environments: filesystem, email, notion, google_sheets
4 stages: familiarize → complaint analysis → supplier follow-up → escalation & weekly report
15 core checkers (0 keyword-search)
"""
import csv
from io import StringIO

# ── Constants ─────────────────────────────────────────────────────

QC_DB_NAME = "qc_events"

QC_DB_SCHEMA = {
    "event_id": {"title": {}},
    "lot_id": {"rich_text": {}},
    "sku": {"rich_text": {}},
    "supplier": {"rich_text": {}},
    "event_type": {"select": {"options": [
        {"name": "complaint"}, {"name": "return_spike"},
        {"name": "inspection"},
    ]}},
    "status": {"select": {"options": [
        {"name": "open"}, {"name": "investigating"},
        {"name": "escalated"}, {"name": "resolved"}, {"name": "closed"},
    ]}},
    "severity": {"select": {"options": [
        {"name": "low"}, {"name": "medium"},
        {"name": "high"}, {"name": "critical"},
    ]}},
    "return_rate": {"number": {}},
    "created_date": {"rich_text": {}},
    "notes": {"rich_text": {}},
}

RETURN_RATES_HEADER = [
    "lot_id", "sku", "shipped_units", "return_count", "return_rate",
]
RETURN_RATES_ROWS = [
    ["lot-240301", "HW-K2201", "1200", "18", "1.50%"],
    ["lot-240302", "HW-K2201", "800", "9", "1.13%"],
    ["lot-240315", "HW-K2202", "600", "5", "0.83%"],
]


# ── Helpers ───────────────────────────────────────────────────────

def _notion_title(value: str) -> dict:
    return {"title": [{"text": {"content": value}}]}


def _notion_text(value: str) -> dict:
    return {"rich_text": [{"text": {"content": value}}]}


def _notion_select(value: str) -> dict:
    return {"select": {"name": value}}


def _notion_number(value) -> dict:
    return {"number": value}


def _read_csv(ctx, filename: str) -> list[dict]:
    path = ctx.workspace / "outputs" / filename
    if not path.exists():
        return []
    text = path.read_text(encoding="utf-8-sig")
    return list(csv.DictReader(StringIO(text)))


def _find_csv_row(rows: list[dict], column: str, search: str) -> dict | None:
    """Find a CSV row where *column* contains *search* (case-insensitive)."""
    for row in rows:
        val = row.get(column, "")
        if search.lower() in val.lower():
            return row
    return None


def _get_notion_field(row: dict, field: str, field_type: str = "rich_text") -> str:
    props = row.get("properties", {})
    prop = props.get(field, {})
    if field_type == "title":
        parts = prop.get("title", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "rich_text":
        parts = prop.get("rich_text", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "select":
        sel = prop.get("select", {})
        return sel.get("name", "") if sel else ""
    elif field_type == "number":
        return prop.get("number", 0)
    return ""


# ── METADATA ──────────────────────────────────────────────────────

METADATA = {
    "id": "ecommerce_task1",
    "name": "Batch Defect Assessment & Supplier Escalation",
    "category": "ecommerce",
    "environments": ["filesystem", "email", "notion", "google_sheets"],
    "timeout_seconds": 600,
    "difficulty": "hard",
    "mm_level": "L4",
    "role": "Chen Jie's QC assistant at Curato",
    "tags": [
        "qc", "complaints", "supplier", "defects",
        "multimodal", "cross-modal", "dedup", "escalation",
    ],
    "env_config": {
        "email": {
            "users": {
                "assistant": {
                    "email": "[email protected]",
                    "password": "assistant_pwd",
                },
                "chenjie": {
                    "email": "[email protected]",
                    "password": "chenjie_pwd",
                },
                "supplier_a": {
                    "email": "[email protected]",
                    "password": "supplier_a_pwd",
                },
                "zhaoqiang": {
                    "email": "[email protected]",
                    "password": "zhaoqiang_pwd",
                },
                "wanglei": {
                    "email": "[email protected]",
                    "password": "wanglei_pwd",
                },
            },
        },
        "google_sheets": {
            "task_id": "ecommerce_task1",
        },
    },
}

PROMPT = "Check your email and workspace for new complaint files to process."


# ── Stage Functions ───────────────────────────────────────────────

async def stage0(ctx):
    """2024-03-18 09:15: Environment setup & familiarization."""
    # 1. Upload assets (personality .md files + input materials)
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")

    # 2. Create empty Notion qc_events database
    await ctx.notion.create_page("Quality Events 2024-Q1")
    await ctx.notion.create_database(QC_DB_NAME, QC_DB_SCHEMA)

    # 3. Create Google Sheets return_rates + seed data
    sheet_info = await ctx.google_sheets.create_spreadsheet("return_rates")
    sheet_id = sheet_info["sheet_id"]
    await ctx.google_sheets.update_values(
        sheet_id, "Sheet1!A1:E4",
        [RETURN_RATES_HEADER] + RETURN_RATES_ROWS,
    )

    # 4. Seed email from Chen Jie
    await ctx.email.send_email(
        from_user="chenjie",
        to="[email protected]",
        subject="Return Rate Monitoring",
        body=(
            "Hi An,\n\n"
            "The warehouse says returns have been increasing recently. "
            "Please keep an eye on it and let me know if you see any patterns.\n\n"
            "Best,\nJie"
        ),
    )

    # 5. Notification — includes Feishu message as text
    return {
        "notification": (
            "[2024-03-18 Monday 09:15] You have a new email from Chen Jie "
            "and a Feishu message.\n\n"
            "Your email: [email protected]. "
            "Chen Jie: [email protected]. "
            "Zhao Qiang (manager): [email protected].\n"
            "QC event tracking: Notion database 'qc_events'. "
            "Return rates: Google Sheets 'return_rates'.\n"
            "Reference docs and data in /workspace/input/. "
            "SQLite database at /workspace/input/orders.db "
            "(query with sqlite3).\n\n"
            "[Feishu] Li (Customer Service): "
            "\"Hey An, I've put a batch of files in the complaints/ folder "
            "— recent negative reviews and returns data. "
            "Please take a look when you get a chance. "
            "Some customers sound pretty upset.\"\n\n"
            "Note: The complaints/ folder is currently empty. "
            "New complaint files will arrive shortly."
        ),
        "time": "2024-03-18T09:15:00+08:00",
    }


async def stage1(ctx):
    """2024-03-18 10:30: First batch of complaints arrives."""
    # 1. Inject complaint files into /workspace/input/complaints/
    inject_dir = ctx.task_dir / "inject" / "stage1"
    await ctx.fs.upload_dir(inject_dir, "/workspace/input/complaints")

    # 2. Notification — Feishu from Li
    return {
        "notification": (
            "[2024-03-18 Monday 10:30] New complaint files have arrived.\n\n"
            "[Feishu] Li (Customer Service): "
            "\"New complaints are in the complaints/ folder. "
            "Please analyze and classify them. "
            "If it looks like a batch issue, file a case.\"\n\n"
            "Policy reference: /workspace/input/returns_policy_v2.pdf. "
            "Data: /workspace/input/sales_by_lot.xlsx "
            "and /workspace/input/orders.db."
        ),
        "time": "2024-03-18T10:30:00+08:00",
    }


async def stage2(ctx):
    """2024-03-19 14:00: Supplier reply + new silent complaints."""
    # 1. Loud: Supplier A sends sub-batch detail email
    await ctx.email.send_email(
        from_user="supplier_a",
        to="[email protected]",
        subject="RE: Quality Report — lot-240301 Sub-Batch Detail",
        body=(
            "Hi,\n\n"
            "Regarding lot-240301, we've traced the production records and "
            "found the lot was split into two sub-batches:\n\n"
            "- A1: Produced 2024-03-01, Line 1, new mold, 600 units\n"
            "- A2: Produced 2024-03-03, Line 2, old mold, 600 units\n\n"
            "The sub-batch detail spreadsheet is attached.\n"
            "Please refer to the file at "
            "/workspace/input/complaints/sub_batch_detail.xlsx\n\n"
            "Regards,\nMingfeng Supply Chain"
        ),
    )

    # 2. Upload sub_batch_detail.xlsx as email attachment equivalent
    inject_s2 = ctx.task_dir / "inject" / "stage2"
    await ctx.fs.upload_file(
        inject_s2 / "sub_batch_detail.xlsx",
        "/workspace/input/complaints/sub_batch_detail.xlsx",
    )

    # 3. Silent: Inject new complaint files (agent should discover)
    await ctx.fs.upload_file(
        inject_s2 / "review_04.png",
        "/workspace/input/complaints/review_04.png",
    )
    await ctx.fs.upload_file(
        inject_s2 / "call_03.mp3",
        "/workspace/input/complaints/call_03.mp3",
    )

    # 4. Silent: Update Sheets return_count for lot-240302 (9 → 14)
    sheet_id = await ctx.google_sheets.get_spreadsheet_id("return_rates")
    if sheet_id:
        await ctx.google_sheets.update_values(
            sheet_id, "Sheet1!D3", [["14"]],
        )
        await ctx.google_sheets.update_values(
            sheet_id, "Sheet1!E3", [["1.75%"]],
        )

    # 5. Notification — only mentions the loud event
    return {
        "notification": "You have a new email from a supplier.",
        "time": "2024-03-19T14:00:00+08:00",
    }


async def stage3(ctx):
    """2024-03-25 Monday 09:00: Escalation & weekly report."""
    # 1. Silent: Mark Notion events as overdue (supplier 48h SLA breached)
    rows = await ctx.notion.query_db(QC_DB_NAME)
    for row in rows:
        status = _get_notion_field(row, "status", "select")
        if status.lower() in ("open", "investigating"):
            await ctx.notion.update_db_row(row["id"], {
                "notes": _notion_text(
                    "Supplier response overdue — 48h SLA breached"
                ),
            })
            break

    # 2. Notification — Feishu from Chen Jie + it's Monday (weekly report)
    return {
        "notification": (
            "[2024-03-25 Monday 09:00] New Feishu message. "
            "It's Monday — weekly reports are due.\n\n"
            "[Feishu] Chen Jie: "
            "\"Supplier A still hasn't responded? "
            "Escalate this — CC procurement director Wang Lei "
            "([email protected]). We need answers.\""
        ),
        "time": "2024-03-25T09:00:00+08:00",
    }

task2