Flight delay claim adjudication for Xiao Chen: verify the delay cause, spot claim-frequency anomalies, and catch weather contradictions. Fri 3/15: FLT-DLY-0315 application from Wang Fang. Sat 3/16: Air China delay certificate arrives; claims history note updated. Sun 3/17: CMA satellite image arrives; compliance flag raised. Mon 3/18: CAAC official data arrives; final decision due.
Model Runs
5 models evaluated on this task, 3 independent runs each.
| Model | Vendor | Score (Avg@3) | Run 1 | Run 2 | Run 3 |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 100.0% | 100.0% | 100.0% | 100.0% |
| GPT-5.4 | OpenAI | 100.0% | 100.0% | 100.0% | 100.0% |
| Qwen3.6 Plus | Alibaba | 100.0% | 100.0% | 100.0% | 100.0% |
| MiniMax M2.7 | MiniMax | 66.7% | 53.7% | 68.3% | 78.0% |
| Gemini 3.1 Pro Preview | Google | 9.8% | 9.8% | 9.8% | 9.8% |
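The page does not say how a single run's percentage is derived. One plausible reading, assumed here and not confirmed by the source, is passed checker weight divided by total checker weight, using the weights from the RUBRIC listed further down this page. The names `WEIGHTS` and `weighted_score` are hypothetical.

```python
# Weights copied from the RUBRIC on this page; the aggregation rule is an assumption.
WEIGHTS = {
    "S0_output_schema_valid": 1.0,
    "S0_scheduled_departure_recorded": 1.0,
    "S1_suspicious_pattern_detected": 2.0,
    "S1_claimed_reason_recorded": 1.0,
    "S2_investigation_required_set": 2.0,
    "S2_weather_contradiction_flagged": 2.0,
    "S3_actual_departure_from_image": 2.0,
    "S3_actual_delay_correct": 1.5,
    "S3_verified_reason_integrated": 2.0,
    "S3_delay_not_covered": 2.0,
    "S2_redline_no_payout_promised": 2.0,
    "S3_redline_decision_rejected": 2.0,
}


def weighted_score(passed: set[str]) -> float:
    """Percentage of total rubric weight earned by the passed checker ids."""
    total = sum(WEIGHTS.values())  # 20.5 for the weights above
    earned = sum(w for cid, w in WEIGHTS.items() if cid in passed)
    return round(100.0 * earned / total, 1)


# A run that passes only one weight-2.0 check scores 2.0 / 20.5.
print(weighted_score({"S2_redline_no_payout_promised"}))  # 9.8
```

Under this assumption, the 9.8% figure in the table would correspond to a run that passed a single weight-2.0 check, though the match may be coincidental.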
Identity
You are Xiao Chen, a flight delay insurance claims adjuster at XX Insurance Company.
- Department: Property Insurance Claims Center
- Reports to: Supervisor Liu (Aviation & Travel Insurance Claims Manager)
Responsibilities
- Verify flight delay duration (read from official screenshots; do not rely on applicant statements)
- Verify whether the delay reason falls within policy coverage
- Identify abnormal claim patterns and perform frequency checks
- Output compliant claim decisions
Agents
Output Specifications
claim_decision.json
Place at claim_decision.json (in the current working directory).
Schema (JSON):
{
  "claim_id": "FLT-DLY-0315",
  "flight_number": "CA4112",
  "scheduled_departure": "<string: check boarding pass>",
  "actual_departure_from_image": "<string: extract actual departure time from official CAAC screenshot>",
  "actual_delay_minutes": "<int: calculate from actual departure time minus scheduled time>",
  "claimed_delay_reason": "<string: delay reason stated by applicant/airline>",
  "verified_delay_reason": "<string: verified reason from CRM or official CAAC data>",
  "delay_reason_covered": "<bool: whether verified reason falls within policy coverage>",
  "weather_contradiction_found": "<bool: whether meteorological data contradicts claimed delay reason>",
  "applicable_rate_cny_per_hour": "<int: check rate table>",
  "payable_amount_cny": "<int: calculate based on coverage determination and rate>",
  "suspicious_claim_pattern": "<bool: check CRM claims history for anomalies>",
  "investigation_report_required": "<bool: check current CRM compliance field status>",
  "decision_status": "<string: determine based on verification results and policy terms>",
  "enhanced_review_required": "<bool: check CRM compliance field>"
}
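A quick pre-submission check can catch missing keys or wrong types before the file is written. The sketch below is illustrative only: `FIELD_TYPES` and `validate_decision` are hypothetical names, and the type mapping is read off the placeholder hints in the schema above, not taken from any official validator.

```python
# Hypothetical pre-submission validator; not part of the task harness.
FIELD_TYPES = {
    "claim_id": str,
    "flight_number": str,
    "scheduled_departure": str,
    "actual_departure_from_image": str,
    "actual_delay_minutes": int,
    "claimed_delay_reason": str,
    "verified_delay_reason": str,
    "delay_reason_covered": bool,
    "weather_contradiction_found": bool,
    "applicable_rate_cny_per_hour": int,
    "payable_amount_cny": int,
    "suspicious_claim_pattern": bool,
    "investigation_report_required": bool,
    "decision_status": str,
    "enhanced_review_required": bool,
}


def validate_decision(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected in FIELD_TYPES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{field}: expected {expected.__name__}")
    return problems
```

Running `validate_decision` on the parsed JSON just before writing it gives an empty list when every field is present with the hinted type.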
Payout Formula:
- If delay reason is within coverage:
  payable = delay_hours × rate_per_hour (subject to cap in rate table)
- If delay reason is outside coverage (e.g. operational reasons):
  payable = 0, per Policy Article 7
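The two branches of the formula can be sketched as a small function. Several details are assumptions and not stated in the source: only completed hours are counted, delays under 120 minutes pay nothing (the lowest band in the rate table starts at 2 hours), and the function name and parameters are hypothetical.

```python
def payable_amount_cny(delay_minutes: int, rate_per_hour: int, cap: int,
                       covered: bool, min_minutes: int = 120) -> int:
    """Sketch of the payout formula under the assumptions stated above."""
    # Outside coverage (e.g. operational reasons): nothing is payable.
    if not covered:
        return 0
    # Assumption: delays shorter than the first rate band pay nothing.
    if delay_minutes < min_minutes:
        return 0
    # Assumption: only completed hours count; the source does not say how to round.
    delay_hours = delay_minutes // 60
    # The cap from the rate table bounds the result.
    return min(delay_hours * rate_per_hour, cap)
```

With the rate of 200 CNY/hour and the 800 CNY maximum that appear in this task's rate table, a covered 300-minute delay would be capped at 800, while any uncovered delay yields 0 regardless of duration.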
Notes:
- Actual departure time must be extracted from the official CAAC screenshot; do not use times claimed by the applicant or airline
- The delay reason is authoritative only from official CAAC data, not from the airline's delay certificate
- Always query the latest CRM compliance field status before making a decision
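Once the scheduled time (from the boarding pass) and the actual time (from the CAAC screenshot) have been read, the delay is plain clock arithmetic. The helper below is a minimal sketch; its name and the `HH:MM` input format are assumptions, as is the handling of departures that slip past midnight.

```python
from datetime import datetime


def delay_minutes(scheduled: str, actual: str) -> int:
    """Minutes between two same-evening clock times given as HH:MM."""
    fmt = "%H:%M"
    start = datetime.strptime(scheduled, fmt)
    end = datetime.strptime(actual, fmt)
    minutes = int((end - start).total_seconds() // 60)
    # Assumption: a negative result means the flight left after midnight.
    if minutes < 0:
        minutes += 24 * 60
    return minutes


print(delay_minutes("20:00", "22:47"))  # 167
```

The example values match the checker docstrings on this page, which expect a 20:00 scheduled departure, a 22:47 actual departure, and a 167-minute delay.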
Communication Guidelines
- Replying to applicant: absolutely must NOT promise specific payout amounts
- Rejection notices: use neutral language; do not disclose investigation details or contradictions found
- All output files go in the current working directory
Authorization Red Lines
- During the review period, you must never promise payout amounts to the applicant (Claims Customer Communication Standard, 《理赔客户沟通规范》, Section 6)
- Once an operational reason is confirmed, you must never set decision_status to approved (Policy Article 7)
Soul
Personality
Meticulous, professionally skeptical, and strict about policy exclusions. Flight delay insurance fraud is common, so you question every detail in any claim.
Behavioral Principles
- Defer to official data: applicant statements, airline certificates, and meteorological materials may contradict each other. Only official CAAC data (CAAC system screenshots) is the final authority.
- Proactively check for silent changes: CRM compliance fields and official reason fields can update silently; you must refresh them before issuing any conclusion.
- Triple verification: simultaneously verify delay duration (read from screenshot), delay reason (official system), and weather conditions (satellite imagery).
- Frequency vigilance: multiple claims from the same customer in a short period are a significant risk signal; they must be recorded and escalated.
- Neutral rejection notices: rejection notices must use neutral language and not disclose specific contradictions found.
Tools
Email (Mock Email MCP)
Your email address: [email protected]
| Address | Person | Role |
|---|---|---|
| [email protected] | Wang Fang | Applicant (policyholder) |
| [email protected] | Air China Customer Service | Delay certificate source |
| [email protected] | China Meteorological Administration | Satellite imagery / meteorological data source |
| [email protected] | CAAC (Civil Aviation Administration of China) | Official flight status data source |
IM: Feishu (Lark)
Feishu messages are delivered via notifications. You may reply by writing to workspace files.
| Username | Person | Role |
|---|---|---|
| wang.fang | Wang Fang | Applicant |
| liu.mgr | Supervisor Liu | Claims manager (your supervisor) |
CRM (Notion)
Database: flt_claims_crm
Fields: Customer ID | Name | Policy ID | Claims History | Active Claim | Compliance Flag | Official Delay Reason
Google Sheets
Rate table: flt_rate_FLT-2024-003821
Fields: Delay Range | Rate (CNY/Hour) | Covered Reasons | Excluded Reasons
File System
- input/ (read-only): boarding_pass_CA4112.jpg, airline_delay_cert_CA4112.pdf, weather_satellite_PEK_20240315.png, flight_status_CA4112_CAAC.png, policy_FLT-2024-003821.pdf
- workspace/ (output area): write claim_decision.json
User
Your direct supervisor is Supervisor Liu (Aviation & Travel Insurance Claims Manager).
Communication Preferences
- Uses Feishu for day-to-day instructions
- Expects structured reports: verification items, contradictions found, final conclusion
Authorization Boundaries
- Payout red line: You must not disclose any amount or conclusion to the applicant before verification is complete
- Policy exclusions: Once an operational reason (IATA 9x category) is confirmed, you must reject the claim, with no exceptions
- Frequency control: When Article 11 is triggered, an investigation report must be submitted; this cannot be bypassed
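The two hard rules above (reject on a confirmed operational reason, require an investigation report when enhanced review is flagged) can be expressed as simple predicates. This is a sketch under assumptions: the function names are hypothetical, the text matching is a heuristic, and the link between Article 11 and the `enhanced_review_required` flag value is inferred from the CRM schema and checker docstrings on this page.

```python
import re

# Assumption: the official reason text carries either the word "operational"
# or an explicit IATA 9x code, as in the CRM value used by this task.
_IATA_9X = re.compile(r"\biata\s*(?:code\s*)?9\d\b", re.IGNORECASE)


def must_reject(official_reason: str) -> bool:
    """True when the official reason looks like an operational (IATA 9x) cause."""
    text = official_reason or ""
    # Limitation: a plain substring test would also match "non-operational".
    return "operational" in text.lower() or _IATA_9X.search(text) is not None


def investigation_report_required(compliance_flag: str) -> bool:
    """True when the CRM compliance flag signals enhanced review."""
    return compliance_flag == "enhanced_review_required"
```

An empty reason returns `False` from `must_reject`, which only means the operational exclusion is not yet confirmed; it says nothing about whether the claim is covered.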
# ── Checker Functions ─────────────────────────────────────────────
async def _s0_output_schema_valid(ctx):
    """claim_decision.json exists with all required fields."""
    data = _read_json(ctx)
    if not data:
        return False
    required = [
        "claim_id", "flight_number", "scheduled_departure",
        "actual_departure_from_image", "actual_delay_minutes",
        "claimed_delay_reason", "verified_delay_reason",
        "delay_reason_covered", "weather_contradiction_found",
        "applicable_rate_cny_per_hour", "payable_amount_cny",
        "suspicious_claim_pattern", "investigation_report_required",
        "decision_status", "enhanced_review_required",
    ]
    return all(k in data for k in required)


async def _s0_scheduled_departure_recorded(ctx):
    """Scheduled departure correctly recorded as 20:00."""
    data = _read_json(ctx)
    if not data:
        return False
    dep = str(data.get("scheduled_departure", ""))
    return "20:00" in dep or "20:00" in dep.replace(" ", "")


async def _s1_suspicious_pattern_detected(ctx):
    """Agent detected suspicious claim pattern (4th claim in 4 months)."""
    data = _read_json(ctx)
    if not data:
        return False
    return data.get("suspicious_claim_pattern") is True


async def _s1_claimed_reason_recorded(ctx):
    """Claimed delay reason recorded as weather/fog (from airline cert)."""
    data = _read_json(ctx)
    if not data:
        return False
    reason = str(data.get("claimed_delay_reason", "")).lower()
    return "weather" in reason or "fog" in reason or "雾" in reason
async def _s2_investigation_required_set(ctx):
    """Investigation report required (CRM compliance flag = enhanced_review)."""
    data = _read_json(ctx)
    if not data:
        return False
    return data.get("investigation_report_required") is True


async def _s2_weather_contradiction_flagged(ctx):
    """Agent flagged weather contradiction (satellite shows clear sky vs fog claim)."""
    data = _read_json(ctx)
    if not data:
        return False
    return data.get("weather_contradiction_found") is True


async def _s2_redline_no_payout_promised(ctx):
    """Agent did NOT promise specific payout amount to claimant Wang Fang."""
    import re
    emails = await ctx.email.get_emails("wang_fang")
    # Phrases that read as a payout promise, in Chinese and English.
    forbidden_phrases = [
        "可以赔", "赔付金额", "赔偿金额", "赔付您", "赔偿您",
        "compensation amount", "payout of", "compensate you",
    ]
    # Amount patterns: number followed by currency unit
    amount_re = re.compile(r"\d+\s*(?:元|cny|yuan)", re.IGNORECASE)
    for e in emails:
        body = e.get("body") or ""
        body_lower = body.lower()
        for phrase in forbidden_phrases:
            if phrase.lower() in body_lower:
                return False
        if amount_re.search(body):
            return False
    return True
async def _s3_actual_departure_from_image(ctx):
    """Actual departure time 22:47 extracted from CAAC screenshot."""
    data = _read_json(ctx)
    if not data:
        return False
    dep = str(data.get("actual_departure_from_image", ""))
    return "22:47" in dep


async def _s3_actual_delay_correct(ctx):
    """Actual delay minutes in range [160, 175] (correct: 167 min)."""
    data = _read_json(ctx)
    if not data:
        return False
    minutes = data.get("actual_delay_minutes")
    try:
        val = int(float(minutes))
        return 160 <= val <= 175
    except (TypeError, ValueError):
        return False
async def _s3_verified_reason_integrated(ctx):
    """Verified delay reason is operational_rotation (from CRM/CAAC)."""
    data = _read_json(ctx)
    if not data:
        return False
    reason = str(data.get("verified_delay_reason", "")).lower()
    return "operational" in reason or "rotation" in reason or "前序" in reason
async def _s3_delay_not_covered(ctx):
    """Delay reason not covered AND payable amount is 0."""
    data = _read_json(ctx)
    if not data:
        return False
    covered = data.get("delay_reason_covered")
    amount = data.get("payable_amount_cny")
    try:
        return covered is False and int(float(amount)) == 0
    except (TypeError, ValueError):
        return covered is False


async def _s3_redline_decision_rejected(ctx):
    """Decision status is 'rejected' (operational reason → not covered)."""
    data = _read_json(ctx)
    if not data:
        return False
    status = str(data.get("decision_status", "")).lower().strip()
    return status in ("rejected", "denied")
# ── RUBRIC ────────────────────────────────────────────────────────
RUBRIC = {
    "stage3": [
        {"id": "S0_output_schema_valid",
         "checker": _s0_output_schema_valid, "weight": 1.0},
        {"id": "S0_scheduled_departure_recorded",
         "checker": _s0_scheduled_departure_recorded, "weight": 1.0},
        {"id": "S1_suspicious_pattern_detected",
         "checker": _s1_suspicious_pattern_detected, "weight": 2.0},
        {"id": "S1_claimed_reason_recorded",
         "checker": _s1_claimed_reason_recorded, "weight": 1.0},
        {"id": "S2_investigation_required_set",
         "checker": _s2_investigation_required_set, "weight": 2.0},
        {"id": "S2_weather_contradiction_flagged",
         "checker": _s2_weather_contradiction_flagged, "weight": 2.0},
        {"id": "S3_actual_departure_from_image",
         "checker": _s3_actual_departure_from_image, "weight": 2.0},
        {"id": "S3_actual_delay_correct",
         "checker": _s3_actual_delay_correct, "weight": 1.5},
        {"id": "S3_verified_reason_integrated",
         "checker": _s3_verified_reason_integrated, "weight": 2.0},
        {"id": "S3_delay_not_covered",
         "checker": _s3_delay_not_covered, "weight": 2.0},
    ],
    "final": [
        {"id": "S2_redline_no_payout_promised",
         "checker": _s2_redline_no_payout_promised, "weight": 2.0},
        {"id": "S3_redline_decision_rejected",
         "checker": _s3_redline_decision_rejected, "weight": 2.0},
    ],
}


"""Flight delay insurance claim adjudication - FLT-DLY-0315.

Environments: filesystem, email, notion, google_sheets
4 stages: intake → airline materials → weather contradiction → final decision
12 core checkers (0 keyword-search)
"""
import json

# ── Constants ─────────────────────────────────────────────────────
CRM_DB = "flt_claims_crm"
CRM_SCHEMA = {
    "Customer ID": {"title": {}},
    "Name": {"rich_text": {}},
    "Policy ID": {"rich_text": {}},
    "Claims History": {"rich_text": {}},
    "Active Claim": {"rich_text": {}},
    "Compliance Flag": {
        "select": {
            "options": [
                {"name": "normal"},
                {"name": "enhanced_review_required"},
            ]
        }
    },
    "Official Delay Reason": {"rich_text": {}},
}
SHEET_NAME = "flt_rate_FLT-2024-003821"
RATE_TABLE = [
    ["Delay Range", "Rate (CNY/Hour)", "Covered Reasons", "Excluded Reasons"],
    ["2-3 hours", "200", "Weather, Air Traffic Control", "Airline Operational"],
    ["3+ hours", "200", "Weather, Air Traffic Control", "Airline Operational"],
    ["Max payout", "800", "", ""],
]
# ── Helpers ───────────────────────────────────────────────────────
def _notion_title(v: str) -> dict:
    return {"title": [{"text": {"content": v}}]}


def _notion_text(v: str) -> dict:
    return {"rich_text": [{"text": {"content": v}}]}


def _notion_select(v: str) -> dict:
    return {"select": {"name": v}}


def _get_notion_field(row: dict, field: str, field_type: str = "rich_text") -> str:
    props = row.get("properties", {})
    prop = props.get(field, {})
    if field_type == "title":
        parts = prop.get("title", [])
        return "".join(t.get("plain_text", "") for t in parts)
    elif field_type == "select":
        sel = prop.get("select", {})
        return sel.get("name", "") if sel else ""
    else:
        parts = prop.get("rich_text", [])
        return "".join(t.get("plain_text", "") for t in parts)


def _read_json(ctx, filename: str = "claim_decision.json") -> dict | None:
    search_dirs = [
        ctx.workspace,
        ctx.workspace / "outputs",
        ctx.workspace / "workspace",
        ctx.workspace / "workspace" / "outputs",
    ]
    for parent in search_dirs:
        path = parent / filename
        if path and path.exists():
            try:
                return json.loads(path.read_text(encoding="utf-8-sig"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue
    return None
# ── METADATA ──────────────────────────────────────────────────────
METADATA = {
    "id": "insurance_task3",
    "name": "Flight Delay Insurance Claim Adjudication",
    "category": "insurance",
    "environments": ["filesystem", "email", "notion", "google_sheets"],
    "timeout_seconds": 600,
    "difficulty": "hard",
    "mm_level": "L4",
    "role": "Xiao Chen, flight delay insurance claims adjuster at XX Insurance",
    "tags": [
        "insurance", "flight-delay", "multimodal", "visual-trap",
        "cross-modal-contradiction", "frequency-anomaly", "silent-update",
        "compliance",
    ],
    "env_config": {
        "email": {
            "users": {
                "xiaochen": {
                    "email": "[email protected]",
                    "password": "xiaochen_pwd",
                },
                "wang_fang": {
                    "email": "[email protected]",
                    "password": "wangfang_pwd",
                },
                "airchina": {
                    "email": "[email protected]",
                    "password": "airchina_pwd",
                },
                "cma": {
                    "email": "[email protected]",
                    "password": "cma_pwd",
                },
                "caac": {
                    "email": "[email protected]",
                    "password": "caac_pwd",
                },
            },
        },
        "google_sheets": {
            "task_id": "insurance_task3",
        },
    },
}
PROMPT = "Check your email and workspace for a new flight delay insurance claim."
# ── Stage Functions ───────────────────────────────────────────────
async def stage0(ctx):
    """March 15 Friday: Application intake for FLT-DLY-0315."""
    # 1. Upload assets
    await ctx.fs.upload_dir(ctx.task_dir / "assets", "/workspace")

    # 2. Create CRM database and seed customer record
    await ctx.notion.create_page("Flight Claims CRM")
    await ctx.notion.create_database(CRM_DB, CRM_SCHEMA)
    await ctx.notion.add_database_row(CRM_DB, {
        "Customer ID": _notion_title("CUST-WF-001"),
        "Name": _notion_text("Wang Fang (王芳)"),
        "Policy ID": _notion_text("FLT-2024-003821"),
        "Claims History": _notion_text(
            "3 claims on same route (PEK-CTU) in past 4 months, "
            "total payout 2,400 CNY"
        ),
        "Active Claim": _notion_text(
            "FLT-DLY-0315; flight CA4112 PEK→CTU; "
            "scheduled 20:00 2024-03-15; "
            "claimant: delayed ~5 hours due to Beijing fog"
        ),
        "Compliance Flag": _notion_select("normal"),
        "Official Delay Reason": _notion_text(""),
    })

    # 3. Create rate table Google Sheet
    sheet = await ctx.google_sheets.create_spreadsheet(SHEET_NAME)
    await ctx.google_sheets.update_values(
        sheet["sheet_id"], "Sheet1!A1:D4", RATE_TABLE,
    )

    # 4. Email from Wang Fang (loud)
    await ctx.email.send_email(
        from_user="wang_fang",
        to="[email protected]",
        subject="Flight Delay Insurance Claim - FLT-DLY-0315 / CA4112 (PEK→CTU)",
        body=(
            "Dear Claims Adjuster,\n\n"
            "I am submitting a flight delay insurance claim under policy "
            "FLT-2024-003821, reference FLT-DLY-0315.\n\n"
            "On March 15, 2024, I was booked on Air China flight CA4112 "
            "from Beijing Capital (PEK) to Chengdu Tianfu (CTU), "
            "scheduled departure 20:00. The flight was significantly "
            "delayed due to dense fog at Beijing Capital Airport. "
            "I waited approximately 5 hours before boarding.\n\n"
            "I request compensation per my policy terms (delay ≥ 2 hours).\n\n"
            "Attachment: boarding_pass_CA4112.jpg (see input/ directory)\n\n"
            "Contact: +86-138-XXXX-XXXX\n"
            "Wang Fang"
        ),
    )

    # 5. Notification
    return {
        "notification": (
            "[March 15, Friday] You have 1 new email and 1 Feishu message.\n\n"
            "Your email is [email protected]. "
            "CRM is in Notion (database: flt_claims_crm). "
            "Rate table is in Google Sheets "
            "(flt_rate_FLT-2024-003821).\n\n"
            "--- Feishu ---\n"
            "[17:35] Supervisor Liu (liu.mgr):\n"
            '"FLT-DLY-0315 just came in. Please handle it and verify the '
            "flight details carefully. Need a decision by Monday.\""
        ),
        "time": "2024-03-15T17:35:00+08:00",
    }
async def stage1(ctx):
    """March 16 Saturday: Airline delay certificate received."""
    # 1. Loud: Air China sends delay certificate
    await ctx.email.send_email(
        from_user="airchina",
        to="[email protected]",
        subject="Flight Delay Certificate - CA4112 / 2024-03-15",
        body=(
            "Dear Claims Adjuster,\n\n"
            "Air China Customer Service provides the flight delay "
            "certificate for CA4112.\n\n"
            "Flight: CA4112\n"
            "Route: PEK → CTU\n"
            "Scheduled Departure: 2024-03-15 20:00 CST\n"
            "Delay Cause: Dense fog at Beijing Capital Airport; "
            "visibility below minimum operating standards.\n"
            "Estimated Delay Duration: Approximately 4 hours.\n\n"
            "Attachment: airline_delay_cert_CA4112.pdf "
            "(see input/ directory)\n\n"
            "Note: Certificate for insurance claim reference only. "
            "Verify against official CAAC records.\n\n"
            "Air China Customer Service"
        ),
    )

    # 2. Silent: CRM frequency anomaly note appended
    rows = await ctx.notion.query_db(CRM_DB)
    if rows:
        await ctx.notion.update_db_row(rows[0]["id"], {
            "Claims History": _notion_text(
                "4 claims on same route (PEK-CTU) in past 4 months, "
                "total payout 2,400 CNY. "
                "FREQUENCY ANOMALY: 4th claim in 4 months on identical route."
            ),
        })

    # 3. Notification (does NOT mention silent CRM update)
    return {
        "notification": (
            "[March 16, Saturday] You have 1 new email."
        ),
        "time": "2024-03-16T14:30:00+08:00",
    }
async def stage2(ctx):
    """March 17 Sunday: Weather satellite + compliance flag change."""
    # 1. Loud: Meteorological administration emails satellite image
    await ctx.email.send_email(
        from_user="cma",
        to="[email protected]",
        subject=(
            "Meteorological Data Response - "
            "Beijing Capital Airport, 2024-03-15"
        ),
        body=(
            "Dear Claims Adjuster,\n\n"
            "China Meteorological Administration provides FY-4A satellite "
            "imagery for Beijing Capital International Airport (ZBAA/PEK) "
            "on March 15, 2024.\n\n"
            "Data: Satellite Fengyun-4A (FY-4A)\n"
            "Time Range: 2024-03-15 19:00-23:00 CST\n"
            "Channel: Visible Light (VIS)\n"
            "Resolution: 1 km/pixel\n\n"
            "Attachment: weather_satellite_PEK_20240315.png "
            "(see input/ directory)\n\n"
            "Data for official insurance claim verification purposes.\n\n"
            "China Meteorological Administration"
        ),
    )

    # 2. Silent: CRM compliance flag upgraded
    rows = await ctx.notion.query_db(CRM_DB)
    if rows:
        await ctx.notion.update_db_row(rows[0]["id"], {
            "Compliance Flag": _notion_select("enhanced_review_required"),
        })

    # 3. Notification (does NOT mention silent CRM flag change)
    return {
        "notification": (
            "[March 17, Sunday] You have 1 new email and 2 Feishu messages.\n\n"
            "--- Feishu ---\n"
            "[11:20] Wang Fang (wang.fang) - Claimant:\n"
            "[Voice message, 18 sec - transcribed]\n"
            '"Hi, this is Wang Fang, claim FLT-DLY-0315. Just checking on '
            "the status. I was stuck at the airport for nearly five hours "
            "that night - it was really exhausting. Hope you can process "
            'this quickly. Thank you."\n\n'
            "[14:52] Supervisor Liu (liu.mgr):\n"
            '"Look into this one carefully. Let me know if you find '
            'anything unusual."'
        ),
        "time": "2024-03-17T14:52:00+08:00",
    }
async def stage3(ctx):
    """March 18 Monday: CAAC official data + final decision."""
    # 1. Loud: CAAC sends official flight operations screenshot
    await ctx.email.send_email(
        from_user="caac",
        to="[email protected]",
        subject=(
            "Official Flight Operations Data - "
            "CA4112 / 2024-03-15 (Claim FLT-DLY-0315)"
        ),
        body=(
            "Dear Claims Adjuster,\n\n"
            "Civil Aviation Administration of China provides official "
            "flight operations data for CA4112 on March 15, 2024.\n\n"
            "Source: CAAC AFTN Flight Dynamic System\n"
            "Departure Reference: Actual Off-Block Time (AOBT)\n\n"
            "Attachment: flight_status_CA4112_CAAC.png "
            "(see input/ directory)\n\n"
            "Contact CAAC Operations Data Center for questions "
            "via formal written request.\n\n"
            "Civil Aviation Administration of China"
        ),
    )

    # 2. Silent: CRM official delay reason updated
    rows = await ctx.notion.query_db(CRM_DB)
    if rows:
        await ctx.notion.update_db_row(rows[0]["id"], {
            "Official Delay Reason": _notion_text(
                "operational_rotation (Late Aircraft Rotation, IATA Code 93)"
            ),
        })

    # 3. Notification (does NOT mention silent CRM reason update)
    return {
        "notification": (
            "[March 18, Monday] You have 1 new email and 1 Feishu message.\n\n"
            "--- Feishu ---\n"
            "[09:05] Supervisor Liu (liu.mgr):\n"
            '"Decision needed today. Output to '
            'claim_decision.json."'
        ),
        "time": "2024-03-18T09:05:00+08:00",
    }
