Prompt library & accuracy audit

Flow 4 · P1 · FR10 — versioned prompt history and the weekly 20% manual-audit sample

Categorisation accuracy

91.4% vs. target ≥ 90%

Agreement with expert human review on this week's 20% audit sample (84 of 412 reviews)

Priority assignment accuracy

86.2% vs. target ≥ 85%

Agreement with manager judgement on the same audit sample

Hallucination rate (N/A reviews)

0% vs. target = 0%

Cust2024-006 ("Nice place") and 11 similar short reviews returned Tags = N/A — zero false aspect tags this week
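
The check behind this metric can be sketched roughly as follows. This is an illustration under stated assumptions, not the pilot's actual tooling: the names `is_short_review` and `hallucination_rate`, the `(review_text, aspect_tags)` row layout, and the hard 6-word cutoff (the change log only says "under ~6 words") are all hypothetical.

```python
from typing import Iterable, Tuple

# Assumed cutoff: the change log describes the rule as "under ~6 words".
SHORT_REVIEW_WORDS = 6


def is_short_review(text: str, limit: int = SHORT_REVIEW_WORDS) -> bool:
    """Return True when the review has fewer than `limit` whitespace-separated words."""
    return len(text.split()) < limit


def hallucination_rate(rows: Iterable[Tuple[str, str]]) -> float:
    """Share of short reviews whose Aspect Tags are anything other than N/A.

    Each row is (review_text, aspect_tags). Returns 0.0 when there are no
    short reviews, so an empty sample does not divide by zero.
    """
    short = [(text, tags) for text, tags in rows if is_short_review(text)]
    if not short:
        return 0.0
    false_tags = sum(1 for _, tags in short if tags.strip().upper() != "N/A")
    return false_tags / len(short)


# Hypothetical rows: only "Nice place" (Cust2024-006) comes from the dashboard.
sample = [
    ("Nice place", "N/A"),
    ("The pasta was cold and the waiter ignored us twice", "Food, Service"),
]
print(hallucination_rate(sample))  # 0.0
```

Under these assumptions, a short review tagged "Service" (the case reported for Cust2024-006 before v2.3) would count as one false tag.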

Versioned prompt change log

What changed, why, and the measured effect — owned by Sankar Kumar Palaniappan

  1. v2.3 · Active · Deployed Jun 2, 2026

    Tightened the N/A rule for short/ambiguous reviews

    Added explicit guidance to return Aspect Tags = N/A for reviews under ~6 words (the Cust2024-006 case) instead of inferring tags from sentiment alone. Effect: hallucination rate dropped from 4% → 0%.

  2. v2.2 Deployed May 14, 2026

    Added idiom-awareness examples ("killer pasta" vs. "killed my appetite")

    Injected contrastive few-shot examples so the model weighs context over keyword roots. Effect: categorisation accuracy rose from 87% → 90%.

  3. v2.1 Deployed Apr 22, 2026

    Recalibrated priority criteria around business impact, not just polarity

    Re-weighted "High" toward safety/health flags and repeat-pattern complaints (e.g. "ambiance was terrifyingly unsafe" vs. "a bit quiet" — same aspect, different severity). Effect: priority agreement rose from 79% → 85%.

  4. v1.0 Deployed Mar 3, 2026 — pilot launch

    Initial engineered prompt — 5-column structured output

    Baseline version defining the Sentiment / Aspect Tags / Priority / Suggested Action / 1st Reply schema used across the pilot.
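
For reference, the five-column schema named in v1.0 could be represented roughly as below. The class name `ReviewOutput`, the field names, and the allowed value sets are assumptions inferred from labels on this page (the sentiment and priority values are the ones visible in the audit table); the real prompt may allow others.

```python
from dataclasses import dataclass
from typing import List

# Assumed value sets, taken from the labels visible in the audit table.
SENTIMENTS = {"Positive", "Negative", "Neutral"}
PRIORITIES = {"High", "Normal", "Low"}


@dataclass
class ReviewOutput:
    """One row of the 5-column structured output (hypothetical field names)."""
    sentiment: List[str]      # a mixed review may carry more than one label
    aspect_tags: List[str]    # an empty list stands for Tags = N/A
    priority: str
    suggested_action: str
    first_reply: str

    def __post_init__(self) -> None:
        unknown = set(self.sentiment) - SENTIMENTS
        if not self.sentiment or unknown:
            raise ValueError(f"unexpected sentiment: {self.sentiment}")
        if self.priority not in PRIORITIES:
            raise ValueError(f"unexpected priority: {self.priority}")

    @property
    def tags_label(self) -> str:
        """Render Aspect Tags as a string, using N/A when the list is empty."""
        return ", ".join(self.aspect_tags) if self.aspect_tags else "N/A"


# Hypothetical row; the action and reply strings are placeholders.
row = ReviewOutput(["Positive"], [], "Low", "No action needed", "Thanks for visiting!")
print(row.tags_label)  # N/A
```

Validating the two closed-vocabulary columns at construction time keeps malformed model output from reaching the audit comparison.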

This week's 20% manual audit sample

FR10 — 84 of 412 reviews independently re-scored by Sankar against the live prompt output

Audit week of Jun 1–6
Customer ID   AI sentiment          Auditor sentiment     AI priority  Auditor priority  Match?
Cust2024-001  Positive + Negative   Positive + Negative   High         High              Yes
Cust2024-005  Neutral               Neutral               Normal       Low               No
Cust2024-006  Positive, Tags = N/A  Positive, Tags = N/A  Low          Low               Yes
Cust2024-008  Negative              Negative              Normal       Normal            Yes
Cust2024-009  Negative              Negative              High         High              Yes

Sample disagreement: Cust2024-005 — AI assigned Normal, auditor judged Low ("nothing special, no real risk of churn"). Logged for the v2.4 calibration review.
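
The agreement figures reduce to a simple match ratio. A minimal sketch, using only the five rows shown above, so the percentages it prints describe this excerpt and not the full 84-review sample. The `AuditRow` type and `agreement` helper are illustrative names, not part of the pilot tooling.

```python
from typing import Callable, List, NamedTuple


class AuditRow(NamedTuple):
    customer_id: str
    ai_sentiment: str
    auditor_sentiment: str
    ai_priority: str
    auditor_priority: str


# The five rows visible in the audit table.
ROWS: List[AuditRow] = [
    AuditRow("Cust2024-001", "Positive+Negative", "Positive+Negative", "High", "High"),
    AuditRow("Cust2024-005", "Neutral", "Neutral", "Normal", "Low"),
    AuditRow("Cust2024-006", "Positive", "Positive", "Low", "Low"),
    AuditRow("Cust2024-008", "Negative", "Negative", "Normal", "Normal"),
    AuditRow("Cust2024-009", "Negative", "Negative", "High", "High"),
]


def agreement(rows: List[AuditRow], matches: Callable[[AuditRow], bool]) -> float:
    """Percentage of rows where the AI output matches the auditor."""
    if not rows:
        raise ValueError("no audited rows")
    return 100.0 * sum(1 for r in rows if matches(r)) / len(rows)


sentiment_pct = agreement(ROWS, lambda r: r.ai_sentiment == r.auditor_sentiment)
priority_pct = agreement(ROWS, lambda r: r.ai_priority == r.auditor_priority)
disagreements = [r.customer_id for r in ROWS if r.ai_priority != r.auditor_priority]

print(sentiment_pct, priority_pct, disagreements)
# 100.0 80.0 ['Cust2024-005']
```

Passing the match rule as a function lets the same helper score sentiment, priority, or any other column pair.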

Reported miscategorisations — open queue

  • Open · Cust2024-005: "Priority felt too high for a low-stakes neutral review" (reported by Maria Reyes, Jun 6)
  • Resolved in v2.3 · Cust2024-006: "Tagged 'Service' on a one-line review with no detail" (reported by Jordan Tate, May 28)