Every AI vendor will tell you their solution "works great." The question is: how do you actually know? LLM evaluation in production is one of the most underappreciated challenges in applied AI.
Why Standard Metrics Fall Short
Traditional ML metrics (accuracy, precision, recall) assume clear ground truth. But for LLM-powered systems — agents that write emails, analyse documents, or answer questions — "correct" is often subjective and multidimensional.
An email response might be factually accurate but tonally wrong. A document summary might capture the key points but miss a critical nuance. A classification might be technically correct but unhelpful in context.
Our Evaluation Framework
We evaluate production LLM systems across four dimensions:
1. Correctness
Does the output contain factual errors, hallucinations, or contradictions?
How we measure:
- Automated fact-checking against source documents
- Structured output validation (do extracted fields match expected formats?)
- Contradiction detection within the output itself
- Spot-check sampling by human reviewers (5-10% of outputs)
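The structured-output check above can be sketched in a few lines. The field names and formats here are hypothetical, purely to illustrate the pattern: declare what each extracted field must look like, then report anything missing or malformed.

```python
import re

# Hypothetical expected formats for fields an LLM extracts from documents.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"^INV-\d{6}$"),
    "total_amount": re.compile(r"^\d+\.\d{2}$"),
    "due_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def validate_fields(output: dict) -> list[str]:
    """Return the names of fields that are missing or malformed."""
    errors = []
    for field, pattern in FIELD_PATTERNS.items():
        value = output.get(field)
        if value is None or not pattern.match(str(value)):
            errors.append(field)
    return errors
```

Outputs with a non-empty error list get routed to human review rather than silently passed through.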
2. Completeness
Does the output address everything it should?
How we measure:
- Required-field coverage (for structured extraction tasks)
- Key-point checklists (for summarisation tasks)
- Comparison against human-generated reference outputs
- Missing information detection (what should be there but isn't?)
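A minimal sketch of the key-point checklist for summarisation tasks: given a list of points a summary must cover, score what fraction actually appear. Naive substring matching keeps the example short; a production check would use semantic matching (embeddings or an LLM judge) instead.

```python
def checklist_coverage(summary: str, key_points: list[str]) -> float:
    """Fraction of required key points mentioned in the summary.

    Naive substring match for illustration only; real systems should
    match semantically, since a summary can cover a point in other words.
    """
    text = summary.lower()
    if not key_points:
        return 1.0
    hits = sum(1 for point in key_points if point.lower() in text)
    return hits / len(key_points)
```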
3. Usefulness
Does the output actually help the user accomplish their goal?
How we measure:
- User action rates (did they use the output or ignore it?)
- Edit distance (how much did humans change the AI output before using it?)
- Task completion time (faster with AI than without?)
- Escalation rates (how often does AI output lead to human intervention?)
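The edit-distance signal can be computed directly from the AI draft and the text the human actually sent. A sketch using Levenshtein distance, normalised so 0.0 means "used verbatim" and 1.0 means "fully rewritten":

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein edit distance (standard DP, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_ratio(ai_draft: str, human_final: str) -> float:
    """Normalised edit distance: 0.0 = used as-is, 1.0 = fully rewritten."""
    if not ai_draft and not human_final:
        return 0.0
    return levenshtein(ai_draft, human_final) / max(len(ai_draft), len(human_final))
```

Tracking the distribution of this ratio over time is often more telling than the average: a growing tail of near-1.0 rewrites signals a usefulness problem even when the mean looks stable.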
4. Safety
Does the output avoid harm, bias, or policy violations?
How we measure:
- Automated policy compliance checking
- Bias audits across demographic dimensions
- Tone and sentiment analysis for customer-facing outputs
- Jailbreak and prompt injection testing
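The first safety layer, automated policy compliance, can start as simply as rule-based screening. The rules below are hypothetical examples; real deployments would combine rules like these with classifier- or LLM-based checks that catch paraphrases a regex misses.

```python
import re

# Hypothetical policy rules for customer-facing output.
POLICY_RULES = [
    ("no_pricing_promises", re.compile(r"\bguarantee(d)?\b.*\bprice\b", re.I)),
    ("no_legal_advice", re.compile(r"\blegal advice\b", re.I)),
]

def policy_violations(text: str) -> list[str]:
    """Return the names of any policy rules the text violates."""
    return [name for name, pattern in POLICY_RULES if pattern.search(text)]
```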
Evaluation in Practice
For a typical Digital Employee deployment, we set up three layers of evaluation:
Layer 1: Automated (every output)
- Schema validation, fact-checking against source data, confidence scoring
- Runs in real time and flags outputs below threshold for human review
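The Layer 1 flagging logic amounts to combining hard checks with a confidence floor. A sketch, where the field names and the 0.8 threshold are illustrative assumptions rather than fixed values:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    output_id: str
    confidence: float   # model or heuristic confidence score, 0-1
    schema_ok: bool     # passed structured-output validation
    facts_ok: bool      # passed fact-check against source data

CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune per deployment

def needs_human_review(result: EvalResult) -> bool:
    """Flag any output that fails a hard check or scores below threshold."""
    return (not result.schema_ok
            or not result.facts_ok
            or result.confidence < CONFIDENCE_THRESHOLD)
```

Hard-check failures always escalate regardless of confidence, which keeps a well-calibrated-but-wrong output from slipping through.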
Layer 2: Sampling (5-10% of outputs)
- Human reviewers score a random sample across all four dimensions
- Weekly reports track trends and catch drift
Layer 3: Deep audit (monthly)
- Comprehensive review of edge cases, escalations, and user feedback
- Bias testing and safety review
- Model performance comparison against baseline
The Feedback Loop
Evaluation without action is just measurement. The critical step is closing the loop:
- **Identify** systematic issues from evaluation data
- **Diagnose** root cause (prompt issue? knowledge gap? model limitation?)
- **Fix** via prompt updates, knowledge base additions, or model changes
- **Validate** the fix resolved the issue without introducing regressions
- **Monitor** to ensure the fix holds over time
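The validate step can be enforced mechanically: re-run the evaluation suite before and after a fix, and accept it only if no tracked metric regresses. A sketch under the assumption that each dimension reduces to a score on the same held-out evaluation set:

```python
def fix_is_safe(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.01) -> bool:
    """Accept a fix only if no tracked metric regresses beyond tolerance.

    `before` and `after` map metric names (e.g. correctness, completeness)
    to scores measured on the same held-out evaluation set.
    """
    return all(after[m] >= before[m] - tolerance for m in before)
```

Gating prompt and knowledge-base changes on a check like this is what keeps a fix for one failure mode from quietly breaking another.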
Common Pitfalls
**Evaluating on vibes.** "It seems to work well" isn't evaluation. Without structured measurement, you won't catch gradual degradation.
**Over-indexing on accuracy.** 99% accuracy on the wrong metric is worthless. Measure what matters for the business outcome.
**Ignoring edge cases.** Production data is messier than test data. The 2% of inputs that don't fit your expected patterns are where failures hide.
**Evaluating once.** LLM performance changes — model updates, data drift, usage pattern shifts. Evaluation must be continuous.
Why This Matters
Clients ask us: "How do I know this AI is actually working?" Our answer: you measure it, rigorously and continuously. The evaluation framework is as important as the AI itself — it's what turns a demo into a production system you can trust.


