Validation results, not product theater.
Validation decides where AI scoring is ready, where it fails, and when responses should route to humans. The standard is not demo fluency; it is agreement, calibration, fairness, and replayability.
The validation package every pilot should produce.
Evalysis should be evaluated like a scoring system, not demoed like a chatbot. A pilot should produce agreement tables, routing thresholds, fairness checks, and examples that reviewers can inspect.
Anchor approval
Scoring leaders approve representative anchor responses before the panel is allowed to score at scale.
Blind comparison
Human ratings and AI ratings are compared at item, criterion, score-band, and subgroup levels.
Confidence routing
Scores are separated into auto-ready, review-needed, and human-only bands based on observed agreement.
Subgroup checks
Agreement deltas are reported across program-defined groups, language backgrounds, accommodations, and item types.
Trace replay
Every sampled decision can be replayed with the original response, rubric, agent versions, and rationale.
Failure cases
Items that should not be auto-scored are named explicitly: poor rubrics, hard-to-read responses, sensitive content, or unstable agreement.
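The confidence-routing step can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the threshold values, the `RoutingThresholds` type, the `route` function, and the `flagged` parameter are all hypothetical names chosen for the example. In a pilot, the cut points would come from observed agreement on the program's own human-scored responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingThresholds:
    # Hypothetical cut points; a real pilot would derive them from
    # observed agreement, not hard-code them.
    auto_ready_min: float = 0.90
    review_min: float = 0.60


def route(confidence: float,
          flagged: bool = False,
          thresholds: RoutingThresholds = RoutingThresholds()) -> str:
    """Assign one scored response to a routing band.

    `flagged` stands in for any named failure case (poor rubric,
    hard-to-read response, sensitive content, unstable agreement).
    """
    if flagged or confidence < thresholds.review_min:
        return "human-only"
    if confidence >= thresholds.auto_ready_min:
        return "auto-ready"
    return "review-needed"
```

A flagged response goes to a human regardless of confidence, so a named failure case cannot be overridden by a high model score.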
The report should answer four hard questions before launch.
The charts below are sample report components. In a real pilot, they are built from the customer's own human-scored responses, rubrics, subjects, and subgroup definitions.
Does it match human scoring?
Report exact, adjacent, and weighted agreement (for example, quadratic weighted kappa) against the program's own human-scored set.
Where does it fail?
Show item types, score bands, response quality levels, and subgroups where agreement drops.
When should humans review?
Set confidence thresholds and condition-code rules that route uncertain or sensitive work out of automation.
Can the decision be defended?
Provide a replayable trace: original response, parsed work, rubric criteria, panel votes, and final rationale.
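For the first question, the three agreement statistics can be computed directly from paired human and AI scores. The sketch below assumes integer scores on a shared scale and uses quadratic weighted kappa as the weighted statistic; the function names are illustrative, and a program may prefer a different weighting or a library implementation.

```python
from collections import Counter


def exact_agreement(human, ai):
    """Share of responses where the AI score equals the human score."""
    return sum(h == a for h, a in zip(human, ai)) / len(human)


def adjacent_agreement(human, ai):
    """Share of responses within one score point (exact matches included)."""
    return sum(abs(h - a) <= 1 for h, a in zip(human, ai)) / len(human)


def quadratic_weighted_kappa(human, ai, min_score, max_score):
    """Chance-corrected agreement that penalizes large misses more heavily."""
    if max_score <= min_score:
        raise ValueError("score scale needs at least two points")
    n = len(human)
    span_sq = (max_score - min_score) ** 2
    observed = Counter(zip(human, ai))
    human_counts = Counter(human)
    ai_counts = Counter(ai)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span_sq
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * (human_counts[i] / n) * (ai_counts[j] / n)
            )

    if expected_disagreement == 0:
        # Both raters gave the same single score everywhere; kappa is
        # undefined, so this sketch returns 1.0 by convention.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement
```

Reporting all three together matters: a high adjacent rate with a low exact rate points to a consistent one-point offset, which kappa alone would not make visible.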
Human scoring vs AI scoring
Sample chart component: exact agreement by item.
Rater stability before and after adjudication
Sample chart component: rater disagreement and panel settlement.
Confidence-binned accuracy
Sample chart component: the basis for review thresholds.
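One way to build the table behind this chart is to group responses by the confidence attached to the AI score and measure how often the AI matched the human score in each group. The sketch below assumes confidences in the range 0 to 1 and equal-width bins; `binned_accuracy` and its parameters are hypothetical names.

```python
def binned_accuracy(confidences, matches, n_bins=5):
    """Return per-bin counts and accuracy for equal-width confidence bins.

    confidences: floats in [0, 1] reported with each AI score (assumed).
    matches: booleans, True where the AI score equals the human score.
    """
    bins = [[0, 0] for _ in range(n_bins)]  # [responses, matches] per bin
    for conf, ok in zip(confidences, matches):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx][0] += 1
        bins[idx][1] += int(ok)

    rows = []
    for i, (count, hits) in enumerate(bins):
        rows.append({
            "low": i / n_bins,
            "high": (i + 1) / n_bins,
            "n": count,
            # Empty bins report None rather than a misleading 0.0.
            "accuracy": hits / count if count else None,
        })
    return rows
```

If accuracy does not rise with confidence across the bins, confidence is not a usable routing signal for that item type.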
Agreement across subgroups
Sample table component: deltas that trigger review.
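A subgroup table of this kind can be sketched as exact agreement per group minus overall exact agreement, with a flag when the gap exceeds a tolerance. The `max_delta` default of 0.05 is a placeholder assumption, not a recommended value, and the record layout is hypothetical; the program defines the groups and the trigger.

```python
from collections import defaultdict


def subgroup_deltas(records, max_delta=0.05):
    """Compare each subgroup's exact agreement with the overall rate.

    records: iterable of (subgroup, human_score, ai_score) tuples.
    max_delta: hypothetical tolerance; gaps larger than this are flagged.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [responses, matches]
    overall_n = 0
    overall_hits = 0
    for group, human_score, ai_score in records:
        hit = int(human_score == ai_score)
        totals[group][0] += 1
        totals[group][1] += hit
        overall_n += 1
        overall_hits += hit

    overall = overall_hits / overall_n
    rows = {}
    for group, (count, hits) in totals.items():
        rate = hits / count
        delta = rate - overall
        rows[group] = {
            "n": count,
            "agreement": rate,
            "delta": delta,
            "needs_review": abs(delta) > max_delta,
        }
    return overall, rows
```

The group size `n` is reported next to each delta because a large gap in a small group may reflect sampling noise rather than a real difference.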
What the customer should receive.
A useful validation brief should be readable by psychometricians, scoring leaders, legal reviewers, and technical teams. It needs both statistical summaries and concrete examples.
A launch decision, not just a slide deck.
The brief should say where AI scoring is ready, where human review remains required, and what alignment work is needed before scale.
Where validation should make us say no.
A credible system must be explicit about non-fit cases. Some items should remain human-only, and some deployments need more alignment before AI scoring is appropriate.
No stable rubric or anchor set
Low-quality submissions that humans cannot reliably read
Sensitive safeguarding or disciplinary content
Subgroup deltas that remain unexplained after review
Item types where confidence does not predict accuracy
Operational context where audit custody must remain local
Research context for scoring governance.
The Library is reserved for paper notes and field essays: LLM-as-judge, automated scoring, AI benchmarks, and assessment research.
