Evalysis

Validation results, not product theater.

Validation decides where AI scoring is ready, where it fails, and when responses should route to humans. The standard is not demo fluency; it is agreement, calibration, fairness, and replayability.

The validation package every pilot should produce.

Evalysis should be evaluated like a scoring system, not demoed like a chatbot. A pilot should produce agreement tables, routing thresholds, fairness checks, and examples that reviewers can inspect.

Anchor approval

Scoring leaders approve representative anchor responses before the AI scoring panel is allowed to score at scale.

Blind comparison

Human ratings and AI ratings are compared at item, criterion, score-band, and subgroup levels.

Confidence routing

Scores are separated into auto-ready, review-needed, and human-only bands based on observed agreement.

Subgroup checks

Agreement deltas are reported across program-defined groups, language backgrounds, accommodations, and item types.

Trace replay

Every sampled decision can be replayed with the original response, rubric, agent versions, and rationale.

Failure cases

Cases that should not be auto-scored are named explicitly: items with poor rubrics, hard-to-read responses, sensitive content, and item types with unstable agreement.

The report should answer four hard questions before launch.

The charts below are sample report components. In a real pilot, the values are replaced with the customer's own human-scored responses, rubrics, subjects, and subgroup definitions.

Does it match human scoring?

Report exact agreement, adjacent agreement, and a chance-corrected weighted statistic (for example, quadratic weighted kappa) against the program's own human-scored set.
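As a rough illustration of what those three numbers mean, here is a minimal sketch in Python. It assumes integer scores from 0 to num_levels - 1, counts "adjacent" as within one score point (so it includes exact matches), and uses quadratic weighted kappa as the weighted statistic. The function name and these conventions are assumptions for illustration, not Evalysis's reporting code.

```python
from collections import Counter


def agreement_summary(human, ai, num_levels):
    """Exact agreement, adjacent agreement, and quadratic weighted kappa.

    Assumes integer scores in the range 0 .. num_levels - 1.
    """
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    if num_levels < 2:
        raise ValueError("need at least two score levels")

    n = len(human)
    exact = sum(h == a for h, a in zip(human, ai)) / n
    # "Adjacent" here means within one score point, exact matches included.
    adjacent = sum(abs(h - a) <= 1 for h, a in zip(human, ai)) / n

    # Quadratic weighted kappa:
    # 1 - (observed weighted disagreement / chance-expected disagreement).
    observed = Counter(zip(human, ai))
    human_marginal = Counter(human)
    ai_marginal = Counter(ai)
    max_dist = (num_levels - 1) ** 2
    observed_dis = 0.0
    expected_dis = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / max_dist
            observed_dis += weight * observed[(i, j)] / n
            expected_dis += weight * human_marginal[i] * ai_marginal[j] / (n * n)

    # If every score is identical on both sides, expected disagreement is 0.
    qwk = 1.0 if expected_dis == 0 else 1.0 - observed_dis / expected_dis
    return {"exact": exact, "adjacent": adjacent, "qwk": qwk}
```

Running the same function per item, per criterion, and per score band gives the rows of an agreement table. Comparing two human raters with the same function gives a human-human baseline to read the human-AI figures against.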

Where does it fail?

Show item types, score bands, response quality levels, and subgroups where agreement drops.

When should humans review?

Set confidence thresholds and condition-code rules that route uncertain or sensitive work out of automation.
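A routing rule of this kind can be sketched as below. The threshold values, the condition-code names, and the RoutingPolicy and route names are all hypothetical placeholders. In a real pilot the thresholds would come from observed agreement and the codes from the program's own scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical values; a pilot would set these from observed agreement.
    auto_min: float = 0.90
    review_min: float = 0.60
    # Hypothetical condition codes that always force human scoring.
    human_only_codes: frozenset = frozenset(
        {"sensitive_content", "illegible", "off_topic"}
    )


def route(confidence, condition_codes, policy=RoutingPolicy()):
    """Return 'auto-ready', 'review-needed', or 'human-only'."""
    # Condition codes take priority over confidence: a confident score on
    # sensitive content still goes to a human.
    if set(condition_codes) & policy.human_only_codes:
        return "human-only"
    if confidence >= policy.auto_min:
        return "auto-ready"
    if confidence >= policy.review_min:
        return "review-needed"
    return "human-only"
```

Checking condition codes before confidence is a deliberate choice in this sketch: it keeps sensitive work out of automation regardless of how confident the score is.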

Can the decision be defended?

Provide a replayable trace: original response, parsed work, rubric criteria, panel votes, and final rationale.
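One way to picture a replayable trace is as an immutable record with a content fingerprint, as in the sketch below. The field names follow the elements listed above, plus the agent versions mentioned under trace replay. The ScoringTrace class, the fingerprint method, and the rescore callable are hypothetical and do not describe Evalysis's actual storage format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoringTrace:
    """Hypothetical record of one scoring decision."""
    response_id: str
    original_response: str
    parsed_work: str
    rubric_criteria: tuple
    agent_versions: tuple
    panel_votes: tuple
    final_score: int
    final_rationale: str

    def fingerprint(self):
        # Stable hash of the full record, so later edits are detectable.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replay_matches(trace, rescore):
    """Re-run scoring on the stored inputs and compare with the stored score.

    rescore is an assumed callable that takes a ScoringTrace and returns
    (score, rationale) using the agent versions recorded in the trace.
    """
    score, _rationale = rescore(trace)
    return score == trace.final_score
```

Storing the fingerprint alongside the trace lets a reviewer confirm that the record being replayed is the record that was originally written.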

Agreement

Human scoring vs AI scoring

Sample chart component: exact agreement by item.

Reliability

Rater stability before and after adjudication

Sample chart component: rater disagreement and panel settlement.

Confidence

Confidence-binned accuracy

Sample chart component: the basis for review thresholds.
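The calculation behind such a chart can be sketched as follows, assuming each response has a confidence between 0 and 1 and a flag for whether the AI score matched the human score. The function names, the bin count, and the rule for picking a threshold are illustrative assumptions.

```python
def binned_accuracy(confidences, correct, num_bins=5):
    """Group responses into equal-width confidence bins and report accuracy."""
    bins = [{"n": 0, "hits": 0} for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # A confidence of exactly 1.0 belongs in the top bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx]["n"] += 1
        bins[idx]["hits"] += int(ok)

    rows = []
    for i, b in enumerate(bins):
        accuracy = b["hits"] / b["n"] if b["n"] else None
        rows.append({
            "low": i / num_bins,
            "high": (i + 1) / num_bins,
            "n": b["n"],
            "accuracy": accuracy,
        })
    return rows


def lowest_auto_threshold(rows, target_accuracy, min_count):
    """Lowest bin edge where that bin and every bin above it meet the target.

    Returns None if even the top bin falls short or has too few responses.
    """
    threshold = None
    for row in reversed(rows):
        too_few = row["n"] < min_count
        if too_few or row["accuracy"] < target_accuracy:
            break
        threshold = row["low"]
    return threshold
```

If accuracy does not rise with confidence across the bins, confidence is not a usable routing signal for that item type, which is itself a finding worth reporting.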

Fairness

Agreement across subgroups

Sample table component: deltas that trigger review.
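A table like this can be built from per-response records, as in the sketch below. It uses exact agreement as the metric and compares each subgroup with the overall rate. The max_delta and min_n defaults are hypothetical; a program would set its own trigger and minimum sample size.

```python
from collections import defaultdict


def subgroup_deltas(records, max_delta=0.05, min_n=30):
    """Exact-agreement rate per subgroup, compared with the overall rate.

    records: iterable of (group, human_score, ai_score) tuples.
    max_delta and min_n are hypothetical defaults, not recommended values.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [count, exact matches]
    all_n = 0
    all_hits = 0
    for group, human_score, ai_score in records:
        hit = int(human_score == ai_score)
        totals[group][0] += 1
        totals[group][1] += hit
        all_n += 1
        all_hits += hit
    if all_n == 0:
        raise ValueError("no records supplied")

    overall = all_hits / all_n
    table = []
    for group, (n, hits) in sorted(totals.items()):
        rate = hits / n
        delta = rate - overall
        if n < min_n:
            # Too few responses to trust the delta in either direction.
            status = "insufficient-sample"
        elif abs(delta) > max_delta:
            status = "review"
        else:
            status = "ok"
        table.append({"group": group, "n": n, "agreement": rate,
                      "delta": delta, "status": status})
    return overall, table
```

Small subgroups are flagged separately rather than passed as "ok", since a small sample can hide a real gap as easily as it can exaggerate one.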

What the customer should receive.

A useful validation brief should be readable by psychometricians, scoring leaders, legal reviewers, and technical teams. It needs both statistical summaries and concrete examples.

Pilot output

A launch decision, not just a slide deck.

The brief should say where AI scoring is ready, where human review remains required, and what alignment work is needed before scale.

Per-item human-human and human-AI agreement
Criterion-level agreement and disagreement examples
Confidence-binned accuracy and routing thresholds
Subgroup fairness table with review triggers
Sample traces for agreed, disagreed, and escalated responses
Deployment recommendation: cloud, private cloud, or on-prem

Where validation should make us say no.

A credible system must be explicit about non-fit cases. Some items should remain human-only, and some deployments need more alignment before AI scoring is appropriate.

Human-first boundary

No stable rubric or anchor set
Low-quality submissions that humans cannot reliably read
Sensitive safeguarding or disciplinary content
Subgroup deltas that remain unexplained after review
Item types where confidence does not predict accuracy
Operational context where audit custody must remain local

Research context for scoring governance.

The Library is reserved for paper notes and field essays: LLM-as-judge, automated scoring, AI benchmarks, and assessment research.