LLM-as-judge for assessment and scoring.
Evalysis uses research on model-based judging as one part of a larger assessment system: rubrics, anchors, calibration, confidence routing, and human adjudication.
A practical overview.
A judge is not a scoring program by itself
A model can evaluate a response, but assessment requires evidence handling, policy, validation, auditability, fairness checks, and clear human boundaries.
Rubric fidelity is the center
The judge must apply the criterion being measured instead of rewarding fluency, length, style, or a familiar answer pattern.
Reliability comes from the workflow
Independent passes, critic review, calibration data, confidence thresholds, and human escalation make judging safer than relying on a one-shot model score.
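As an illustration of how independent passes, a confidence threshold, and human escalation can fit together, here is a minimal Python sketch. It is not Evalysis's implementation: the names `route_score`, `RoutingDecision`, and `min_agreement`, the 0.8 default, and the use of simple majority agreement as the confidence signal are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RoutingDecision:
    score: Optional[int]  # None when the item is escalated to a human
    agreement: float      # share of passes that voted for the most common score
    route: str            # "auto" or "human"


def route_score(pass_scores: Sequence[int], min_agreement: float = 0.8) -> RoutingDecision:
    """Combine independent judge passes and decide whether to escalate.

    Hypothetical policy: accept the most common score only when enough
    passes agree on it; otherwise hand the item to a human adjudicator.
    """
    if not pass_scores:
        raise ValueError("at least one judge pass is required")

    top_score, top_count = Counter(pass_scores).most_common(1)[0]
    agreement = top_count / len(pass_scores)

    if agreement >= min_agreement:
        return RoutingDecision(score=top_score, agreement=agreement, route="auto")
    # Low agreement: withhold the score rather than report an unstable one.
    return RoutingDecision(score=None, agreement=agreement, route="human")
```

With five passes scoring `[3, 3, 3, 3, 2]`, four of five agree, so the item is scored automatically. With three passes scoring `[1, 2, 3]`, no score reaches the threshold and the item goes to human review. A real threshold would be set from calibration data rather than picked by hand.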
Library topics that support this page.
LLM-as-judge for assessment
A research guide to LLM-as-judge reliability, rubric fidelity, bias, calibration, human agreement, and assessment use cases.
LLM judge reliability
How to evaluate LLM judge reliability across human agreement, calibration, score stability, confidence routing, and rubric-specific performance.
LLM judge bias and fairness
A guide to LLM judge bias, including response order effects, verbosity bias, model identity effects, subgroup checks, and human-in-the-loop controls.
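Two of the checks named in these topics, human agreement and response order effects, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `cohen_kappa` and `order_flip_rate` are hypothetical, the data layout is an assumption, and Cohen's kappa is used here as one common agreement statistic, not necessarily the one these guides recommend.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between human scores and judge scores."""
    if len(human) != len(judge) or not human:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(human)
    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement if each rater assigned scores independently
    # according to their own score frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[s] * judge_counts[s] for s in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical score throughout; kappa is undefined,
        # and returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def order_flip_rate(picks_original: Sequence[str], picks_swapped: Sequence[str]) -> float:
    """Share of pairwise comparisons whose winner changes when order is swapped.

    Each entry names the underlying response the judge preferred (for example
    "A" or "B"), not its position, so any mismatch indicates an order effect.
    """
    if len(picks_original) != len(picks_swapped) or not picks_original:
        raise ValueError("inputs must be non-empty and the same length")
    flips = sum(a != b for a, b in zip(picks_original, picks_swapped))
    return flips / len(picks_original)
```

In practice both numbers are more informative when computed per rubric criterion and per subgroup than as a single pooled figure, since a judge can agree well with humans overall while failing on one criterion.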