LLM-as-judge is becoming assessment infrastructure.
The research on model judges is not only about chatbot leaderboards. It gives scoring teams a practical vocabulary for rubric fidelity, judge bias, calibration, abstention, and human agreement.
Rubric fidelity
Does the judge apply the actual criterion, or reward fluent-looking answers?
Bias and order effects
Does response position, verbosity, author style, language background, or model identity shift the score?
Calibration and abstention
Does the judge recognize when to route work to a human, request anchors, or lower its confidence?
Human agreement
Does the AI match expert raters at the item, criterion, subgroup, and score-band level?
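
As one illustration of the human-agreement question, the sketch below computes Cohen's kappa (chance-corrected agreement) between human and judge scores, overall and per group. It is a minimal sketch under stated assumptions, not a prescribed method: the row fields `human`, `judge`, and `criterion`, the helper names, and the sample scores are all hypothetical, and scores are treated as unordered categories. For ordinal rubric bands, a weighted kappa may be a better fit.

```python
from collections import Counter, defaultdict


def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical scores."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's marginal score counts.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def kappa_by_group(rows, key):
    """Kappa per value of `key` (hypothetical field, e.g. criterion or subgroup)."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        human_scores, judge_scores = groups[row[key]]
        human_scores.append(row["human"])
        judge_scores.append(row["judge"])
    return {g: cohen_kappa(h, j) for g, (h, j) in groups.items()}


# Made-up illustrative scores, not real data.
rows = [
    {"criterion": "thesis", "human": 1, "judge": 1},
    {"criterion": "thesis", "human": 2, "judge": 2},
    {"criterion": "thesis", "human": 3, "judge": 3},
    {"criterion": "evidence", "human": 1, "judge": 1},
    {"criterion": "evidence", "human": 1, "judge": 2},
    {"criterion": "evidence", "human": 2, "judge": 1},
    {"criterion": "evidence", "human": 2, "judge": 2},
]
per_criterion = kappa_by_group(rows, "criterion")
```

Passing a different `key`, such as a subgroup or score-band field, gives the same breakdown at that level. Reporting kappa per group rather than a single pooled figure helps surface cases where overall agreement looks acceptable but one criterion or subgroup is scored no better than chance.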
