Notes on AI assessment.

A running publication for paper notes, research explainers, and field guides on automated scoring, LLM-as-judge, psychometrics, multimodal assessment, and AI evaluation.

Current shelf

Curated papers across assessment AI, LLM-as-judge, and AI evaluation.

Featured

LLM-as-judge is becoming assessment infrastructure.

The research on model judges is not only about chatbot leaderboards. It gives scoring teams a practical vocabulary for rubric fidelity, judge bias, calibration, abstention, and human agreement.

Read note

Why this matters

Rubric fidelity

Does the judge apply the actual criterion, or reward fluent-looking answers?

Bias and order effects

Does response position, verbosity, author style, language background, or model identity shift the score?
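One simple way to probe the order-effect question is to score each pair twice, once in each order, and count how often the preferred response changes. The sketch below assumes a hypothetical pairwise judge that returns "first" or "second"; the `Judge` type, the function names, and the toy judges are illustrative and do not come from any specific paper or library.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# takes (first_response, second_response) and returns "first" or "second".
Judge = Callable[[str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred response changes when the order is swapped."""
    if not pairs:
        raise ValueError("need at least one response pair")
    flips = 0
    for a, b in pairs:
        # Ask the judge with a shown first, then with b shown first.
        winner_ab = a if judge(a, b) == "first" else b
        winner_ba = b if judge(b, a) == "first" else a
        if winner_ab != winner_ba:
            flips += 1
    return flips / len(pairs)


def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers whatever is shown first."""
    return "first"


def longer_wins(first: str, second: str) -> str:
    """Toy judge with verbosity bias but no position bias."""
    return "first" if len(first) >= len(second) else "second"


pairs = [("x", "yy"), ("aaa", "b")]
position_bias = position_flip_rate(always_first, pairs)
verbosity_only = position_flip_rate(longer_wins, pairs)
```

A flip rate near zero only rules out position sensitivity. The `longer_wins` toy judge never flips yet still rewards length, so verbosity, author style, and model identity need their own checks.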

Calibration and abstention

Can the judge recognize when it should route work to a human, request anchors, or lower its confidence?
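As a minimal sketch of what routing could look like, assume the judge emits a score plus a confidence value in [0, 1]. The `JudgeOutput` type, the 0.8 threshold, and the route labels are all hypothetical, and whether the confidence is actually calibrated has to be checked against human scores rather than assumed.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class JudgeOutput:
    score: int
    confidence: float  # assumed to lie in [0, 1]; calibration is not guaranteed


def route(output: JudgeOutput, threshold: float = 0.8) -> str:
    """Send low-confidence scores to a human; the threshold is an illustrative default."""
    if not 0.0 <= output.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "auto" if output.confidence >= threshold else "human_review"


def coverage_and_accuracy(
    outputs: Sequence[JudgeOutput],
    human_scores: Sequence[int],
    threshold: float = 0.8,
) -> Tuple[float, float]:
    """Share of work scored automatically, and exact agreement with humans on that share."""
    if len(outputs) != len(human_scores) or not outputs:
        raise ValueError("outputs and human_scores must be non-empty and the same length")
    auto = [
        (o, h) for o, h in zip(outputs, human_scores) if route(o, threshold) == "auto"
    ]
    coverage = len(auto) / len(outputs)
    # With nothing auto-scored there is no accuracy to report; use NaN as a marker.
    accuracy = (
        sum(o.score == h for o, h in auto) / len(auto) if auto else float("nan")
    )
    return coverage, accuracy


outputs = [JudgeOutput(3, 0.95), JudgeOutput(2, 0.40), JudgeOutput(4, 0.85)]
coverage, accuracy = coverage_and_accuracy(outputs, human_scores=[3, 1, 2])
```

Sweeping the threshold and looking at coverage against accuracy shows how much human review a given agreement target would cost.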

Human agreement

Does the AI match expert raters at the item, criterion, subgroup, and score-band level?
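Quadratic weighted kappa is one common agreement statistic for ordinal scores, and computing it per subgroup is a quick way to see whether an acceptable overall number hides a weak slice. The sketch below is a plain-Python illustration; the function names and the toy data are assumptions, and the same grouping could be applied per item, criterion, or score band.

```python
from collections import Counter
from typing import Dict, Sequence


def quadratic_weighted_kappa(
    human: Sequence[int], ai: Sequence[int], min_score: int, max_score: int
) -> float:
    """Quadratic weighted kappa between two raters on an integer score scale."""
    if len(human) != len(ai) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("the scale needs at least two score points")
    n = len(human)
    observed = Counter(zip(human, ai))
    human_counts = Counter(human)
    ai_counts = Counter(ai)
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[(i, j)] / n
            den += weight * human_counts[i] * ai_counts[j] / (n * n)
    if den == 0:
        # Both raters used one identical score throughout; kappa is undefined.
        return float("nan")
    return 1.0 - num / den


def kappa_by_group(
    human: Sequence[int],
    ai: Sequence[int],
    groups: Sequence[str],
    min_score: int,
    max_score: int,
) -> Dict[str, float]:
    """Compute kappa separately for each subgroup label."""
    result = {}
    for g in sorted(set(groups)):
        idx = [k for k, label in enumerate(groups) if label == g]
        result[g] = quadratic_weighted_kappa(
            [human[k] for k in idx], [ai[k] for k in idx], min_score, max_score
        )
    return result


# Toy data: the AI matches humans for group "x" and reverses the scale for group "y".
human = [1, 2, 3, 4, 1, 2, 3, 4]
ai = [1, 2, 3, 4, 4, 3, 2, 1]
groups = ["x"] * 4 + ["y"] * 4
overall = quadratic_weighted_kappa(human, ai, 1, 4)
per_group = kappa_by_group(human, ai, groups, 1, 4)
```

In the toy data the overall kappa sits near zero while the two groups sit at opposite extremes, which is the kind of pattern a single pooled statistic would not reveal.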

Research guides

Focused guides for the core research topics.

Each guide has its own URL, metadata, structured data, related topics, and paper anchors. The Library remains the hub; the guide pages answer specific questions in depth.

LLM-as-Judge

LLM-as-judge for assessment

A research guide to LLM-as-judge reliability, rubric fidelity, bias, calibration, human agreement, and assessment use cases.

Open guide

Judge reliability

LLM judge reliability

How to evaluate LLM judge reliability across human agreement, calibration, score stability, confidence routing, and rubric-specific performance.

Open guide

Judge bias

LLM judge bias and fairness

A guide to LLM judge bias, including response order effects, verbosity bias, model identity effects, subgroup checks, and human-in-the-loop controls.

Open guide

Automated essay scoring

Automated essay scoring with LLMs

Research and practical guidance on automated essay scoring, AI essay grading, rubric alignment, human agreement, feedback, and fairness.

Open guide

Constructed response

Constructed-response scoring

How AI scoring applies to short answers, science explanations, math work, evidence-based writing, partial credit, and rubric-based constructed responses.

Open guide

Validation

AI assessment validation

A practical guide to validating AI scoring systems with human agreement, confidence routing, subgroup checks, calibration, and audit replay.

Open guide

Rubric evaluation

Rubric-based LLM evaluation

How rubric-based LLM evaluation connects AI benchmarks, LLM-as-judge methods, automated scoring, expert rubrics, and psychometric validation.

Open guide

Benchmarks

AI evaluation benchmarks

A guide to AI evaluation benchmarks, live benchmarks, contamination, expert rubrics, agent testing, and what assessment can borrow from AI evaluation.

Open guide

Multimodal assessment

Multimodal AI assessment

How multimodal AI assessment scores handwriting, math notation, diagrams, speech, PDFs, lab work, essays, and mixed student submissions.

Open guide

Latest notes

Short reads for scoring leaders and AI builders.

Validation

What makes AI scoring defensible?

Agreement is not enough. A credible pilot needs anchors, confidence routing, subgroup checks, trace replay, and named human-first boundaries.

Read

Workflow design

When a subject deserves its own workflow page

A new workflow earns its place when submission format, rubric criteria, review protocol, or reporting expectations change.

Read

Platform

Why multimodal grading changes the rubric conversation

Handwriting, diagrams, lab images, proofs, speech, and mixed-language work force scoring systems to cite evidence beyond typed text.

Read

Assessment AI

AI for scoring student work

Papers about automated essay scoring, constructed-response scoring, fairness, consistency, and explainability in education and assessment.

Automated essay scoring with LLMs · Constructed-response scoring · AI assessment validation · Multimodal AI assessment

multi-agent · rubric alignment · interpretability

AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition

Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, Xiaoming Zhai · AAAI · 2026

Moves automated scoring away from one-shot grading by extracting rubric-relevant components before assigning scores, which is directly relevant to interpretable, audit-ready scoring workflows.

Exploring potential of large language models for automated essay scoring in education

EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

Examining the responsible use of zero-shot AI approaches to scoring essays

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Are Large Language Models Good Essay Graders?

On the Consistency of Automatic Scoring with Large Language Models

LLM-as-judge as its own subdomain

Bias and Uncertainty in LLM-as-a-Judge Estimation

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

A Survey on LLM-as-a-Judge

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Large Language Models are not Fair Evaluators

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4

Benchmarks, testing, and evaluation design

Autorubric: Unifying Rubric-based LLM Evaluation

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

PaperBench: Evaluating AI's Ability to Replicate AI Research

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Humanity's Last Exam

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Holistic Evaluation of Language Models

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Measuring Massive Multitask Language Understanding

GPQA: A Graduate-Level Google-Proof Q&A Benchmark