Evalysis

Notes on AI assessment.

A running publication for paper notes, research explainers, and field guides on automated scoring, LLM-as-judge, psychometrics, multimodal assessment, and AI evaluation.

Current shelf
28 curated papers across assessment AI, LLM-as-judge, and AI evaluation.

Featured

LLM-as-judge is becoming assessment infrastructure.

The research on model judges is not only about chatbot leaderboards. It gives scoring teams a practical vocabulary for rubric fidelity, judge bias, calibration, abstention, and human agreement.

Read note
Why this matters

Rubric fidelity

Does the judge apply the actual criterion, or reward fluent-looking answers?

Bias and order effects

Does response position, verbosity, author style, language background, or model identity shift the score?

Calibration and abstention

Can the judge recognize when it should route work to a human, request anchors, or report lower confidence?

Human agreement

Does the AI match expert raters at the item, criterion, subgroup, and score-band level?
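
One common way to put a number on the human-agreement question is quadratic weighted kappa (QWK), computed overall and again per subgroup. The sketch below is a minimal illustration, not a method taken from any paper on this shelf. The function names, the integer score scale, and the subgroup labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def quadratic_weighted_kappa(human, judge, min_score, max_score):
    """Agreement between two integer score lists, penalizing large gaps more.

    Assumes scores are integers in [min_score, max_score] (hypothetical scale).
    """
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score points")

    n = len(human)
    span = (max_score - min_score) ** 2
    observed = Counter(zip(human, judge))   # joint counts of (human, judge) pairs
    human_hist = Counter(human)             # marginal counts per rater
    judge_hist = Counter(judge)

    disagreement = 0.0   # weighted observed disagreement
    chance = 0.0         # weighted disagreement expected from the marginals alone
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span
            disagreement += weight * observed[(i, j)] / n
            chance += weight * human_hist[i] * judge_hist[j] / (n * n)

    if chance == 0:
        # Both raters gave the same single score to every response.
        return 1.0
    return 1.0 - disagreement / chance


def kappa_by_group(human, judge, groups, min_score, max_score):
    """QWK computed separately for each subgroup label (labels are illustrative)."""
    buckets = defaultdict(lambda: ([], []))
    for h, j, g in zip(human, judge, groups):
        buckets[g][0].append(h)
        buckets[g][1].append(j)
    return {
        g: quadratic_weighted_kappa(hs, js, min_score, max_score)
        for g, (hs, js) in buckets.items()
    }
```

Reporting only the overall figure can hide a subgroup or score band where the judge and the human raters diverge, which is why the per-group helper sits next to the aggregate. The same grouping idea extends to criteria or score bands by changing what is passed as `groups`.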

Research guides

Focused guides for the core research topics.

Each guide has its own URL, metadata, structured data, related topics, and paper anchors. The Library remains the hub; the guide pages answer specific questions in depth.

LLM-as-Judge

LLM-as-judge for assessment

A research guide to LLM-as-judge reliability, rubric fidelity, bias, calibration, human agreement, and assessment use cases.

Open guide
Judge reliability

LLM judge reliability

How to evaluate LLM judge reliability across human agreement, calibration, score stability, confidence routing, and rubric-specific performance.

Open guide
Judge bias

LLM judge bias and fairness

A guide to LLM judge bias, including response order effects, verbosity bias, model identity effects, subgroup checks, and human-in-the-loop controls.

Open guide
Automated essay scoring

Automated essay scoring with LLMs

Research and practical guidance on automated essay scoring, AI essay grading, rubric alignment, human agreement, feedback, and fairness.

Open guide
Constructed response

Constructed-response scoring

How AI scoring applies to short answers, science explanations, math work, evidence-based writing, partial credit, and rubric-based constructed responses.

Open guide
Validation

AI assessment validation

A practical guide to validating AI scoring systems with human agreement, confidence routing, subgroup checks, calibration, and audit replay.

Open guide
Rubric evaluation

Rubric-based LLM evaluation

How rubric-based LLM evaluation connects AI benchmarks, LLM-as-judge methods, automated scoring, expert rubrics, and psychometric validation.

Open guide
Benchmarks

AI evaluation benchmarks

A guide to AI evaluation benchmarks, live benchmarks, contamination, expert rubrics, agent testing, and what assessment can borrow from AI evaluation.

Open guide
Multimodal assessment

Multimodal AI assessment

How multimodal AI assessment scores handwriting, math notation, diagrams, speech, PDFs, lab work, essays, and mixed student submissions.

Open guide
Latest notes

Short reads for scoring leaders and AI builders.

Assessment AI

AI for scoring student work

Papers about automated essay scoring, constructed-response scoring, fairness, consistency, and explainability in education and assessment.

multi-agent · rubric alignment · interpretability

AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition

Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, Xiaoming Zhai · AAAI · 2026

Moves automated scoring away from one-shot grading by extracting rubric-relevant components before assigning scores, which is directly relevant to interpretable, audit-ready scoring workflows.

Source
AES · benchmarks · rater bias

Exploring potential of large language models for automated essay scoring in education

Nimra Mughal, Ali Shariq Imran, Sher Muhammad Daudpota, Zenun Kastrati, Waheed Noor · Discover Artificial Intelligence · 2026

A current open-access AES study comparing GPT and Gemini on benchmark and classroom data, useful for tracking how LLM scoring performs under real rubric and rater-bias conditions.

Source
multimodal · AES benchmark · writing traits

EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu · Findings of ACL · 2025

Adds a multimodal AES benchmark across lexical, sentence, and discourse traits, highlighting where current MLLMs still lag human evaluation.

Source
higher education · validity · human agreement

Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

Andrea Gaggioli, Giuseppe Casaburi, Leonardo Ercolani, Francesco Collova, Pietro Torre, Fabrizio Davide · arXiv · 2025

A useful counterweight to optimistic scoring results: in a real higher-education setting, human-LLM agreement and within-model stability remained weak.

Source
essay scoring · fairness · explainability

Examining the responsible use of zero-shot AI approaches to scoring essays

Matthew S. Johnson, Mo Zhang · Scientific Reports · 2024

A useful bridge between LLM scoring and assessment responsibility: accuracy is treated as only one requirement, alongside fairness, explainability, privacy, and accountability.

Source
science · constructed response · rationale

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Lee et al. · Computers and Education: Artificial Intelligence · 2024

Focuses on student-written science responses, making it relevant beyond essays and closer to rubric-based constructed response scoring.

Source
AES · human alignment · calibration

Are Large Language Models Good Essay Graders?

Anindita Kundu, Denilson Barbosa · arXiv · 2024

A useful cautionary read: LLM scores can diverge from human raters, especially without careful calibration and review design.

Source
reliability · consistency · multi-model

On the Consistency of Automatic Scoring with Large Language Models

Mingfeng Xue, Xingyao Xiao, Yunting Liu, Mark Wilson · Educational and Psychological Measurement · 2026

Directly studies scoring consistency across LLMs, temperatures, and constructed-response datasets, with implications for multi-rater panel design.

Source
Track question

When can AI scoring be accurate enough, fair enough, and explainable enough for real assessment programs?

LLM-as-Judge

LLM-as-judge as its own subdomain

The fast-moving research area where models evaluate open-ended outputs, compare responses, apply rubrics, and expose judge bias.

uncertainty · calibration · statistics

Bias and Uncertainty in LLM-as-a-Judge Estimation

James Fiedler · arXiv · 2026

Sharpens the statistics behind judge outputs by showing how corrected estimates can still become unreliable when judge quality or calibration shifts across compared models.

Source
reliability · conformal prediction · transitivity

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Manan Gupta, Dhruv Kumar · arXiv · 2026

Looks beneath aggregate agreement by exposing per-document instability, transitivity failures, and criterion-specific reliability differences.

Source
selection bias · pairwise judging · debiasing

CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

Haitao Li, Junjie Chen, Qingyao Ai, Zhumin Chu, Yujia Zhou, Qian Dong, Yiqun Liu · ACL · 2025

Gives a concrete inference-time method for reducing option-position and ID-token selection bias in pairwise judge decisions.

Source
judge alignment · vulnerabilities · leniency bias

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes · GEM · 2025

Shows why simple percent agreement is not enough: judge models can align broadly with humans while still drifting in score scale, leniency, and prompt sensitivity.

Source
survey · judge reliability · bias

A Survey on LLM-as-a-Judge

Jiawei Gu et al. · arXiv · 2024

A broad map of judge reliability, bias mitigation, consistency, and deployment challenges.

Source
MT-Bench · Chatbot Arena · bias

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng et al. · NeurIPS Datasets and Benchmarks · 2023

Foundational paper for response judging, pairwise preference, position bias, verbosity bias, and human agreement.

Source
rubrics · NLG evaluation · human alignment

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu et al. · EMNLP · 2023

Shows how form-filling and rubric-style prompts can improve alignment with human judgments for generated text evaluation.

Source
open judge · rubric · feedback

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim et al. · ICLR · 2024

Important for custom rubrics: open evaluator models can be trained for fine-grained feedback and score criteria.

Source
position bias · calibration · human-in-loop

Large Language Models are not Fair Evaluators

Peiyi Wang et al. · ACL · 2024

A direct warning that judge outputs can be manipulated by response order, which calls for balanced response positions and human-in-the-loop protocols.

Source
fine-tuned judge · generalization · fairness

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4

Hui Huang et al. · arXiv · 2024

Useful for product design: fine-tuned judges may be strong in-domain yet weaker on generalization, fairness, and aspect-specific evaluation.

Source
Track question

How do we make AI judging reliable when the evidence is subjective, long-form, multimodal, or politically consequential?

AI evaluation

Benchmarks, testing, and evaluation design

Work on how AI systems are measured, stress-tested, compared, and audited across capabilities, risks, calibration, and expert knowledge.

rubrics · psychometrics · evaluation framework

Autorubric: Unifying Rubric-based LLM Evaluation

Delip Rao, Chris Callison-Burch · arXiv · 2026

Connects rubric design, few-shot calibration, ensemble judging, bias mitigation, and psychometric reliability metrics in one practical evaluation framework.

Source
live benchmark · rubric evaluation · high-stakes

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun · arXiv · 2026

A high-stakes example of live, temporally separated evaluation with case-specific rubrics, useful for thinking about contamination and open-ended expert scoring.

Source
agents · research replication · rubrics

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan · arXiv · 2025

Raises the bar for agent evaluation by grading research replication against expert-authored hierarchical rubrics, including a separate judge-evaluation component.

Source
agent benchmarks · reward design · best practices

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang · arXiv · 2025

A practical checklist for avoiding reward-design and task-setup mistakes that can overstate or understate agent performance.

Source
expert items · benchmark saturation · multimodal

Humanity's Last Exam

Long Phan et al. · arXiv · 2025

A broad, expert-written benchmark aimed at resisting saturation with difficult multimodal, multiple-choice, and short-answer academic questions.

Source
contamination · live benchmark · objective scoring

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White et al. · ICLR Spotlight · 2025

A major response to benchmark contamination: frequently updated questions, objective scoring, and harder tasks across math, coding, reasoning, language, instruction following, and data analysis.

Source
HELM · transparency · multi-metric

Holistic Evaluation of Language Models

Percy Liang et al. · arXiv / Stanford CRFM · 2022

Frames evaluation as a multi-metric problem: accuracy, robustness, fairness, bias, calibration, efficiency, and transparency.

Source
BIG-bench · coverage · capability

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

BIG-bench authors · TMLR · 2023

A large collaborative benchmark that treats evaluation as broad task coverage, not a single score.

Source
MMLU · subjects · knowledge

Measuring Massive Multitask Language Understanding

Dan Hendrycks et al. · ICLR · 2021

A canonical benchmark across many subjects, useful when thinking about subject breadth and domain-specific difficulty.

Source
expert items · science · scalable oversight

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein et al. · arXiv · 2023

Highlights the role of domain experts and difficult, search-resistant questions in evaluating high-skill reasoning.

Source
Track question

What can human assessment borrow from AI evaluation, and what can AI evaluation borrow from psychometrics?