Constructed-Response Scoring with AI

What this topic means for scoring teams.

Why constructed responses are different

A constructed response can be partly right, use unexpected evidence, skip a step, show reasoning in a diagram, or mix notation with prose. The scoring system needs to evaluate the path to the answer, not only the final answer.

Partial credit needs a trace

Rubrics often award points for evidence selection, explanation quality, procedure, units, proof structure, or the absence of misconceptions. AI scoring should expose those criterion-level decisions so teachers and scoring leaders can inspect them.
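One way to picture a criterion-level trace is as a small record per rubric criterion, kept alongside the total. The sketch below is illustrative only: the class names (`CriterionDecision`, `ScoreTrace`), the fields, and the sample rubric are assumptions made for this example, not a description of how Evalysis or any cited system stores scores. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionDecision:
    """One rubric criterion, the points awarded, and the evidence behind them.

    Hypothetical structure for illustration; not an actual product schema.
    """
    criterion: str
    max_points: int
    awarded: int
    evidence: str   # quote from, or pointer into, the student response
    rationale: str  # short reason a reviewer can check against the rubric

    def __post_init__(self) -> None:
        # Reject impossible scores early so the trace stays trustworthy.
        if not 0 <= self.awarded <= self.max_points:
            raise ValueError(f"{self.criterion}: awarded points out of range")


@dataclass
class ScoreTrace:
    """All criterion-level decisions for a single response."""
    response_id: str
    decisions: list[CriterionDecision] = field(default_factory=list)

    @property
    def total(self) -> int:
        return sum(d.awarded for d in self.decisions)

    @property
    def max_total(self) -> int:
        return sum(d.max_points for d in self.decisions)

    def report(self) -> str:
        """Render a plain-text summary a teacher or scoring leader can inspect."""
        lines = [f"{self.response_id}: {self.total}/{self.max_total}"]
        for d in self.decisions:
            lines.append(
                f"  {d.criterion}: {d.awarded}/{d.max_points} ({d.rationale})"
            )
        return "\n".join(lines)


# Made-up example: a partly correct science response.
trace = ScoreTrace(
    "resp-001",
    [
        CriterionDecision("evidence selection", 2, 2,
                          "cites the temperature table", "relevant data chosen"),
        CriterionDecision("units", 1, 0,
                          "final value has no unit", "unit missing"),
        CriterionDecision("explanation quality", 2, 1,
                          "second paragraph", "mechanism named but not linked"),
    ],
)
print(trace.report())
```

Keeping the evidence and rationale next to each awarded point is the design choice that matters here: a reviewer can dispute one criterion without re-scoring the whole response.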

Where Evalysis fits

Evalysis normalizes handwriting, typed text, PDFs, diagrams, speech, and structured files into a response package, then scores with subject-specific agents and review gates.
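As a rough illustration of what "normalizing into a response package" could look like, the sketch below groups uploaded files by modality and flags anything it cannot classify. Everything in it is an assumption for the sake of the example: `ResponsePackage`, `build_package`, the suffix table, and the `needs_review` flag are hypothetical names, not the Evalysis API. Telling handwriting from diagrams would need an image classifier, which is outside this sketch, so both appear here as `image`.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical mapping from file suffix to modality; a real intake step would
# inspect file contents rather than trust the extension.
MODALITY_BY_SUFFIX = {
    ".txt": "typed_text",
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".wav": "speech",
    ".mp3": "speech",
    ".csv": "structured",
    ".json": "structured",
}


@dataclass
class ResponsePart:
    source: str    # original filename
    modality: str  # one of the values above, or "unknown"


@dataclass
class ResponsePackage:
    response_id: str
    subject: str
    parts: list[ResponsePart] = field(default_factory=list)
    needs_review: bool = False  # set when any part could not be classified


def build_package(response_id: str, subject: str,
                  filenames: list[str]) -> ResponsePackage:
    """Group a student's uploaded files into one package, tagged by modality."""
    package = ResponsePackage(response_id, subject)
    for name in filenames:
        modality = MODALITY_BY_SUFFIX.get(Path(name).suffix.lower())
        if modality is None:
            # Unrecognized input goes to a human instead of being guessed at.
            modality = "unknown"
            package.needs_review = True
        package.parts.append(ResponsePart(name, modality))
    return package


package = build_package("resp-002", "physics",
                        ["answer.pdf", "sketch.PNG", "notes.xyz"])
print([p.modality for p in package.parts], package.needs_review)
```

The subject field is carried on the package so a downstream step could pick a subject-specific scorer; how that selection works is not specified in the text above.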

Papers behind the topic.

These papers are the anchor shelf for this topic. The Library keeps the citation list short enough to be useful and close enough to assessment work to shape product decisions.

multi-agent · rubric alignment · interpretability

AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition

Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, Xiaoming Zhai · AAAI · 2026

Moves automated scoring away from one-shot grading by extracting rubric-relevant components before assigning scores, which is directly relevant to interpretable, audit-ready scoring workflows.

AES · benchmarks · rater bias

Exploring potential of large language models for automated essay scoring in education

Nimra Mughal, Ali Shariq Imran, Sher Muhammad Daudpota, Zenun Kastrati, Waheed Noor · Discover Artificial Intelligence · 2026

A current open-access AES study comparing GPT and Gemini on benchmark and classroom data, useful for tracking how LLM scoring performs under real rubric and rater-bias conditions.

multimodal · AES benchmark · writing traits

EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu · Findings of ACL · 2025

Adds a multimodal AES benchmark across lexical, sentence, and discourse traits, highlighting where current MLLMs still lag human evaluation.

higher education · validity · human agreement

Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

Andrea Gaggioli, Giuseppe Casaburi, Leonardo Ercolani, Francesco Collova, Pietro Torre, Fabrizio Davide · arXiv · 2025

A useful counterweight to optimistic scoring results: in a real higher-education setting, human-LLM agreement and within-model stability remained weak.

essay scoring · fairness · explainability

Examining the responsible use of zero-shot AI approaches to scoring essays

Matthew S. Johnson, Mo Zhang · Scientific Reports · 2024

Useful bridge between LLM scoring and assessment responsibility: accuracy is treated as only one part of fairness, explainability, privacy, and accountability.

science · constructed response · rationale

Applying Large Language Models and Chain-of-Thought for Automatic Scoring

Lee et al. · Computers and Education: Artificial Intelligence · 2024

Focuses on student-written science responses, making it relevant beyond essays and closer to rubric-based constructed-response scoring.

AES · human alignment · calibration

Are Large Language Models Good Essay Graders?

Anindita Kundu, Denilson Barbosa · arXiv · 2024

A useful cautionary read: LLM scores can diverge from human raters, especially without careful calibration and review design.

reliability · consistency · multi-model

On the Consistency of Automatic Scoring with Large Language Models

Mingfeng Xue, Xingyao Xiao, Yunting Liu, Mark Wilson · Educational and Psychological Measurement · 2026

Directly studies scoring consistency across LLMs, temperatures, and constructed-response datasets, with implications for multi-rater panel design.

Product context

Evalysis connects this research to practical scoring workflows: rubric setup, multimodal intake, judge panels, confidence routing, validation reports, and human review.
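To make "judge panels" and "confidence routing" concrete, here is a minimal sketch of one possible routing rule: accept the panel's median score only when the judges roughly agree and none of them reports low confidence, and send everything else to human review. The function name `route_response`, the thresholds, and the idea that each judge returns a self-reported confidence are all assumptions for illustration, not the product's actual logic.

```python
from statistics import median


def route_response(judge_scores: list[int],
                   confidences: list[float],
                   max_spread: int = 1,
                   min_confidence: float = 0.7) -> dict:
    """Decide whether a panel's score can be accepted or needs a human.

    judge_scores: one integer score per judge on the same rubric scale.
    confidences: one value in [0, 1] per judge (hypothetical self-report).
    max_spread, min_confidence: illustrative thresholds, not calibrated values.
    """
    if not judge_scores or len(judge_scores) != len(confidences):
        raise ValueError("need one confidence value per judge score")

    spread = max(judge_scores) - min(judge_scores)
    lowest_confidence = min(confidences)

    # Disagreement or a hesitant judge both send the response to review.
    if spread > max_spread or lowest_confidence < min_confidence:
        return {"route": "human_review", "score": None, "spread": spread}

    return {"route": "auto_accept", "score": median(judge_scores),
            "spread": spread}


print(route_response([3, 3, 4], [0.9, 0.8, 0.85]))  # judges agree closely
print(route_response([1, 4, 3], [0.9, 0.9, 0.9]))   # judges disagree
```

In practice the thresholds would need to be set against human-scored validation data; the consistency and agreement findings in the papers above are the reason a rule like this defaults to review rather than acceptance.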

Constructed-response scoring is where AI assessment gets real.
