Evalysis Library

Rubric-based LLM evaluation connects AI benchmarks and assessment.

The same ideas used to evaluate agents and model outputs can help score student work, as long as rubrics are calibrated, audited, and tied to human standards.

What this topic means for scoring teams.

Why rubrics are the bridge

Rubrics turn subjective judgment into named criteria. In AI evaluation, they define what makes an answer useful. In education, they define what evidence earns credit. LLM judges become more defensible when those criteria are explicit.
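One way to picture "named criteria" is as a small data structure that a human scorer or an LLM judge fills in. The sketch below is illustrative only: the `Criterion` class, the three example criteria, and the point scales are assumptions made for this example, not a schema taken from any of the papers or from Evalysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str        # short label a human scorer would recognize
    descriptor: str  # the evidence that earns credit
    max_points: int  # top of the scale for this criterion


# Hypothetical rubric for a short science explanation.
RUBRIC = [
    Criterion("claim", "States a claim that answers the question", 2),
    Criterion("evidence", "Cites data that supports the claim", 3),
    Criterion("reasoning", "Explains why the evidence supports the claim", 3),
]


def total_score(awarded: dict[str, int], rubric: list[Criterion]) -> int:
    """Sum per-criterion points, rejecting unknown or out-of-range values."""
    by_name = {c.name: c for c in rubric}
    unknown = set(awarded) - set(by_name)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    total = 0
    for c in rubric:
        points = awarded.get(c.name, 0)  # a missing criterion earns no credit
        if not 0 <= points <= c.max_points:
            raise ValueError(f"{c.name}: {points} outside 0..{c.max_points}")
        total += points
    return total


print(total_score({"claim": 2, "evidence": 2, "reasoning": 1}, RUBRIC))
```

Keeping the descriptor next to each score means every awarded point can be traced back to a stated piece of evidence, which is what makes a judge's output auditable.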

From benchmarks to scoring programs

AI benchmarks increasingly use expert-authored rubrics and judge-evaluation layers. Assessment programs can borrow those ideas while adding stronger requirements for human agreement, fairness, and operational review.

The practical design pattern

Start with the rubric, collect anchor responses, run independent judge passes, compare the judge scores with human scores, calibrate thresholds, and publish the limits of automation alongside the scores.
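The middle of that pattern, combining judge passes and comparing them with humans, can be sketched in a few lines. This is a minimal illustration under stated assumptions: scores are integers on a 0..`max_score` scale, passes are combined by median, and agreement is measured with quadratic weighted kappa, a statistic commonly used for ordinal scores. The function names and the sample data are hypothetical.

```python
from statistics import median


def consensus(passes: list[int]) -> int:
    """Combine independent judge passes into one score (median, truncated)."""
    return int(median(passes))


def quadratic_weighted_kappa(a: list[int], b: list[int], max_score: int) -> float:
    """Agreement between two score lists on a 0..max_score integer scale."""
    if max_score < 1:
        raise ValueError("max_score must be at least 1")
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    k = max_score + 1
    n = len(a)
    # Observed joint counts of (score in a, score in b).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2       # penalty grows with distance
            num += w * observed[i][j]             # observed disagreement
            den += w * hist_a[i] * hist_b[j] / n  # disagreement expected by chance
    return 1.0 - num / den if den else 1.0


# Made-up data: three judge passes per response, plus one human score each.
judge_passes = [[3, 3, 2], [1, 2, 1], [4, 4, 4], [0, 1, 0], [2, 3, 3]]
human_scores = [3, 1, 4, 0, 2]

judge_scores = [consensus(p) for p in judge_passes]
kappa = quadratic_weighted_kappa(judge_scores, human_scores, max_score=4)
print(judge_scores, round(kappa, 3))
```

The agreement level that counts as acceptable is a program decision, so the sketch reports the statistic without applying a cutoff.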

Papers behind the topic.

These papers are the anchor shelf for this topic. The Library keeps the citation list short enough to be useful and close enough to assessment work to shape product decisions.

rubrics · psychometrics · evaluation framework

Autorubric: Unifying Rubric-based LLM Evaluation

Delip Rao, Chris Callison-Burch · arXiv · 2026

Connects rubric design, few-shot calibration, ensemble judging, bias mitigation, and psychometric reliability metrics in one practical evaluation framework.

live benchmark · rubric evaluation · high-stakes

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun · arXiv · 2026

A high-stakes example of live, temporally separated evaluation with case-specific rubrics, useful for thinking about contamination and open-ended expert scoring.

agents · research replication · rubrics

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan · arXiv · 2025

Raises the bar for agent evaluation by grading research replication against expert-authored hierarchical rubrics, including a separate judge-evaluation component.

agent benchmarks · reward design · best practices

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang · arXiv · 2025

A practical checklist for avoiding reward-design and task-setup mistakes that can overstate or understate agent performance.

expert items · benchmark saturation · multimodal

Humanity's Last Exam

Long Phan et al. · arXiv · 2025

A broad, expert-written benchmark aimed at resisting saturation with difficult multimodal, multiple-choice, and short-answer academic questions.

contamination · live benchmark · objective scoring

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White et al. · ICLR Spotlight · 2025

A major response to benchmark contamination: frequently updated questions, objective scoring, and harder tasks across math, coding, reasoning, language, instruction following, and data analysis.

HELM · transparency · multi-metric

Holistic Evaluation of Language Models

Percy Liang et al. · arXiv / Stanford CRFM · 2022

Frames evaluation as a multi-metric problem: accuracy, robustness, fairness, bias, calibration, efficiency, and transparency.

BIG-bench · coverage · capability

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

BIG-bench authors · TMLR · 2023

A large collaborative benchmark that treats evaluation as broad task coverage, not a single score.

MMLU · subjects · knowledge

Measuring Massive Multitask Language Understanding

Dan Hendrycks et al. · ICLR · 2021

A canonical benchmark across many subjects, useful when thinking about subject breadth and domain-specific difficulty.

expert items · science · scalable oversight

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein et al. · arXiv · 2023

Highlights the role of domain experts and difficult, search-resistant questions in evaluating high-skill reasoning.

Product context

Evalysis connects this research to practical scoring workflows: rubric setup, multimodal intake, judge panels, confidence routing, validation reports, and human review.
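As a generic illustration of confidence routing, and not a description of how Evalysis implements it, a response can go to human review whenever independent judge passes disagree by more than a chosen tolerance. The `route` function, its labels, and the default `max_spread` of 1 are assumptions for this sketch.

```python
def route(passes: list[int], max_spread: int = 1) -> str:
    """Return 'auto' when judge passes agree closely, else 'human_review'."""
    if len(passes) < 2:
        return "human_review"  # a single pass gives no evidence of agreement
    spread = max(passes) - min(passes)
    return "auto" if spread <= max_spread else "human_review"


for passes in ([3, 3, 2], [1, 4, 2], [3]):
    print(passes, route(passes))
```

The tolerance would normally be set from validation data, by checking how often automatically routed scores match human scores.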