Evalysis
Evalysis Library

AI evaluation benchmarks are starting to look like assessment programs.

The best benchmarks are no longer just leaderboards. They use expert items, contamination controls, rubric scoring, auditability, and multiple measures of performance.

What this topic means for scoring teams.

What benchmark design contributes

Modern benchmarks emphasize broad task coverage, resistance to contamination, expert-level difficulty, live updates, multi-metric reporting, and careful judge design. Assessment teams can apply each of these ideas directly when they evaluate AI scoring.

What assessment contributes back

Psychometrics brings reliability, validity, rater agreement, fairness checks, standard setting, and audit expectations. AI evaluation becomes stronger when it borrows those habits.
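Rater agreement is the easiest of these habits to make concrete. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, between a human rater and an AI judge who scored the same responses. The function name and the score lists are illustrative assumptions, not data or code from any paper on this page.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal score counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Made-up rubric scores (0-3) for eight responses.
human_scores = [2, 3, 1, 3, 2, 0, 1, 3]
judge_scores = [2, 3, 1, 2, 2, 0, 1, 3]
kappa = cohens_kappa(human_scores, judge_scores)
print(f"kappa = {kappa:.3f}")
```

For ordered rubric levels, a weighted kappa that gives partial credit for near misses is often a better fit; the unweighted form is shown here only because it is the shortest to read.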

A shared evaluation language

Evalysis uses this overlap to treat student scoring and AI judging as related problems: both need clear tasks, rubrics, calibrated judges, evidence records, and honestly stated limits.

Papers behind the topic.

These papers are the anchor shelf for this topic. The Library keeps the citation list short enough to be useful and close enough to assessment work to shape product decisions.

rubrics · psychometrics · evaluation framework

Autorubric: Unifying Rubric-based LLM Evaluation

Delip Rao, Chris Callison-Burch · arXiv · 2026

Connects rubric design, few-shot calibration, ensemble judging, bias mitigation, and psychometric reliability metrics in one practical evaluation framework.

live benchmark · rubric evaluation · high-stakes

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun · arXiv · 2026

A high-stakes example of live, temporally separated evaluation with case-specific rubrics, useful for thinking about contamination and open-ended expert scoring.
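Temporal separation can be illustrated with a small filter that keeps only items published after a model's training cutoff. This is a minimal sketch of the general idea, not LiveMedBench's pipeline; the Item class, the temporally_separated function, the buffer_days margin, and every date are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Item:
    """Hypothetical benchmark item with the date its source case first appeared."""
    item_id: str
    published: date


def temporally_separated(items, training_cutoff, buffer_days=0):
    """Keep items published more than buffer_days after the training cutoff."""
    return [
        item for item in items
        if (item.published - training_cutoff).days > buffer_days
    ]


# Illustrative data only; the cutoff and item dates are made up.
cutoff = date(2025, 6, 30)
pool = [
    Item("case-001", date(2025, 3, 14)),  # before cutoff: may have been seen in training
    Item("case-002", date(2025, 7, 10)),  # after cutoff but inside the 30-day buffer
    Item("case-003", date(2025, 9, 2)),   # clearly after cutoff
]
eligible = temporally_separated(pool, cutoff, buffer_days=30)
print([item.item_id for item in eligible])
```

The buffer is a design choice: reported training cutoffs are often approximate, so a margin trades a smaller item pool for more confidence that the remaining items are unseen.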

agents · research replication · rubrics

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan · arXiv · 2025

Raises the bar for agent evaluation by grading research replication against expert-authored hierarchical rubrics, including a separate judge-evaluation component.
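A hierarchical rubric can be pictured as a tree whose leaves are pass/fail checks and whose parent scores are weighted averages of their children. The sketch below shows that aggregation under assumed names (RubricNode, rubric_score) and made-up weights; it is an illustration of the idea, not PaperBench's grading code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """Hypothetical rubric node: a leaf carries a judgment, a parent carries children."""
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None  # set only on leaves
    children: List["RubricNode"] = field(default_factory=list)


def rubric_score(node):
    """Score in [0, 1]: leaves are 0 or 1, parents are weighted means of children."""
    if not node.children:
        if node.passed is None:
            raise ValueError(f"leaf '{node.name}' has not been judged")
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of '{node.name}' need positive total weight")
    return sum(c.weight * rubric_score(c) for c in node.children) / total_weight


# Made-up rubric for one replication attempt.
root = RubricNode("replication", children=[
    RubricNode("code", weight=2.0, children=[
        RubricNode("training script runs", passed=True),
        RubricNode("evaluation script runs", passed=False),
    ]),
    RubricNode("results match reported trend", weight=1.0, passed=True),
])
print(round(rubric_score(root), 3))
```

Keeping the per-leaf judgments, rather than only the root score, is what makes the result auditable: a reviewer can see which requirement failed and how much it cost.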

agent benchmarks · reward design · best practices

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang · arXiv · 2025

A practical checklist for avoiding reward-design and task-setup mistakes that can overstate or understate agent performance.

expert items · benchmark saturation · multimodal

Humanity's Last Exam

Long Phan et al. · arXiv · 2025

A broad, expert-written benchmark designed to resist saturation. Its academic questions are difficult, partly multimodal, and answered in multiple-choice or short-answer form.

contamination · live benchmark · objective scoring

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White et al. · ICLR Spotlight · 2025

A major response to benchmark contamination: frequently updated questions, objective scoring, and harder tasks across math, coding, reasoning, language, instruction following, and data analysis.

HELM · transparency · multi-metric

Holistic Evaluation of Language Models

Percy Liang et al. · arXiv / Stanford CRFM · 2022

Frames evaluation as a multi-metric problem: accuracy, robustness, fairness, bias, calibration, efficiency, and transparency.
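Calibration is the metric on that list least familiar to scoring teams, so a small example may help. One common way to quantify it is expected calibration error (ECE): group predictions by stated confidence and compare each group's average confidence with its actual accuracy. The sketch below uses equal-width bins and made-up data; it is not HELM's implementation, and the function name is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two equal-length, non-empty lists")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece


# Made-up example: a judge that says "90% sure" but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

A low ECE does not imply high accuracy; it only means stated confidence tracks observed accuracy, which is why the two are reported as separate measures.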

BIG-bench · coverage · capability

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

BIG-bench authors · TMLR · 2023

A large collaborative benchmark that treats evaluation as broad task coverage, not a single score.

MMLU · subjects · knowledge

Measuring Massive Multitask Language Understanding

Dan Hendrycks et al. · ICLR · 2021

A canonical benchmark across many subjects, useful when thinking about subject breadth and domain-specific difficulty.

expert items · science · scalable oversight

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein et al. · arXiv · 2023

Highlights the role of domain experts and difficult, search-resistant questions in evaluating high-skill reasoning.


Product context

Evalysis connects this research to practical scoring workflows: rubric setup, multimodal intake, judge panels, confidence routing, validation reports, and human review.
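As an illustration of how a judge panel and confidence routing might fit together, the sketch below accepts a score automatically when enough judges agree and sends it to human review otherwise. It is a hypothetical example, not a description of how Evalysis works: the route_score function, the use of panel agreement as the confidence signal, and the 0.75 threshold are all assumptions.

```python
from collections import Counter


def route_score(panel_scores, min_agreement=0.75):
    """Accept the modal panel score if agreement is high enough, else escalate."""
    if not panel_scores:
        raise ValueError("panel_scores must not be empty")
    modal_score, votes = Counter(panel_scores).most_common(1)[0]
    agreement = votes / len(panel_scores)
    if agreement >= min_agreement:
        return {"route": "auto", "score": modal_score, "agreement": agreement}
    # Low agreement: withhold the score and keep the evidence for a human reviewer.
    return {"route": "human_review", "score": None, "agreement": agreement}


# Made-up panel outputs for two student responses.
confident = route_score([3, 3, 3, 2])
uncertain = route_score([3, 2, 1, 3])
print(confident["route"], uncertain["route"])
```

The threshold is a policy decision rather than a technical one: lowering it reduces human workload but lets more contested scores through unreviewed.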