Evalysis

Open-the-box scoring, with an SME report.

Try the cloud path with a single student response or a batch upload. The workflow is built around expert review: rubric alignment, subject reasoning, score rationale, feedback, and escalation notes.

Try a sample

From messy submissions to controlled scoring.

Evalysis is organized around four product decisions: what student work must be handled, which subject rules apply, how scores are challenged, and where the scoring run can legally operate.

01

Multimodal by default

Student work arrives as essays, handwriting, math notation, diagrams, lab photos, speech, short answers, tables, and mixed-language responses. Evalysis turns that into structured scoring inputs.

  • Text and handwriting
  • Math notation and proof structure
  • Images, diagrams, speech, and languages
02

Broad subject coverage

Score school, exam, tutoring, and professional subjects with item-specific rubrics and reporting views, without building a custom mini-product for every format.

  • Humanities and social sciences
  • STEM and lab work
  • Languages, arts, business, medicine, law, and vocational tracks
03

Multi-agent review

Independent raters, critics, adjudicators, calibrators, fairness reviewers, and audit loggers create a defensible scoring chain.

  • Double-rating and challenge
  • Confidence-based routing
  • Replayable decision trail
04

Deployment for the stakes

Open-the-box cloud for pilots and tutoring schools; private cloud or on-prem for high-stakes exams with strict data residency.

  • Zero-shot
  • Human-in-loop
  • Fine-tuned alignment

Specialist panels for high-stakes scoring.

setup · multimodal intake · blind scoring · adjudication · QA
Setup
Setup agents read the rubric, build the anchor set, and link items to your knowledge graph.
Multimodal
Perception agents decode handwriting, speech, math notation, and lab imagery into a clean canonical form.
Rater Panel
The rater panel scores blindly: two independent personas, plus separate dimensions for conventions and process.
Adjudication
On disagreement, a critic challenges, an adjudicator decides, and a calibrator reports confidence and routing.
QA & Fairness
QA agents inject validity papers mid-batch, monitor drift, audit subgroups, and back-read random samples.
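
As a rough illustration of the rater-panel and adjudication stages above, the sketch below combines two blind scores and decides whether a response should be escalated. It is not Evalysis code: the function name `adjudicate`, the `tolerance` and `min_confidence` parameters, and the linear confidence formula are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PanelResult:
    score: float        # combined score from the two blind raters
    confidence: float   # 1.0 means the raters agreed exactly
    route: str          # "auto" or "human_review"


def adjudicate(rater_a: int, rater_b: int, max_score: int,
               tolerance: int = 1, min_confidence: float = 0.7) -> PanelResult:
    """Combine two blind ratings and escalate when they disagree too much.

    All thresholds here are hypothetical defaults, not product settings.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not (0 <= rater_a <= max_score and 0 <= rater_b <= max_score):
        raise ValueError("scores must lie within the rubric range")

    gap = abs(rater_a - rater_b)
    # Assumed confidence model: falls linearly as the raters move apart.
    confidence = 1.0 - gap / max_score
    score = float(mean([rater_a, rater_b]))

    if gap <= tolerance and confidence >= min_confidence:
        route = "auto"
    else:
        route = "human_review"
    return PanelResult(score=score, confidence=confidence, route=route)
```

Under these assumed defaults, `adjudicate(4, 3, 5)` stays on the automatic path, while `adjudicate(5, 2, 5)` is routed to human review.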
Setup (stage 01)
  • Rubric Author: Item Writer / Rangefinder
  • Anchor Curator: Anchor Approver
  • Item Profiler: Content Specialist
Multimodal (stage 02)
  • Handwriting Reader: Diagrams & written work
  • Speech Reader: Oral responses
  • Experiment Vision: Lab Observer
  • Math Normalizer: Symbols & notation
Rater Panel (stage 03)
  • Rater A: Scorer 1 (strict)
  • Rater B: Scorer 2 (lenient)
  • Convention Rater: Language & Conventions
  • Process Rater: Step-by-step Reasoner
Adjudication (stage 04)
  • Critic: Devil's Advocate
  • Adjudicator: Table Leader
  • Calibrator: Chief Reader
QA & Fairness (stage 05)
  • Validity Injector: Calibration Master
  • Drift Detector: Score-Distribution Monitor
  • Bias Auditor: Fairness Reviewer
  • Backreader: QC Sampler
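
To make the QA stage more concrete, here is a minimal sketch of one way seeded validity papers could be used to flag scoring drift: compare the scores assigned mid-batch against the pre-agreed scores for the same papers. The function name `validity_drift`, the dictionary layout, and the `tolerance` value are hypothetical and chosen only for illustration.

```python
def validity_drift(assigned: dict[str, int], known: dict[str, int],
                   tolerance: float = 0.5) -> tuple[float, bool]:
    """Return (mean signed error, drift flag) over seeded validity papers.

    `assigned` maps paper id -> score given during the batch.
    `known` maps paper id -> pre-agreed score for that validity paper.
    A positive mean error suggests lenient scoring, a negative one harsh scoring.
    """
    shared = sorted(set(assigned) & set(known))
    if not shared:
        raise ValueError("no validity papers were scored in this batch")

    bias = sum(assigned[pid] - known[pid] for pid in shared) / len(shared)
    # Hypothetical rule: flag drift when the average error exceeds the tolerance.
    return bias, abs(bias) > tolerance
```

A mean signed error is deliberately simple; a real monitor would likely also watch the score distribution and per-subgroup behavior, as the roles above suggest.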

Multimodal scoring across what students actually do on a test.

Essay & short answer

writing evidence
  • Thesis & evidence chain
  • Rhetorical structure (claim → warrant → backing)
  • Domain-specific vocabulary recall
  • Cohesion, register, conventions
  • Legitimacy / off-task detection
Agents engaged
Rater A · Rater B · Convention Rater · Critic · Adjudicator
Essay · evidence-based response
rubric · 0–5 ECR
Item #482

The author argues that automated scoring is necessary because human capacity cannot scale with the new constructed-response volume. Two pieces of textual evidence are offered, but the second claim conflates cost with reliability, which the rubric treats as a partial-credit issue.

Development: 3 / 3
Conventions: 2 / 2
Overall: 4 / 5

Built for broad curriculum coverage, not a narrow essay demo.

A school or exam operator may start with one workflow and expand across academic, professional, vocational, language, and oral-response subjects. Below is a concrete sample of the spectrum.

Subject atlas

Coverage by submission type.

The useful question is not whether a subject is on a static list. It is whether the program can handle the work students submit and apply the right rubric, anchors, and review path.

72 sample labels · 6 families · 5 submission modes
Rubric + modality matrix

Where each family expands

sample pilot configuration
Written: essay · SCR · DBQ
Handwritten: steps · proofs
Visual: diagram · lab photo
Oral: speaking · listening
Structured: table · code · file
Language & writing
argument writing · source synthesis · literary analysis
essays, short response, conventions
scanned short answers
presentation rubrics
Mathematics
algebra · geometry proofs · statistics
explanations
worked steps, proofs
graphs, constructions
symbolic checks
Science & lab
physics · chemistry · biology labs
CER response
calculations
lab photos, diagrams
tables, graphs
Humanities & social science
history DBQ · geography · economics
argument, case analysis
paper booklet scans
maps, sources
Professional & vocational
law · medicine · teacher certification
scenario judgment
workplace artifacts
interview exams
forms, tables, logs
Arts, languages & oral exams
translation · debate · portfolio critique
reflection, commentary
portfolio evidence
speaking, listening
Example mix
Language & writing: 22%
Mathematics: 18%
Science & lab: 17%
Humanities & social science: 15%
Professional & vocational: 14%
Arts, languages & oral exams: 14%
Concrete subject labels
Language & writing
English composition · argument writing · source-based synthesis · literary analysis · reading short response · grammar and usage · ESL/EAL writing · Chinese composition · +4 more
Mathematics
arithmetic · pre-algebra · algebra · geometry proofs · trigonometry · precalculus · calculus · statistics · +4 more
Science & lab
biology · chemistry · physics · earth science · environmental science · lab notebooks · claim-evidence-reasoning · experimental design · +4 more
Humanities & social science
world history · US history · geography · economics · civics · psychology · sociology · philosophy · +4 more
Professional & vocational
business writing · accounting explanations · law hypotheticals · nursing scenarios · teacher certification · safety procedures · technical writing · coding explanations · +4 more
Arts, languages & oral exams
speaking tests · listening response · translation · interpretation · music theory · art critique · media studies · drama reflection · +4 more

Multi-language scoring and feedback

Rubrics, anchors, examples, and feedback can be localized. The goal is not merely translation; it is alignment to the scoring culture, language background, and classroom context of the program.

English · Chinese · Spanish · French · Japanese · Korean · Arabic · German · Portuguese · bilingual feedback · localized rubrics · regional examples

Fast cloud pilots, controlled on-prem scoring.

Different exams have different stakes. Evalysis supports both open-the-box cloud use and controlled on-prem deployments, with three onboarding modes that decide how much human alignment happens before scoring at scale.

Decision axis 01

Where it runs

Choose the deployment environment first. This determines data custody, integration boundaries, and operational controls.

Open the box

Cloud

Pilots, tutoring schools, formative scoring

The fastest path for pilots, tutoring schools, internal benchmarks, and formative feedback. Start in the Cloud Trial with a mock example, or upload student work first, then review the inferred setup before scoring.

Managed isolation

Private cloud / VPC

Districts, institutions, assessment operators

For districts, institutions, and assessment operators that need SSO, role controls, private storage, API integration, and stricter data boundaries.

High-stakes control

On-prem

Sensitive exams and local audit custody

A major option for high-stakes exams. Keep sensitive responses inside your network, run scoring locally, and retain customer-controlled audit artifacts.

Decision axis 02

How it aligns

After the environment is chosen, pick the onboarding path. This is about rubric alignment, teacher input, and confidence thresholds.

No training samples

Zero-shot

Evalysis reads the rubric and grades immediately. Best for quick pilots, low-stakes practice, and formative feedback where speed matters.

Teacher calibration

Human-in-loop

The system selects representative samples for teachers or scoring leaders to label, aligns to those decisions, then grades the rest with escalation for uncertain cases.

Formal alignment

Fine-tuned

For programs that need formal alignment. Tune on approved samples and anchors, then receive a report covering item behavior, agreement, confidence, fairness, and routing thresholds.
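
The alignment paths above refer to agreement, confidence thresholds, and escalation of uncertain cases. The sketch below shows one plausible way to compute two common agreement figures against teacher labels and to split a batch by a confidence threshold. The function names, the `confidence` key, and the threshold value are assumptions for illustration, not the product's API.

```python
def agreement_rates(model_scores: list[int], teacher_scores: list[int]) -> tuple[float, float]:
    """Return (exact agreement, adjacent agreement) between two score lists.

    Adjacent agreement counts pairs that differ by at most one score point.
    """
    if not model_scores or len(model_scores) != len(teacher_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(model_scores, teacher_scores))
    exact = sum(m == t for m, t in pairs) / len(pairs)
    adjacent = sum(abs(m - t) <= 1 for m, t in pairs) / len(pairs)
    return exact, adjacent


def split_by_confidence(responses: list[dict], threshold: float) -> tuple[list[dict], list[dict]]:
    """Partition responses into (auto-scored, escalated) by a confidence cutoff.

    Each response is assumed to be a dict with a numeric "confidence" key.
    """
    auto = [r for r in responses if r["confidence"] >= threshold]
    escalated = [r for r in responses if r["confidence"] < threshold]
    return auto, escalated


# Example usage with made-up data:
exact, adjacent = agreement_rates([3, 4, 2, 5], [3, 3, 2, 2])
auto, escalated = split_by_confidence(
    [{"id": "r1", "confidence": 0.92}, {"id": "r2", "confidence": 0.55}],
    threshold=0.8,
)
```

In a human-in-loop setup, the teacher-labeled samples would supply `teacher_scores`, and the threshold would be chosen so that escalated cases stay within what reviewers can handle.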

Get started

Run a pilot on your own scoring data.

Bring a rubric, a sample set, and the deployment constraints that matter. We return sample traces, alignment recommendations, and the reporting shape your scoring team can review before scale.

FERPA-aligned · data isolation
VPC / on-prem option
Audit-by-replay built in