CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

Mar 18, 2025

Yukyung Lee

Joonghoon Kim

Jaehee Kim

Hyowon Cho

Jaewook Kang

Pilsung Kang

Najoung Kim

Abstract

Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments, improving the average correlation by 0.10. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores are also more interpretable, because the framework decomposes evaluation criteria into traceable binary decisions, allowing analysis of the specific attributes driving quality judgments.
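To make the idea of decomposed binary questions concrete, here is a minimal sketch. It assumes that a checklist score is the fraction of questions answered "yes" and that agreement between two evaluators is the share of questions on which their answers match. Both choices, the function names, and the sample answers are illustrative assumptions, not the aggregation or agreement metric reported in the paper.

```python
from statistics import mean


def checklist_score(answers):
    """Return the fraction of checklist questions answered 'yes' (True).

    Assumption: simple unweighted averaging of binary answers.
    """
    if not answers:
        raise ValueError("checklist must contain at least one answer")
    return sum(1 for a in answers if a) / len(answers)


def pairwise_agreement(answers_a, answers_b):
    """Return the share of questions on which two evaluators agree.

    Assumption: raw percent agreement, chosen only for illustration.
    """
    if len(answers_a) != len(answers_b):
        raise ValueError("both evaluators must answer the same checklist")
    return mean(1.0 if x == y else 0.0 for x, y in zip(answers_a, answers_b))


# Hypothetical yes/no answers from two evaluator models on a 4-question checklist.
judge_a = [True, True, False, True]
judge_b = [True, False, False, True]

score_a = checklist_score(judge_a)
score_b = checklist_score(judge_b)
agreement = pairwise_agreement(judge_a, judge_b)
```

Because each question is a separate yes/no decision, a disagreement between evaluators can be traced to the specific question where their answers differ, which is the interpretability benefit the abstract describes.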

Type

Preprint

Publication

Extended version: Under Review | Workshop version: HEAL@CHI2024

Last updated on Mar 18, 2025

Authors

Yukyung Lee

Postdoctoral Associate