RadEval: A framework for radiology text evaluation


Why RadEval?
Radiology report generation has advanced rapidly with the advent of large language models (LLMs). However, evaluating the quality of the generated reports remains a complex challenge. Traditional metrics such as BLEU and ROUGE often fail to capture the clinical relevance and accuracy required in medical contexts. To address this gap, we introduce RadEval, a comprehensive framework for evaluating radiology texts with a diverse set of metrics that cover both linguistic quality and clinical accuracy.
Key Features of RadEval
- Diverse Metric Integration: RadEval consolidates a wide range of evaluation metrics, including classic n-gram overlap measures (BLEU, ROUGE), contextual embeddings (BERTScore), clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1), and advanced LLM-based evaluators (GREEN). This integration allows for a multifaceted assessment of radiology reports.
- Standardized Implementations: We have refined and standardized the implementations of these metrics to ensure consistency and reliability in evaluations across different studies and datasets.
- Domain-Specific Enhancements: RadEval extends the GREEN metric to support multiple imaging modalities using a more lightweight model. We have also pretrained a domain-specific radiology encoder that demonstrates strong zero-shot retrieval performance, further strengthening the framework's evaluation capabilities.
- Richly Annotated Expert Dataset: We provide a dataset annotated by radiology experts, containing over 450 clinically significant error labels. This dataset serves as a valuable resource for validating and benchmarking evaluation metrics against expert judgment.
- Statistical Testing Tools: RadEval includes tools for statistical significance testing, enabling researchers to assess whether differences between systems are meaningful and to compare models robustly (see the sketch after this list).
- Baseline Model Evaluations: The framework offers baseline evaluations across multiple publicly available datasets, facilitating reproducibility and benchmarking in radiology report generation research.
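
To illustrate the idea behind the statistical testing tools, the sketch below runs a generic paired permutation test on per-report metric scores from two systems. The function and variable names are illustrative only and are not RadEval's API; consult the repository for the framework's built-in comparison utilities.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-report scores.

    scores_a / scores_b: per-report metric scores for systems A and B on the
    same references (illustrative inputs, not a RadEval API).
    Returns a p-value for the observed difference in means.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())

    count = 0
    for _ in range(n_permutations):
        # Randomly flip which system each report's score is attributed to.
        signs = rng.choice([-1.0, 1.0], size=diffs.shape)
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)

# Toy example: per-report F1RadGraph scores for two systems (made-up numbers).
p = paired_permutation_test([0.62, 0.55, 0.71, 0.48], [0.58, 0.51, 0.69, 0.47])
print(f"p-value: {p:.3f}")
```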
Correlation with Radiologist Judgment
We conducted extensive experiments to assess how well different metrics correlate with radiologist judgment. Our findings indicate that certain metrics, particularly those incorporating clinical concepts and LLM-based evaluation, align more closely with expert assessments. This underscores the importance of using clinically informed metrics when evaluating radiology reports.
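
As a concrete illustration of how such alignment can be quantified, the snippet below computes a rank correlation (Kendall's tau) between a metric's per-report scores and expert ratings. The numbers are toy values for demonstration only, not results from the RadEval study.

```python
from scipy.stats import kendalltau

# Toy example: per-report metric scores and radiologist quality ratings
# (illustrative numbers only, not data from the RadEval paper).
metric_scores = [0.82, 0.64, 0.91, 0.55, 0.73]
expert_ratings = [4, 3, 5, 2, 4]  # e.g., 1 (poor) to 5 (excellent)

tau, p_value = kendalltau(metric_scores, expert_ratings)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```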
Getting Started with RadEval
To get started with RadEval, you can access the codebase and documentation on our GitHub repository. The repository includes installation instructions, usage examples, and guidelines for integrating RadEval into your evaluation pipeline. Additionally, the RadEval Expert Dataset is available for download, providing a rich resource for testing and validating evaluation metrics.
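
The sketch below shows what scoring a small batch of generated reports with RadEval might look like. It assumes the package exposes a RadEval evaluator class configured with per-metric boolean flags and called with lists of references and hypotheses; the flag names here are assumptions, so check the GitHub README for the exact installation command and API.

```python
# Minimal sketch of scoring generated reports with RadEval.
# Assumption: a RadEval class with per-metric flags, called on
# reference/hypothesis lists. See the repository README for the real API.
import json
from RadEval import RadEval

refs = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary abnormality.",
    "Cardiomegaly with trace pleural effusions bilaterally.",
]

# Enable only the metrics you need; heavier LLM-based metrics such as
# GREEN can be switched on separately (flag names are assumptions).
evaluator = RadEval(do_bleu=True, do_rouge=True, do_bertscore=True, do_radgraph=True)
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```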
Conclusion
RadEval represents a significant step forward in the evaluation of radiology texts, offering a comprehensive and standardized framework that addresses the unique challenges of this domain. By integrating diverse metrics, providing a richly annotated dataset, and facilitating robust benchmarking, RadEval aims to enhance the quality and reliability of radiology report generation research.
BibTeX
@misc{xu2025radevalframeworkradiologytext,
      title={RadEval: A framework for radiology text evaluation},
      author={Justin Xu and Xi Zhang and Javid Abderezaei and Julie Bauml and Roger Boodoo and Fatemeh Haghighi and Ali Ganjizadeh and Eric Brattain and Dave Van Veen and Zaiqiao Meng and David Eyre and Jean-Benoit Delbrouck},
      year={2025},
      eprint={2509.18030},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.18030},
}