RadEval: A framework for radiology text evaluation

Sep 22, 2025

Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck
Abstract
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics – from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder – demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
Type: Publication
Published in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
For the latest updates and details, visit the RadEval Project Website.

Why RadEval?

Radiology report generation has seen significant advancements with the advent of large language models (LLMs). However, evaluating the quality of these generated reports remains a complex challenge. Traditional metrics like BLEU and ROUGE often fall short in capturing the clinical relevance and accuracy required in medical contexts. To address this gap, we introduce RadEval, a comprehensive framework designed to evaluate radiology texts using a diverse set of metrics that encompass both linguistic quality and clinical accuracy.

Key Features of RadEval

  • Diverse Metric Integration: RadEval consolidates a wide range of evaluation metrics, including classic n-gram overlap measures (BLEU, ROUGE), contextual embeddings (BERTScore), clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1), and advanced LLM-based evaluators (GREEN). This integration allows for a multifaceted assessment of radiology reports (see the usage sketch after this list).
  • Standardized Implementations: We have refined and standardized the implementations of these metrics to ensure consistency and reliability in evaluations across different studies and datasets.
  • Domain-Specific Enhancements: RadEval extends the GREEN metric to support multiple imaging modalities using a more lightweight model. Additionally, we have pretrained a domain-specific radiology encoder that demonstrates strong zero-shot retrieval performance, enhancing the evaluation capabilities of the framework.
  • Richly Annotated Expert Dataset: We provide a dataset annotated by radiology experts, containing over 450 clinically significant error labels. This dataset serves as a valuable resource for validating and benchmarking evaluation metrics against expert judgment.
  • Statistical Testing Tools: RadEval includes tools for statistical testing, enabling researchers to assess the significance of their results and compare different models robustly.
  • Baseline Model Evaluations: The framework offers baseline evaluations across multiple publicly available datasets, facilitating reproducibility and benchmarking in radiology report generation research.
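Below is a minimal usage sketch of running several of these metrics at once. The constructor flags (do_bleu, do_rouge, do_bertscore, do_radgraph, do_chexbert), the import path, and the callable refs/hyps interface are assumptions modeled on common metric toolkits; consult the GitHub repository README for the exact API.

```python
# Minimal sketch of multi-metric evaluation with RadEval.
# The import path, constructor flags, and callable interface are assumptions;
# check the RadEval repository for the exact API.
from RadEval import RadEval
import json

refs = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary abnormality.",
    "Cardiomegaly with trace pleural effusions bilaterally.",
]

# Enable only the metrics you need; LLM-based evaluators such as GREEN are heavier
# and can be switched on separately once their model weights are available.
evaluator = RadEval(
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_radgraph=True,
    do_chexbert=True,
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```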

Correlation with Radiologist Judgment

We conducted extensive experiments to assess how different metrics correlate with radiologist judgment. Our findings indicate that certain metrics, particularly those incorporating clinical concepts and LLM-based evaluations, show a stronger alignment with expert assessments. This correlation underscores the importance of using clinically informed metrics in evaluating radiology reports.
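As a concrete illustration of this kind of analysis, the sketch below computes Kendall's tau between per-report metric scores and the number of expert-flagged errors. The numbers are placeholders, not data from the paper; a well-aligned metric should yield a strongly negative correlation, since more clinically significant errors should mean lower scores.

```python
# Illustrative correlation analysis between a metric and expert judgment.
# metric_scores and expert_error_counts are hypothetical placeholder values.
from scipy.stats import kendalltau

metric_scores = [0.82, 0.64, 0.91, 0.45, 0.73]   # per-report metric values (hypothetical)
expert_error_counts = [1, 3, 0, 5, 2]            # expert-flagged errors per report (hypothetical)

# A metric aligned with expert judgment should decrease as error counts grow,
# i.e. tau should be strongly negative.
tau, p_value = kendalltau(metric_scores, expert_error_counts)
print(f"Kendall tau = {tau:.3f}, p = {p_value:.3f}")
```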

Getting Started with RadEval

To get started with RadEval, you can access the codebase and documentation on our GitHub repository. The repository includes installation instructions, usage examples, and guidelines for integrating RadEval into your evaluation pipeline. Additionally, the RadEval Expert Dataset is available for download, providing a rich resource for testing and validating evaluation metrics.
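As an example of the statistical testing mentioned above, the following sketch implements a paired sign-flip permutation test over per-report scores from two systems. It illustrates the kind of comparison RadEval's testing tools support rather than the framework's exact interface; all scores are hypothetical.

```python
# Illustrative paired permutation (sign-flip) test for comparing two
# report-generation systems on per-report metric scores.
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    # Randomly flip the sign of each per-report difference under the null
    # hypothesis that the two systems are exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    permuted = (signs * diffs).mean(axis=1)
    p_value = np.mean(np.abs(permuted) >= np.abs(observed))
    return observed, p_value

# Hypothetical per-report scores for two systems under the same metric.
system_a = [0.71, 0.58, 0.83, 0.49, 0.77, 0.66]
system_b = [0.65, 0.55, 0.80, 0.51, 0.70, 0.61]
delta, p = paired_permutation_test(system_a, system_b)
print(f"mean difference = {delta:.3f}, p = {p:.3f}")
```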

Conclusion

RadEval represents a significant step forward in the evaluation of radiology texts, offering a comprehensive and standardized framework that addresses the unique challenges of this domain. By integrating diverse metrics, providing a richly annotated dataset, and facilitating robust benchmarking, RadEval aims to enhance the quality and reliability of radiology report generation research.

BibTeX

@misc{xu2025radevalframeworkradiologytext,
      title={RadEval: A framework for radiology text evaluation}, 
      author={Justin Xu and Xi Zhang and Javid Abderezaei and Julie Bauml and Roger Boodoo and Fatemeh Haghighi and Ali Ganjizadeh and Eric Brattain and Dave Van Veen and Zaiqiao Meng and David Eyre and Jean-Benoit Delbrouck},
      year={2025},
      eprint={2509.18030},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.18030}, 
}