Automated Chest X-ray Report Generation Remains Unsolved

Dec 22, 2025

Xiaoman Zhang, Julian Nicolas Acosta, Xiaoli Yang, Subathra Adithan, Luyang Luo, Hong-Yu Zhou, Joshua Miller, Ouwen Huang, Zongwei Zhou, Ibrahim Ethem Hamamci, Shruthi Bannur, Kenza Bouzid, Xi Zhang, Zaiqiao Meng, Aaron Nicolson, Bevan Koopman, Inhyeok Baek, Hanbin Ko, Mercy Prasanna Ranjit, Shaury Srivastav, Sriram Gnana Sambanthan, Pranav Rajpurkar
Abstract
Accurate interpretation of chest radiographs and generation of narrative reports are essential for patient care but place a heavy burden on radiologists and clinical experts. While AI models for automated report generation show promise, standardized evaluation frameworks remain limited. Here we present the ReXrank Challenge V1.0, a competition on chest radiograph report generation built on ReXGradient, the largest test dataset to date, comprising 10,000 studies from 67 healthcare sites. The challenge attracted diverse participants from academic institutions, industry, and independent research teams, resulting in 8 new submissions evaluated alongside 16 previously benchmarked state-of-the-art models. Through comprehensive evaluation with multiple metrics, we analyzed model performance along several dimensions: differences between normal and abnormal studies, generalization across healthcare sites, and error rates in identifying clinical findings. The benchmark reveals that automated chest X-ray report generation remains fundamentally unsolved: even top-performing models achieve less than 45% error-free reporting on abnormal cases, performance gaps between normal and abnormal studies are large, and results vary substantially across healthcare institutions, indicating that robust, clinically ready systems require continued development before widespread deployment.
Publication
Pacific Symposium on Biocomputing 2026
For the latest updates and details, visit the ReXrank Challenge Website.

Why this work?

Automated chest X-ray report generation has the potential to substantially reduce radiologist workload and improve clinical efficiency. However, despite rapid progress in vision-language models and large language models (LLMs), the true clinical reliability of these systems remains unclear. Existing evaluations are often inconsistent, rely on limited test sets, or fail to probe generalization across institutions and clinically challenging abnormal cases.

To address these limitations, we present a large-scale, standardized benchmark study through the ReXrank Challenge V1.0, designed to rigorously assess the current state of automated chest X-ray report generation under realistic and clinically meaningful conditions.

What is the ReXrank Challenge V1.0?

The ReXrank Challenge V1.0 is a comprehensive evaluation effort built on ReXGradient, the largest test-only dataset to date for radiology report generation, comprising 10,000 studies from 67 healthcare institutions. The challenge brought together submissions from academia and industry, evaluating 8 new models alongside 16 previously benchmarked state-of-the-art systems under a unified evaluation protocol.

All models were assessed using a diverse set of metrics, ranging from traditional text similarity measures to clinically grounded and LLM-based error detection metrics, enabling a multi-dimensional analysis of model performance.
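
To make the distinction concrete, here is a minimal Python sketch contrasting a simple lexical overlap score with a finding-level clinical score on one report pair. The report texts, the finding labels, and both scoring functions are illustrative assumptions for this post, not the ReXrank evaluation code.

```python
# Illustrative sketch only -- not the ReXrank evaluation pipeline.
# It contrasts a bag-of-words lexical score with an F1 over positive finding
# labels (as a CheXbert-style labeler might extract them) on one made-up pair.

import re
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Bag-of-words F1: a stand-in for lexical metrics such as BLEU or ROUGE."""
    ref = Counter(re.findall(r"[a-z]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z]+", candidate.lower()))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def finding_f1(ref_findings: set, gen_findings: set) -> float:
    """F1 over positive finding labels: a stand-in for clinically grounded metrics."""
    if not ref_findings and not gen_findings:
        return 1.0  # both reports describe a normal study
    overlap = len(ref_findings & gen_findings)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_findings)
    recall = overlap / len(ref_findings)
    return 2 * precision * recall / (precision + recall)


reference = "Stable small right pleural effusion. No pneumothorax. Cardiomediastinal silhouette is normal."
generated = "No pleural effusion or pneumothorax. Cardiomediastinal silhouette is normal."

# Hypothetical finding labels, as a labeler might extract them from each report.
ref_findings, gen_findings = {"pleural effusion"}, set()

print(f"lexical token F1:    {token_f1(reference, generated):.2f}")          # high (~0.8): wording overlaps
print(f"clinical finding F1: {finding_f1(ref_findings, gen_findings):.2f}")  # 0.00: the effusion is omitted
```

A generated report can thus look nearly perfect to a lexical metric while omitting a clinically significant finding, which mirrors the disagreement between metric families discussed in the Key Findings below.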

Key Findings

  • Automated chest X-ray report generation remains fundamentally unsolved.
    Even the best-performing models achieve less than 45% error-free reporting on abnormal studies, highlighting a substantial gap between current AI systems and clinical readiness.

  • Large performance gaps between normal and abnormal studies.
    Most models perform well on normal cases (often exceeding 80–90% no-significant-error rates) but struggle significantly with abnormal findings, where clinically important errors are common.

  • Poor cross-institutional generalization.
    Model rankings vary dramatically across healthcare sites, indicating that strong performance on one institution does not reliably transfer to others.

  • Evaluation metrics capture complementary but inconsistent signals.
    Traditional lexical metrics correlate poorly with LLM-based and clinically focused error metrics, underscoring the need for more clinically aligned evaluation frameworks (see the rank-correlation sketch after this list).
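
Below is a small hypothetical sketch of how such disagreement can be quantified, using a Spearman rank correlation between the model rankings produced by two metrics. The model scores are invented for illustration; the paper reports the actual per-metric results.

```python
# Hypothetical sketch: Spearman rank correlation between model rankings under
# two metrics. All scores below are invented for illustration.

def spearman(xs, ys):
    """Spearman correlation for score lists without ties (rank-based)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented scores for five hypothetical models under a lexical metric and a
# clinical error metric; note that the two orderings disagree.
lexical_scores  = [0.41, 0.39, 0.37, 0.35, 0.33]
clinical_scores = [0.52, 0.61, 0.48, 0.63, 0.50]

print(f"Spearman rank correlation: {spearman(lexical_scores, clinical_scores):.2f}")  # 0.10: weak agreement
```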

Implications

Our results suggest that current AI systems are not yet suitable for fully autonomous radiology reporting, particularly for abnormal cases. However, they show promise as assistive tools for generating preliminary drafts that can be reviewed and refined by radiologists. The findings emphasize the importance of developing:

  • methods specifically targeting abnormality detection,
  • strategies for robust cross-institutional generalization, and
  • evaluation frameworks that better reflect clinical correctness rather than surface-level textual similarity.

Conclusion

This work provides the most comprehensive benchmark to date for automated chest X-ray report generation and delivers a clear message to the community: while progress is evident, significant challenges remain before clinically reliable deployment is possible. By releasing the ReXrank Challenge V1.0 and ReXGradient dataset, we aim to establish a rigorous foundation for future research and to drive the development of more robust, clinically grounded radiology AI systems.

BibTeX

@inproceedings{zhang2025automated,
  title={Automated Chest X-ray Report Generation Remains Unsolved},
  author={Zhang, Xiaoman and Acosta, Julian Nicolas and Yang, Xiaoli and Adithan, Subathra and Luo, Luyang and Zhou, Hong-Yu and Miller, Joshua and Huang, Ouwen and Zhou, Zongwei and Hamamci, Ibrahim Ethem and others},
  booktitle={Biocomputing 2026: Proceedings of the Pacific Symposium},
  pages={236--250},
  year={2025},
  organization={World Scientific}
}