Automated Chest X-ray Report Generation Remains Unsolved
Why this work?
Automated chest X-ray report generation has the potential to substantially reduce radiologist workload and improve clinical efficiency. However, despite rapid progress in vision-language models and large language models (LLMs), the true clinical reliability of these systems remains unclear. Existing evaluations are often inconsistent, rely on limited test sets, or fail to probe generalization across institutions and clinically challenging abnormal cases.
To address these limitations, we present a large-scale, standardized benchmark study through the ReXrank Challenge V1.0, designed to rigorously assess the current state of automated chest X-ray report generation under realistic and clinically meaningful conditions.
What is the ReXrank Challenge V1.0?
The ReXrank Challenge V1.0 is a comprehensive evaluation effort built on ReXGradient, the largest test-only dataset to date for radiology report generation, comprising 10,000 studies from 67 healthcare institutions. The challenge brought together submissions from academia and industry, evaluating 8 new models alongside 16 previously benchmarked state-of-the-art systems under a unified evaluation protocol.
All models were assessed using a diverse set of metrics, ranging from traditional text similarity measures to clinically grounded and LLM-based error detection metrics, enabling a multi-dimensional analysis of model performance.
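The challenge's exact evaluation pipeline is not reproduced here; the sketch below is only a rough illustration of scoring each (reference, generated) report pair along two axes. It assumes Python with nltk's BLEU as the lexical measure, and `clinical_error_score` is a hypothetical placeholder standing in for a clinically grounded or LLM-based error metric (e.g., RadGraph F1, CheXbert agreement, or an LLM judge).

```python
# Illustrative multi-metric evaluation loop (not the challenge's official code).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def lexical_score(reference: str, candidate: str) -> float:
    """Traditional text-similarity score: smoothed BLEU-4 on whitespace tokens."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)


def clinical_error_score(reference: str, candidate: str) -> float:
    """Hypothetical placeholder: replace with a clinically grounded metric
    (e.g., RadGraph F1) or an LLM-based significant-error count."""
    return 0.0  # stub value so the sketch runs end to end


def evaluate(pairs):
    """Score each (reference, generated) report pair on both axes."""
    return [
        {"lexical": lexical_score(ref, gen),
         "clinical": clinical_error_score(ref, gen)}
        for ref, gen in pairs
    ]


if __name__ == "__main__":
    demo = [("No acute cardiopulmonary process.",
             "No acute cardiopulmonary abnormality.")]
    print(evaluate(demo))
```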
Key Findings
Automated chest X-ray report generation remains fundamentally unsolved.
Even the best-performing models produce error-free reports for fewer than 45% of abnormal studies, highlighting a substantial gap between current AI systems and clinical readiness.
Large performance gaps between normal and abnormal studies.
Most models perform well on normal cases (often exceeding 80–90% no-significant-error rates) but struggle significantly with abnormal findings, where clinically important errors are common.
Poor cross-institutional generalization.
Model rankings vary dramatically across healthcare sites, indicating that strong performance at one institution does not reliably transfer to others.
Evaluation metrics capture complementary but inconsistent signals.
Traditional lexical metrics correlate poorly with LLM-based and clinically focused error metrics, underscoring the need for more clinically aligned evaluation frameworks.
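One way to quantify such disagreement is a rank correlation between per-model scores under a lexical metric and under a clinical metric. A minimal sketch, assuming scipy and using made-up model names and scores (not results from the paper):

```python
# Rank agreement between two evaluation metrics across models (illustrative data).
from scipy.stats import spearmanr

# Hypothetical per-model scores under a lexical metric and a clinical metric.
lexical = {"model_a": 0.31, "model_b": 0.28, "model_c": 0.35, "model_d": 0.22}
clinical = {"model_a": 0.44, "model_b": 0.52, "model_c": 0.40, "model_d": 0.47}

models = sorted(lexical)
rho, pval = spearmanr([lexical[m] for m in models],
                      [clinical[m] for m in models])
# A rho near zero (or negative) would indicate the two metrics rank models differently.
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
```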
Implications
Our results suggest that current AI systems are not yet suitable for fully autonomous radiology reporting, particularly for abnormal cases. However, they show promise as assistive tools for generating preliminary drafts that can be reviewed and refined by radiologists. The findings emphasize the importance of developing:
- methods specifically targeting abnormality detection,
- strategies for robust cross-institutional generalization, and
- evaluation frameworks that better reflect clinical correctness rather than surface-level textual similarity.
Conclusion
This work provides the most comprehensive benchmark to date for automated chest X-ray report generation and delivers a clear message to the community: while progress is evident, significant challenges remain before clinically reliable deployment is possible. By releasing the ReXrank Challenge V1.0 and ReXGradient dataset, we aim to establish a rigorous foundation for future research and to drive the development of more robust, clinically grounded radiology AI systems.
BibTeX
@inproceedings{zhang2025automated,
  title={Automated Chest X-ray Report Generation Remains Unsolved},
  author={Zhang, Xiaoman and Acosta, Julian Nicolas and Yang, Xiaoli and Adithan, Subathra and Luo, Luyang and Zhou, Hong-Yu and Miller, Joshua and Huang, Ouwen and Zhou, Zongwei and Hamamci, Ibrahim Ethem and others},
  booktitle={Biocomputing 2026: Proceedings of the Pacific Symposium},
  pages={236--250},
  year={2025},
  organization={World Scientific}
}