Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align pre-trained vision encoders with large language models to enhance visual-language understanding, most existing methods rely on single-image analysis or rule-based heuristics to process multiple images, failing to fully leverage the temporal information available in multimodal medical datasets. In this paper, we introduce Libra, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (TAC), designed to accurately capture and integrate temporal differences between paired current and prior images. Extensive experiments on the MIMIC-CXR dataset demonstrate that Libra establishes a new state of the art among similarly scaled MLLMs, setting new standards in both clinical relevance and lexical accuracy.
Explore the capabilities of Libra with our interactive demo.
Temporal hallucination is a critical challenge in radiology report generation (RRG). Existing multimodal large language models (MLLMs) struggle to integrate prior images correctly, often producing hallucinated or inconsistent references to prior studies and to how findings have changed over time.
Libra addresses these limitations with a Temporal Alignment Connector (TAC) that improves temporal awareness, ensuring that differences between the current and prior images are accurately captured and reflected in the generated report.
We propose Libra (Leveraging Temporal Images for Biomedical Radiology Analysis), a novel framework for radiology report generation (RRG) that incorporates temporal change information to interpret medical images more effectively.
Libra leverages RAD-DINO, a pre-trained vision transformer, as its image encoder to produce robust and scalable image features. These features are further refined by the Temporal Alignment Connector (TAC), the key innovation in Libra's architecture, which comprises a Layerwise Feature Extractor (LFE) that gathers high-granularity representations across the encoder layers and a Temporal Fusion Module (TFM) that integrates temporal information from the prior study.
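For illustration, the following is a minimal PyTorch sketch of how such a temporal connector could be structured, assuming it receives per-layer hidden states for the current and prior images; the class name, layer sizes, and attention-based fusion are illustrative assumptions, not the released Libra implementation.

```python
# Illustrative sketch of a temporal alignment connector (assumptions, not the
# released Libra code): layerwise pooling of encoder hidden states followed by
# cross-attention from current-image tokens to prior-image tokens.
import torch
import torch.nn as nn

class TemporalAlignmentConnectorSketch(nn.Module):
    def __init__(self, num_layers: int, vis_dim: int, llm_dim: int):
        super().__init__()
        # Layerwise feature extraction: learnable weights pool the hidden
        # states of every encoder layer into one set of visual tokens.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Temporal fusion: current-image tokens attend to prior-image tokens
        # so the output encodes what has changed between the two studies.
        self.temporal_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        # Projection into the language model's embedding space.
        self.proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def pool_layers(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, patches, vis_dim)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        return (w * hidden_states).sum(dim=0)            # (batch, patches, vis_dim)

    def forward(self, cur_states: torch.Tensor, prior_states: torch.Tensor) -> torch.Tensor:
        cur, prior = self.pool_layers(cur_states), self.pool_layers(prior_states)
        fused, _ = self.temporal_attn(query=cur, key=prior, value=prior)
        return self.proj(cur + fused)                    # (batch, patches, llm_dim)
```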
These refined features are fed into Meditron, a specialised medical large language model (LLM), to generate comprehensive, temporally aware radiology reports. Libra's modular design seamlessly integrates state-of-the-art open-source pre-trained models for both image and text, aligning them through a temporal-aware adapter to ensure robust cross-modal reasoning and understanding.
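As a rough sketch of how these components could be wired together (reusing `TemporalAlignmentConnectorSketch` from above), one might load the public RAD-DINO and Meditron checkpoints from Hugging Face and prepend the connector's output to the LLM's text embeddings. The helper names `encode` and `build_inputs` are hypothetical, and the snippet simplifies the actual Libra pipeline.

```python
# Simplified wiring of encoder -> connector -> LLM (assumptions, not the
# released Libra code). Checkpoint IDs are the public RAD-DINO and Meditron
# releases, not the trained Libra weights.
import torch
from transformers import AutoImageProcessor, AutoModel, AutoModelForCausalLM, AutoTokenizer

processor = AutoImageProcessor.from_pretrained("microsoft/rad-dino")
encoder = AutoModel.from_pretrained("microsoft/rad-dino")
tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-7b")
llm = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-7b")

connector = TemporalAlignmentConnectorSketch(
    num_layers=encoder.config.num_hidden_layers + 1,  # hidden states include the embedding layer
    vis_dim=encoder.config.hidden_size,
    llm_dim=llm.config.hidden_size,
)

@torch.no_grad()
def encode(pil_image):
    """Stack all encoder hidden states for one image: (num_layers, 1, patches, vis_dim)."""
    pixels = processor(images=pil_image, return_tensors="pt").pixel_values
    out = encoder(pixel_values=pixels, output_hidden_states=True)
    return torch.stack(out.hidden_states)

def build_inputs(current_img, prior_img, prompt: str) -> torch.Tensor:
    """Prepend temporally fused visual tokens to the prompt embeddings."""
    visual = connector(encode(current_img), encode(prior_img))   # (1, patches, llm_dim)
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_emb = llm.get_input_embeddings()(text_ids)              # (1, tokens, llm_dim)
    return torch.cat([visual, text_emb], dim=1)  # pass as inputs_embeds to llm.generate
```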
Through a two-stage training strategy, Libra demonstrates the potential of multimodal large language models (MLLMs) in specialised radiology applications. Extensive experiments on the MIMIC-CXR dataset show that Libra sets a new state-of-the-art benchmark among models of the same parameter scale.
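One plausible way to realise such a two-stage schedule is sketched below, continuing the wiring above: stage 1 trains only the connector for visual-text alignment while the encoder and LLM stay frozen, and stage 2 additionally unfreezes the LLM for report generation. This follows common MLLM practice and may differ from the exact recipe in the paper.

```python
# Hedged sketch of a two-stage schedule (continuing the objects defined above);
# the exact set of trainable modules per stage is an assumption.
def set_stage(stage: int) -> None:
    for p in encoder.parameters():    # image encoder stays frozen throughout
        p.requires_grad = False
    for p in connector.parameters():  # the TAC is trained in both stages
        p.requires_grad = True
    for p in llm.parameters():        # the LLM is only updated in stage 2
        p.requires_grad = (stage == 2)

set_stage(1)  # stage 1: temporal alignment pre-training on image-report pairs
# ... run stage-1 optimisation ...
set_stage(2)  # stage 2: fine-tuning for downstream report generation
```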
@misc{zhang2024libraleveragingtemporalimages,
      title={Libra: Leveraging Temporal Images for Biomedical Radiology Analysis},
      author={Xi Zhang and Zaiqiao Meng and Jake Lever and Edmond S. L. Ho},
      year={2024},
      eprint={2411.19378},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19378},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.