CCD: Mitigating Hallucinations in Radiology MLLMs via
Clinical Contrastive Decoding


Information Retrieval Group
AI4BioMed Lab
School of Computing Science, University of Glasgow, UK

🔥[NEWS!]
[30 Sep 2025] 🗂️ Processed test data for the MIMIC-CXR, IU-Xray, and CheXpert Plus RRG tasks, as well as Medical-CXR-VQA, are now available on Hugging Face Collections.
[27 Sep 2025] ⛳ Our preprint is now live on arXiv — check it out for details.

Abstract

Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accurate, image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD applies a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.

“It’s better to be roughly right than precisely wrong.”

Carveth Read
Logic: Deductive and Inductive

Why CCD?

Clinical hallucination cases mitigated by CCD

Radiology MLLMs remain vulnerable to prompt-induced hallucinations when clinical sections contain counterfactual details or ambiguous guidance. The figure above contrasts baseline predictions with CCD-enabled outputs across report generation and question answering tasks. Red highlights mark unsupported findings that can compromise patient care, while blue text reflects misleading prompt context the model must resist.

Clinical Contrastive Decoding mitigates these risks at inference time. Rather than retraining models or relying on retrieval corpora, CCD injects trustworthy, image-grounded signals distilled from specialist expert models. The result is a decoding policy that maintains fluency yet stays faithful to the radiograph.

Overview

Clinical Contrastive Decoding (CCD) is a plug-and-play inference framework designed to reduce medical hallucinations in radiology MLLMs. It introduces structured clinical supervision from expert models (e.g., DenseNet or MedSigLIP) at decoding time, without modifying model weights or requiring external retrieval.

Given a chest radiograph, the expert model predicts symptom-level probabilities across the 14 CheXpert categories. CCD integrates this signal through a dual-stage logit refinement strategy, illustrated in the sketch after this list:

  • Symptom-grounded Contrastive Decoding (SCD): constructs an anchor prompt using high-confidence findings (e.g., “Atelectasis, Cardiomegaly”) and generates a contrastive logit path conditioned on this prompt. The final logits are a weighted interpolation between the anchor-conditioned and original paths, encouraging the model to mention supported findings and suppress false negatives.
  • Expert-informed Contrastive Decoding (ECD): transforms expert probabilities into token-level logit biases via log-odds conversion. These biases are injected into the logits from the first stage, softly penalising unsupported findings and reducing false positives.
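
The sketch below gives a minimal, hypothetical rendering of this dual-stage refinement for a single decoding step in PyTorch. It is not the official implementation: the names (ccd_logits, build_anchor_prompt, finding_token_ids) are our own, and mapping each finding to a fixed set of vocabulary token ids is a simplifying assumption.

import math
import torch

def build_anchor_prompt(expert_probs, threshold=0.5):
    # Keep high-confidence findings for the anchor prompt, e.g.
    # {"Atelectasis": 0.92, "Cardiomegaly": 0.81, "Pneumothorax": 0.03}
    # -> "Atelectasis, Cardiomegaly".
    return ", ".join(f for f, p in expert_probs.items() if p >= threshold)

def ccd_logits(base_logits, anchor_logits, expert_probs, finding_token_ids,
               alpha=0.5, beta=0.5, eps=1e-6):
    # base_logits:       (vocab,) logits from the original prompt
    # anchor_logits:     (vocab,) logits conditioned on the anchor prompt
    # expert_probs:      finding name -> expert probability in [0, 1]
    # finding_token_ids: finding name -> vocabulary ids naming that finding

    # Stage 1 (SCD): weighted interpolation between the anchor-conditioned
    # and the original decoding path.
    logits = (1.0 - alpha) * base_logits + alpha * anchor_logits

    # Stage 2 (ECD): convert each expert probability to a log-odds bias and
    # add it to the tokens naming that finding; p < 0.5 yields a negative
    # bias that softly penalises unsupported findings.
    for finding, p in expert_probs.items():
        bias = math.log((p + eps) / (1.0 - p + eps))
        for tok in finding_token_ids.get(finding, []):
            logits[tok] += beta * bias
    return logits

In practice, the two logit vectors would come from two forward passes of the same MLLM at each decoding step, one with and one without the anchor prompt appended.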

This two-stage mechanism enables CCD to progressively guide generation with both symbolic supervision (via anchor prompts) and probabilistic constraints (via confidence scores), achieving robust improvements in both report generation and VQA. CCD is fully model-agnostic and integrates seamlessly with state-of-the-art radiology MLLMs such as MAIRA-2, Libra, LLaVA-Rad, and LLaVA-Med.

Unlike prior contrastive decoding approaches that rely on perturbed visual or textual inputs, CCD leverages clinically grounded signals from expert models to provide task-specific and symptom-level control during generation.

Key Contributions

  • Empirical Insight: We systematically analyse prompt-induced hallucinations in radiology MLLMs and demonstrate that noisy clinical sections—such as irrelevant or contradictory prompts—can trigger unsupported findings across multiple datasets.
  • Inference-time Framework: We propose Clinical Contrastive Decoding (CCD), a dual-stage inference strategy that incorporates expert-derived labels as anchor prompts and applies probabilistic logit adjustments. CCD requires no retraining, architectural changes, or external retrieval modules.
  • Consistent Gains: Extensive experiments on MIMIC-CXR, IU-Xray, and CheXpert Plus show that CCD improves RadGraph-F1 by up to 17% and enhances VQA accuracy, all without modifying model weights or architecture.

Empirical Analyses

To understand how hallucinations arise in radiology MLLMs, we conduct a systematic study on the MIMIC-CXR dataset, evaluating how different clinical sections affect report generation. The table below quantifies the impact of appending specific sections (e.g., Indication, Technique, Comparison) on both lexical and clinical metrics.

Medical Hallucination Analysis Table

Hallucination Drivers: Clinical Context Sensitivity. Our results show that appending different clinical sections leads to inconsistent—and sometimes harmful—effects on generation. For example, adding History or Technique can slightly improve fluency due to stylistic overlap with report narratives. However, Comparison consistently harms performance across all metrics (e.g., BERTScore ↓ 8.12), as it often references prior images or temporal changes that are absent in the current input. This mismatch between prompt and image causes the model to hallucinate unsupported findings.

Clinically, we observe that the inclusion of misleading or overly strong prompts can cause the model to overlook subtle image features (e.g., Pleural Effusion, Atelectasis), leading to both over-diagnosis and under-detection. These findings highlight the limitations of naïvely using full report context as generation input.

Motivation for CCD: These empirical observations point to the need for a decoding-time solution that selectively integrates structured, image-grounded signals rather than relying on potentially noisy clinical text prompts. Our proposed CCD framework addresses this by incorporating expert-derived labels and probabilities to guide generation in a controlled and clinically consistent manner.

Experimental Results on RRG

Additional Experimental Results on RRG

Experimental Results on VQA

Ablation Studies

Why Balanced Guidance?

Effect of guidance strength

The effectiveness of CCD relies on the balance between two guidance signals:

  • α (Symptom-grounded guidance): influences the model via expert-derived anchor prompts.
  • β (Expert-informed guidance): adjusts token-level logits using expert probability scores.

As shown in the figure above, RadGraph-F1 reaches its peak when both α and β are set to 0.5, demonstrating the value of balanced guidance. Excessive reliance on either signal can lead to overcorrection or verbosity.

CCD benefits most from combining symbolic prompts and probabilistic constraints—striking the right trade-off between precision and fluency in radiology report generation.
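
As a toy illustration of this balanced setting, reusing the hypothetical ccd_logits sketch from the Overview (the tensors, probabilities, and token ids below are made up):

import torch

torch.manual_seed(0)
vocab_size = 8

# Made-up single-step logits for the original and anchor-conditioned paths.
base_logits = torch.randn(vocab_size)
anchor_logits = torch.randn(vocab_size)

# Made-up expert outputs: one likely finding, one unlikely finding.
expert_probs = {"Atelectasis": 0.9, "Pneumothorax": 0.1}
finding_token_ids = {"Atelectasis": [3], "Pneumothorax": [5]}

# alpha = beta = 0.5 matches the best-performing setting in the ablation.
refined = ccd_logits(base_logits, anchor_logits, expert_probs,
                     finding_token_ids, alpha=0.5, beta=0.5)
print(refined)  # token 3 is boosted, token 5 is suppressed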

BibTeX

@misc{zhang2025ccdmitigatinghallucinationsradiology,
      title={CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding}, 
      author={Xi Zhang and Zaiqiao Meng and Jake Lever and Edmond S. L. Ho},
      year={2025},
      eprint={2509.23379},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.23379}, 
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.