MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Xiangxiang Cui¹, Tianjin Huang², Yifang Wang³, Lijie Hu⁴, and Lu Yin⁵

¹ Beijing Normal University, China
² University of Exeter, United Kingdom
³ University College London, United Kingdom
⁴ Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
⁵ University of Surrey, United Kingdom

l.yin@surrey.ac.uk

This work received an early accept at MICCAI 2026 (top 9%).

MedFM-Robust Framework

Fig. 1. (a) Overview of our robustness evaluation framework. We generate SSIM-calibrated perturbations across five severity levels, combining base corruptions with modality-specific artifacts. We benchmark three Med-VLMs and two SAM-based segmentation models under a unified protocol, and investigate multiple fine-tuning strategies across VQA, captioning, visual grounding, and segmentation tasks.
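The SSIM calibration mentioned in the caption can be illustrated with a small sketch. This is not the authors' implementation: the per-level SSIM targets in `TARGETS`, the choice of additive Gaussian noise, the synthetic image, and the single-window `global_ssim` (a simplification of the usual sliding-window SSIM) are all assumptions made for illustration. The idea is to search for the perturbation strength whose SSIM against the clean image matches a fixed target, so that one severity level means comparable visual degradation across perturbation types.

```python
import random


def global_ssim(x, y, data_range=1.0):
    """SSIM computed over the whole image as one window (a simplification)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def add_noise(x, noise, sigma):
    """Scale a fixed noise pattern by sigma and add it to the image."""
    return [a + sigma * e for a, e in zip(x, noise)]


def calibrate_sigma(x, noise, target, lo=0.0, hi=10.0, iters=50):
    """Bisection: find sigma whose SSIM against the clean image equals target."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if global_ssim(x, add_noise(x, noise, mid)) > target:
            lo = mid  # still too similar, so increase the noise
        else:
            hi = mid
    return (lo + hi) / 2


rng = random.Random(0)
image = [rng.random() for _ in range(32 * 32)]  # synthetic stand-in for a scan
noise = [rng.gauss(0.0, 1.0) for _ in image]

# Hypothetical SSIM targets for severity levels 1-5 (not taken from the paper).
TARGETS = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5}
calibrated = {lvl: calibrate_sigma(image, noise, t) for lvl, t in TARGETS.items()}

for lvl, sigma in calibrated.items():
    achieved = global_ssim(image, add_noise(image, noise, sigma))
    print(f"level {lvl}: sigma={sigma:.4f}, SSIM={achieved:.4f}")
```

The same search applies to any perturbation with a single strength parameter. For real images, a sliding-window SSIM such as `skimage.metrics.structural_similarity` would be the more usual choice.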


Key Results

Fig. 2. (a) Traditional clean-image evaluation pipeline. (b) Our robustness benchmark applies modality-adaptive perturbations before the encoder and evaluates both Med-VLM tasks and segmentation under matched settings. (c) Metrics comparison: IoU drop closely tracks Dice drop, and representative fatal perturbations cause increasing performance degradation with higher severity levels.
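The observation in (c) that IoU drop tracks Dice drop is consistent with how the two metrics are related: for a single prediction/ground-truth pair, Dice = 2·IoU / (1 + IoU), a monotone mapping. The sketch below checks this on toy one-dimensional masks, which are made up for illustration and are not benchmark data. The identity holds per mask pair and only approximately for dataset-level averages.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:  # both masks empty: treat as a perfect match
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + target_area)


# Toy masks (illustrative only).
target = [0, 1, 1, 1, 1, 0, 0, 0]
clean_pred = [0, 1, 1, 1, 1, 1, 0, 0]
perturbed_pred = [0, 0, 1, 1, 0, 1, 1, 0]

clean_iou, clean_dice = iou_and_dice(clean_pred, target)
pert_iou, pert_dice = iou_and_dice(perturbed_pred, target)

print(f"clean:     IoU={clean_iou:.3f}  Dice={clean_dice:.3f}")
print(f"perturbed: IoU={pert_iou:.3f}  Dice={pert_dice:.3f}")
print(f"IoU drop={clean_iou - pert_iou:.3f}  Dice drop={clean_dice - pert_dice:.3f}")
```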


40 Perturbation Types
8 Imaging Modalities
5 VLMs Evaluated
2 Seg. Models
5 Fine-tuning Strategies
5 Severity Levels

Abstract

Medical foundation models have achieved remarkable clinical performance, yet their robustness under real-world perturbations remains underexplored. We present a robustness benchmark comprising 40 perturbation types (12 base, 28 medical-specific) across eight imaging modalities, evaluating five VLMs (LLaVA-Med, MedGemma, MedGemma-1.5, Gemini-2.5-flash and GPT-4o-mini) on VQA, visual grounding, and captioning, alongside two segmentation models (MedSAM, SAM-Med2D) with five fine-tuning strategies.

Our findings reveal: (1) Fine-tuning strategy dominates robustness: LoRA exhibits nearly double the degradation of full fine-tuning, while SAM-Med2D's Adapter offers a favorable efficiency–robustness trade-off. (2) Medical-specific perturbations disproportionately damage segmentation: 9 of the top 15 most harmful corruptions are domain-specific. (3) LoRA-tuned visual grounding drops by over 40 points, whereas zero-shot captioning remains stable (<7% drop). These results provide deployment guidelines and underscore the necessity of domain-specific robustness evaluation for medical AI.

Key Findings

01

Fine-tuning strategy dominates robustness. LoRA exhibits ≈2× the degradation of full fine-tuning. SAM-Med2D's Adapter offers the best efficiency–robustness trade-off among PEFT methods.

02

Medical-specific corruptions are disproportionately harmful. 9 of the top 15 perturbations are domain-specific, so standard benchmarks underestimate real deployment risks.

03

Task formulation determines VLM robustness. LoRA-tuned Grounding drops >40 points, while zero-shot Captioning stays stable (<7% drop).

04

General VLMs excel at VQA but fail on Grounding. Gemini-2.5-flash: 54% relative drop. Medical VLMs are more stable; MedGemma shows the smallest drops overall.

Results

Comprehensive evaluation of segmentation models and VLMs under 40 perturbation types at 5 severity levels.

Fig. 3. Left (Segmentation): (a) Performance–robustness trade-off. (b) Strategy ranking. (c) Model comparison. (d) Dataset sensitivity. (e) Top 15 perturbations. (f) Severity level impact. Right (VLMs): (g–i) Clean vs. perturbed on VQA, Grounding, Captioning. (j–l) Per-perturbation impact.

Segmentation — Strategy Ranking

Rank  Strategy          IoU Drop
1     Full fine-tuning  0.025
2     Dec-Only          0.029
2     Enc-Partial       0.029
2     Dec-Prompt        0.029
5     Adapter           0.033
6     LoRA              0.048 (≈2×)
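The ranks and the "≈2×" annotation follow directly from the IoU drops in the table. The sketch below recomputes both; the helper name `competition_rank` is ours, and the tie handling (tied strategies share a rank, the next rank is skipped) is inferred from the 1, 2, 2, 2, 5, 6 pattern rather than stated in the source.

```python
# IoU drops copied from the strategy ranking table.
iou_drop = {
    "Full fine-tuning": 0.025,
    "Dec-Only": 0.029,
    "Enc-Partial": 0.029,
    "Dec-Prompt": 0.029,
    "Adapter": 0.033,
    "LoRA": 0.048,
}


def competition_rank(scores):
    """Rank ascending; tied values share a rank and the following ranks are skipped."""
    ordered = sorted(scores.values())
    # list.index returns the first position of a value, so ties share one rank.
    return {name: ordered.index(value) + 1 for name, value in scores.items()}


ranks = competition_rank(iou_drop)
baseline = iou_drop["Full fine-tuning"]
ratio = {name: value / baseline for name, value in iou_drop.items()}

for name in iou_drop:
    print(f"{ranks[name]}  {name:<17} {iou_drop[name]:.3f}  ({ratio[name]:.2f}x full FT)")
```

LoRA comes out at 0.048 / 0.025 = 1.92 times the drop of full fine-tuning, which is the "≈2×" figure, while Adapter sits at 1.32 times.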

VLM — Task Robustness

Task          Setting    Drop
Captioning    Zero-shot  <0.02 BLEU
VQA (med.)    Zero-shot  <8 pts
VQA (Gemini)  Zero-shot  36.1 pts (54%)
Grounding     LoRA FT    >40 pts
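The table mixes absolute drops (points) with a relative drop (percent). Assuming the usual definition, relative drop = (clean − perturbed) / clean, the Gemini row can be unpacked as below. The clean and perturbed scores are back-computed from the two reported numbers; they are not reported on this page, and because 54% is rounded they are approximate.

```python
def relative_drop(clean, perturbed):
    """Fraction of the clean score lost under perturbation."""
    return (clean - perturbed) / clean


absolute_drop = 36.1   # points, from the table
reported_relative = 0.54  # 54%, from the table (rounded)

# Implied scores, derived here rather than reported in the source.
implied_clean = absolute_drop / reported_relative
implied_perturbed = implied_clean - absolute_drop

print(f"implied clean score:     {implied_clean:.1f}")
print(f"implied perturbed score: {implied_perturbed:.1f}")
print(f"relative drop check:     {relative_drop(implied_clean, implied_perturbed):.2%}")
```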

Benchmark Coverage

Perturbation Types (40 total)

  • Base (12): Gaussian/salt-pepper/speckle noise, Gaussian/motion blur, brightness, contrast, JPEG, pixelation, rotation, scaling, translation
  • Med-Specific (28): CT metal artifacts, MRI ghosting & bias-field, US acoustic shadowing, pathology stain variations, endoscopy bubbles & specular reflections, OCT shadow/blink/defocus, X-ray scatter & exposure, angiography haze
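One way to picture how base and modality-specific perturbations combine into an evaluation grid is the registry sketched below. The names come from the two bullets above, but the data structure, the `evaluation_grid` helper, and the modality keys are illustrative assumptions rather than the benchmark's actual code. Only the 13 medical-specific perturbations named above appear; the remaining ones that make up the 28 are not enumerated on this page.

```python
# The 12 base perturbations, applicable to every modality.
BASE = [
    "gaussian_noise", "salt_pepper_noise", "speckle_noise",
    "gaussian_blur", "motion_blur",
    "brightness", "contrast", "jpeg", "pixelation",
    "rotation", "scaling", "translation",
]

# Modality-specific perturbations named on this page (a partial list).
MODALITY_SPECIFIC = {
    "CT": ["metal_artifacts"],
    "MRI": ["ghosting", "bias_field"],
    "Ultrasound": ["acoustic_shadowing"],
    "Pathology": ["stain_variation"],
    "Endoscopy": ["bubbles", "specular_reflections"],
    "OCT": ["shadow", "blink", "defocus"],
    "X-ray": ["scatter", "exposure"],
    "Angiography": ["haze"],
}

SEVERITY_LEVELS = (1, 2, 3, 4, 5)


def evaluation_grid(modality):
    """All (perturbation, severity) pairs to evaluate for one modality."""
    names = BASE + MODALITY_SPECIFIC.get(modality, [])
    return [(name, level) for name in names for level in SEVERITY_LEVELS]


mri_grid = evaluation_grid("MRI")
print(len(mri_grid), "MRI conditions, e.g.", mri_grid[-1])
```

Under this layout an MRI image is evaluated under (12 + 2) × 5 = 70 perturbed conditions, while a modality with no listed specific artifacts falls back to the 60 base conditions.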

Datasets & Models

  • Segmentation: ISIC 2016, Brain Tumor MRI, Glaucoma Disc/Cup, Kvasir-SEG
  • VLM: OmniMedVQA, ROCOv2, MeCoVQA
  • Seg Models: MedSAM, SAM-Med2D
  • VLMs: LLaVA-Med, MedGemma, MedGemma-1.5, GPT-4o-mini, Gemini-2.5-flash
  • Modalities: Dermoscopy, MRI, Fundus / OCT, Endoscopy, CT, X-ray, Ultrasound, Pathology, Angiography

BibTeX

@inproceedings{cui2026medfmrobust,
  title     = {MedFM-Robust: Benchmarking Robustness of
               Medical Foundation Models},
  author    = {Cui, Xiangxiang and Huang, Tianjin and Wang,
               Yifang and Hu, Lijie and Yin, Lu},
  booktitle = {Medical Image Computing and Computer
               Assisted Intervention (MICCAI)},
  year      = {2026},
  note      = {Accepted by MICCAI 2026}
}