MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

40 Perturbation Types

8 Imaging Modalities

5 VLMs Evaluated

2 Segmentation Models

5 Fine-tuning Strategies

5 Severity Levels

Abstract

Medical foundation models have achieved remarkable clinical performance, yet their robustness under real-world perturbations remains underexplored. We present a robustness benchmark comprising 40 perturbation types (12 base, 28 medical-specific) across eight imaging modalities, evaluating five VLMs (LLaVA-Med, MedGemma, MedGemma-1.5, Gemini-2.5-flash and GPT-4o-mini) on VQA, visual grounding, and captioning, alongside two segmentation models (MedSAM, SAM-Med2D) with five fine-tuning strategies.

Our findings reveal: (1) Fine-tuning strategy dominates robustness, with LoRA exhibiting nearly double the degradation of full fine-tuning, while SAM-Med2D's Adapter offers a favorable efficiency–robustness trade-off. (2) Medical-specific perturbations disproportionately damage segmentation, with 9 of the 15 most harmful corruptions being domain-specific. (3) LoRA-tuned visual grounding drops over 40 points, whereas zero-shot captioning remains stable (<7% drop). These results provide deployment guidelines and underscore the necessity of domain-specific robustness evaluation for medical AI.

Key Findings

Fine-tuning strategy dominates robustness. LoRA exhibits ≈2× the degradation of full fine-tuning. SAM-Med2D's Adapter offers the best efficiency–robustness trade-off among the parameter-efficient fine-tuning (PEFT) strategies.

Medical-specific corruptions are disproportionately harmful. 9 of the top 15 perturbations are domain-specific, so standard benchmarks underestimate real deployment risks.

Task formulation determines VLM robustness. LoRA-tuned Grounding drops >40 points, while zero-shot Captioning stays stable (<7% drop).

General VLMs excel at VQA but fail on Grounding. Gemini-2.5-flash: 54% relative drop. Medical VLMs are more stable; MedGemma shows the smallest drops overall.

Results

Comprehensive evaluation of segmentation models and VLMs under 40 perturbation types at 5 severity levels.

Fig. 3. Left (Segmentation): (a) Performance–robustness trade-off. (b) Strategy ranking. (c) Model comparison. (d) Dataset sensitivity. (e) Top 15 perturbations. (f) Severity level impact. Right (VLMs): (g–i) Clean vs. perturbed on VQA, Grounding, Captioning. (j–l) Per-perturbation impact.
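The segmentation results are reported as IoU drops. As a minimal sketch of how such a number can be computed, the snippet below measures IoU on binary masks and subtracts the mean perturbed IoU from the clean IoU. The helper names (`iou`, `iou_drop`), the toy masks, and the plain averaging over perturbed predictions are illustrative assumptions; this page does not specify the benchmark's exact aggregation.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (lists of rows of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def iou_drop(clean_iou, perturbed_ious):
    """Clean IoU minus the mean IoU over perturbed inputs (assumed aggregation)."""
    return clean_iou - sum(perturbed_ious) / len(perturbed_ious)


# Toy ground truth and predictions (hypothetical data, for illustration only).
target = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
clean_pred = [[0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
perturbed_preds = [
    [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]],  # misses one foreground pixel
    [[0, 1, 1, 1], [0, 1, 1, 0], [0, 0, 0, 0]],  # adds one false-positive pixel
]

clean_score = iou(clean_pred, target)
perturbed_scores = [iou(p, target) for p in perturbed_preds]
drop = iou_drop(clean_score, perturbed_scores)
print(f"clean IoU {clean_score:.3f}, IoU drop {drop:.3f}")
```

A lower drop means the model's masks change less when the input is corrupted, which is the sense in which the strategies are ranked.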

Segmentation — Strategy Ranking

Rank	Strategy	IoU Drop
1	Full fine-tuning	0.025
2	Dec-Only	0.029
2	Enc-Partial	0.029
2	Dec-Prompt	0.029
5	Adapter	0.033
6	LoRA	0.048 (≈2×)
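The ≈2× figure for LoRA can be checked directly from the table. The sketch below uses only the IoU drops listed above; the tie handling (tied strategies share the lowest rank) is an assumption that happens to reproduce the Rank column.

```python
# IoU drops copied from the strategy-ranking table.
iou_drop_by_strategy = {
    "Full fine-tuning": 0.025,
    "Dec-Only": 0.029,
    "Enc-Partial": 0.029,
    "Dec-Prompt": 0.029,
    "Adapter": 0.033,
    "LoRA": 0.048,
}

baseline = iou_drop_by_strategy["Full fine-tuning"]

# Degradation of each strategy relative to full fine-tuning.
ratio = {name: d / baseline for name, d in iou_drop_by_strategy.items()}


def competition_rank(scores):
    """Rank ascending; tied scores share the lowest rank (1, 2, 2, 2, 5, 6)."""
    ordered = sorted(scores.values())
    return {name: ordered.index(value) + 1 for name, value in scores.items()}


ranks = competition_rank(iou_drop_by_strategy)

for name in sorted(iou_drop_by_strategy, key=iou_drop_by_strategy.get):
    print(f"{ranks[name]}  {name:<17} {iou_drop_by_strategy[name]:.3f}  "
          f"{ratio[name]:.2f}x")
```

LoRA's ratio comes out at about 1.92, which is the "nearly double" quoted in the abstract.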

VLM — Task Robustness

Task	Setting	Drop
Captioning	Zero-shot	<0.02 BLEU
VQA (med.)	Zero-shot	<8 pts
VQA (Gemini)	Zero-shot	36.1 pts (54%)
Grounding	LoRA FT	>40 pts
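The Gemini row reports both an absolute drop (36.1 points) and a relative drop (54%). The two are linked by relative drop = (clean − perturbed) / clean. The sketch below back-computes the clean score these two numbers imply; that score is not stated on this page, so the result is only an estimate limited by the rounding of the 54% figure. The function names are illustrative.

```python
def relative_drop(clean, perturbed):
    """Fraction of the clean score lost under perturbation."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return (clean - perturbed) / clean


def implied_clean_score(absolute_drop, relative):
    """Clean score consistent with a given absolute and relative drop."""
    if not 0 < relative <= 1:
        raise ValueError("relative drop must be in (0, 1]")
    return absolute_drop / relative


# Values from the Gemini row of the table.
clean_est = implied_clean_score(36.1, 0.54)
perturbed_est = clean_est - 36.1

print(f"implied clean score ~{clean_est:.1f}, perturbed ~{perturbed_est:.1f}")
print(f"check: relative drop = {relative_drop(clean_est, perturbed_est):.2f}")
```

Reporting both forms is useful because the same absolute drop is far more damaging for a model that starts from a low clean score.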

Benchmark Coverage

Perturbation Types (40 total)

Base (12): Gaussian/salt-pepper/speckle noise, Gaussian/motion blur, brightness, contrast, JPEG, pixelation, rotation, scaling, translation
Med-Specific (28): CT metal artifacts, MRI ghosting & bias-field, US acoustic shadowing, pathology stain variations, endoscopy bubbles & specular reflections, OCT shadow/blink/defocus, X-ray scatter & exposure, angiography haze
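Each perturbation is applied at five severity levels. As a rough sketch of what a severity-parameterized corruption looks like, the snippet below implements Gaussian noise, one of the base types, in pure Python. The `SIGMA_BY_SEVERITY` values are hypothetical placeholders, not the benchmark's actual parameters, and the image is represented as a list of rows of floats in [0, 1] for simplicity.

```python
import random

# Hypothetical noise standard deviations for severity levels 1-5;
# the benchmark's actual parameters are not given on this page.
SIGMA_BY_SEVERITY = {1: 0.02, 2: 0.04, 3: 0.08, 4: 0.12, 5: 0.18}


def gaussian_noise(image, severity, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image with values in [0, 1].

    `image` is a list of rows of floats; the output is clipped back to [0, 1].
    """
    if severity not in SIGMA_BY_SEVERITY:
        raise ValueError("severity must be an integer from 1 to 5")
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    sigma = SIGMA_BY_SEVERITY[severity]
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


def mean_abs_change(a, b):
    """Mean absolute per-pixel difference between two images."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Uniform mid-gray test image (toy input).
image = [[0.5] * 32 for _ in range(32)]
changes = [mean_abs_change(image, gaussian_noise(image, s)) for s in range(1, 6)]
for severity, change in enumerate(changes, start=1):
    print(f"severity {severity}: mean |delta| = {change:.4f}")
```

The medical-specific corruptions (metal artifacts, ghosting, stain variation, and so on) would need modality-specific models, but they can follow the same interface of an image plus a severity level.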

Datasets & Models

Segmentation: ISIC 2016, Brain Tumor MRI, Glaucoma Disc/Cup, Kvasir-SEG
VLM: OmniMedVQA, ROCOv2, MeCoVQA
Seg Models: MedSAM, SAM-Med2D
VLMs: LLaVA-Med, MedGemma, MedGemma-1.5, GPT-4o-mini, Gemini-2.5-flash

Modalities: Dermoscopy, MRI, Fundus / OCT, Endoscopy, CT, X-ray, Ultrasound, Pathology, Angiography

BibTeX

@inproceedings{cui2026medfmrobust,
  title     = {MedFM-Robust: Benchmarking Robustness of
               Medical Foundation Models},
  author    = {Cui, Xiangxiang and Huang, Tianjin and Wang,
               Yifang and Hu, Lijie and Yin, Lu},
  booktitle = {Medical Image Computing and Computer
               Assisted Intervention (MICCAI)},
  year      = {2026},
  note      = {Accepted by MICCAI 2026}
}