LEAF: Latent Diffusion with Efficient Encoder Distillation for Aligned Features in Medical Image Segmentation

MICCAI 2025

¹Sun Yat-sen University, ²Johns Hopkins University. Corresponding author.

Abstract

Diffusion models have yielded effective results in medical image segmentation. However, existing methods typically reuse the original diffusion training process without adapting it to segmentation. Furthermore, commonly used pre-trained diffusion models remain limited in feature extraction.

Based on these considerations, we propose LEAF, a medical image segmentation model built on latent diffusion models. During fine-tuning, we replace the original noise-prediction objective with direct prediction of the segmentation map, thereby reducing the variance of the segmentation results. We also employ a feature distillation method that aligns the hidden states of the convolutional layers with the features from a transformer-based vision encoder.

Experimental results demonstrate that our method enhances the performance of the original diffusion model across multiple segmentation datasets for different disease types. Notably, our approach does not alter the model architecture, nor does it increase the number of parameters or computation during the inference phase, making it highly efficient.

Methodology

Fine-tuning Framework

The pre-trained VAE encoder $\mathcal{E}$ first encodes the image $\mathbf{x}$ and the annotation $\mathbf{y}$ into the latent space. Then: ① the denoiser U-Net $f_\theta$ is fine-tuned using $x_0$-prediction; ② a feature alignment method distills a vision encoder to enhance the U-Net's representations; ③ a single-step reverse process is used for efficient inference.
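The two training ingredients can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear noise schedule, the MSE form of the $x_0$-prediction loss, the cosine-similarity form of the alignment loss, and the weight `lam` are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(a_t) * z0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    a_t = alpha_bar[t]
    z_t = np.sqrt(a_t) * z0 + np.sqrt(1.0 - a_t) * eps
    return z_t, eps

def x0_prediction_loss(pred_z0, z0_y):
    """MSE against the clean annotation latent instead of the noise (assumed form)."""
    return float(np.mean((pred_z0 - z0_y) ** 2))

def alignment_loss(unet_feats, encoder_feats, tiny=1e-8):
    """1 - mean cosine similarity between feature rows (assumed form).

    Both inputs have shape (num_tokens, dim); in practice the U-Net hidden
    states would first be projected to the vision encoder's dimension.
    """
    u = unet_feats / (np.linalg.norm(unet_feats, axis=-1, keepdims=True) + tiny)
    e = encoder_feats / (np.linalg.norm(encoder_feats, axis=-1, keepdims=True) + tiny)
    return float(1.0 - np.mean(np.sum(u * e, axis=-1)))

def training_loss(pred_z0, z0_y, unet_feats, encoder_feats, lam=0.5):
    """Segmentation term plus weighted alignment term (lam is a made-up value)."""
    seg = x0_prediction_loss(pred_z0, z0_y)
    align = alignment_loss(unet_feats, encoder_feats)
    return seg + lam * align
```

In a training step under these assumptions, a random `t` would be drawn, `add_noise` applied to the annotation latent, the result concatenated with the image latent as the U-Net input, and `training_loss` computed from the U-Net output and its intermediate features.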

LEAF training scheme

Inference Scheme

The noisy latent variable \(\mathbf{z_t^y}\) and the encoded image \(\mathbf{z^x}\) are concatenated to form the U-Net input. We set \(t\) to a random time step during training and use a single-step reverse process during inference. The U-Net then predicts the latent segmentation map, which is decoded to obtain the final output. The feature alignment module is removed during inference, so it incurs no additional computational overhead.
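The single-step inference path described above might be sketched as follows. This is an illustration under stated assumptions rather than the released code: starting from pure Gaussian noise at the largest time step, concatenating along the channel axis, and the `unet` and `decode` callables (with the toy stand-ins below) are all hypothetical.

```python
import numpy as np

def single_step_inference(z_x, unet, decode, t=999, rng=None):
    """Predict a segmentation from an image latent with one U-Net call.

    z_x:    encoded image latent, shape (C, H, W)
    unet:   callable (input of shape (2C, H, W), t) -> predicted clean latent
    decode: callable mapping the predicted latent to the final output
    """
    if rng is None:
        rng = np.random.default_rng()
    # Assumption: the noisy annotation latent is pure Gaussian noise at inference.
    z_t = rng.standard_normal(z_x.shape)
    # Concatenate noisy annotation latent and image latent along channels.
    unet_input = np.concatenate([z_t, z_x], axis=0)
    # x0-prediction: the network outputs the clean latent directly, so no
    # iterative sampling loop is needed.
    pred_z0 = unet(unet_input, t)
    return decode(pred_z0)

def toy_unet(unet_input, t):
    """Stand-in for the fine-tuned U-Net: returns the image-latent half."""
    c = unet_input.shape[0] // 2
    return unet_input[c:]

def toy_decode(z):
    """Stand-in for the decoder: channel mean thresholded at zero."""
    return z.mean(axis=0) > 0.0
```

Calling `single_step_inference(z_x, toy_unet, toy_decode)` with a `(4, 8, 8)` latent returns an `(8, 8)` boolean mask. No feature alignment module appears anywhere in this path, which matches the claim that alignment adds no inference cost.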

LEAF inference scheme

Experiments

Multi-Type Medical Image Segmentation

Quantitative comparison of LEAF against discriminative and diffusion-based models across four medical image segmentation datasets.

Comparison with other methods

Ablation Study

The $x_{0}$-prediction variant enhanced with feature alignment consistently outperforms all other variants, with especially large improvements over the standard $\epsilon$-prediction approach.
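As background on the two parameterizations compared here, the following assumes the standard DDPM forward process with cumulative signal level $\bar{\alpha}_t$; this page does not state the noise schedule or the exact loss, so the formulas are a sketch rather than the paper's definitions.

$$
\mathbf{z}_t^{y} = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0^{y} + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
$$

$$
\mathcal{L}_{\epsilon} = \big\| \boldsymbol{\epsilon} - f_\theta(\mathbf{z}_t^{y}, \mathbf{z}^{x}, t) \big\|^2, \qquad
\mathcal{L}_{x_0} = \big\| \mathbf{z}_0^{y} - f_\theta(\mathbf{z}_t^{y}, \mathbf{z}^{x}, t) \big\|^2
$$

With $\epsilon$-prediction, a clean latent has to be recovered from the predicted noise $\hat{\boldsymbol{\epsilon}}$:

$$
\hat{\mathbf{z}}_0^{y} = \frac{\mathbf{z}_t^{y} - \sqrt{1-\bar{\alpha}_t}\,\hat{\boldsymbol{\epsilon}}}{\sqrt{\bar{\alpha}_t}}
$$

An error in $\hat{\boldsymbol{\epsilon}}$ is therefore scaled by $\sqrt{(1-\bar{\alpha}_t)/\bar{\alpha}_t}$, which grows large at high noise levels where $\bar{\alpha}_t$ is small. One plausible reading of the ablation is that $x_0$-prediction avoids this amplification in a single-step setting, though that interpretation is ours and not a claim taken from the paper.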

Ablation

Please refer to our paper linked above for more technical details :)

BibTeX

@InProceedings{huang2025leaf,
    title={LEAF: Latent Diffusion with Efficient Encoder Distillation for Aligned Features in Medical Image Segmentation},
    author={Huang, Qilin and Lin, Tianyu and Chen, Zhiguang and Zheng, Fudan},
    booktitle={Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
    year={2025},
    publisher={Springer Nature Switzerland},
}