Skip to main content

Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis

Posted by on Monday, July 25, 2022 in Abdomen Imaging, Big Data, Body-Wise, Computed Tomography, Deep Learning.

Tang, Yucheng, Dong Yang, Wenqi Li, Holger R. Roth, Bennett Landman, Daguang Xu, Vishwesh Nath, and Ali Hatamizadeh. “Self-supervised pre-training of swin transformers for 3d medical image analysis.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20730-20740. 2022.
Full text: 

Abstract

Vision Transformers (ViT)s have shown great performance in self-supervised learning of global and local representations that can be transferred to downstream applications. Inspired by these results, we introduce a novel self-supervised learning framework with tailored proxy tasks for medical image analysis. Specifically, we propose: (i) a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pre-training; (ii) tailored proxy tasks for learning the underlying pattern of human anatomy. We demonstrate successful pre-training of the proposed model on 5,050 publicly available computed tomography (CT) images from various body organs. The effectiveness of our approach is validated by fine-tuning the pre-trained models on the Beyond the Cranial Vault (BTCV) Segmentation Challenge with 13 abdominal organs and segmentation tasks from the Medical Segmentation Decathlon (MSD) dataset. Our model is currently the state-of-the-art on the public test leaderboards of both MSD and BTCV datasets. Code: https://monai.io/research/swin unetr.

Overview of our proposed pre-training framework. Input CT images are randomly cropped into sub-volumes and augmented with random inner cutout and rotation, then fed to the Swin UNETR encoder as input. We use masked volume inpainting, contrastive learning and rotation prediction as proxy tasks for learning contextual representations of input images
Overview of our proposed pre-training framework. Input CT images are randomly cropped into sub-volumes and augmented with random inner cutout and rotation, then fed to the Swin UNETR encoder as input. We use masked volume inpainting, contrastive learning and rotation prediction as proxy tasks for learning contextual representations of input images

 

Overview of the Swin UNETR architecture.
Overview of the Swin UNETR architecture.

Tags: , ,