UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation






Abstract

In the medical imaging domain, a common challenge is the collection of large-scale labeled data, due to the high cost of manual labeling, privacy concerns, and logistics. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 MRI 3D samples (17.9 M 2D images) and a total of more than 1.37 billion 2D segmentation masks of 72 organs, based on the UK Biobank MRI dataset. We utilize automatic labeling, introduce an automated label cleaning pipeline with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels through the zero-shot generalization of models trained on the filtered UKBOB to other small labeled datasets from a similar domain (e.g., abdominal MRI). To further alleviate the effect of the noisy labels, we propose a novel Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model, Swin-BOB, for 3D medical image segmentation based on Swin-UNetr, achieving state-of-the-art results in several benchmarks in 3D medical imaging, including the BRATS brain MRI tumour challenge (+0.4% improvement) and the BTCV abdominal CT scan benchmark (+1.3% improvement). The pre-trained model and our filtered labels will be made available with the UK Biobank.


Swin-BoB: Large-Scale Segmentation Model on Body MRIs from UK Biobank

Figure 1. UKBOB Size and Diversity. Our proposed UK Biobank Organs and Bones is the largest labeled medical imaging dataset for segmentation, comprising 51,761 MRI 3D samples (17.9 M 2D images) of body organs and a total of more than 1.37 billion 2D masks of 72 organs. We show a plot of the size (number of 2D images) and diversity (number of classes) of our dataset compared with other medical imaging datasets. The size of the bubbles indicates 2D image resolution.


Figure 2. UKBOB Labels with our Filtration. We show an example of labels in UKBOB on the coronal and sagittal planes.


Label Filtering

Figure 3. UKBOB Distribution of Labels with our Filtration. We show the distribution of the mean normalised volumes of the 72 labels before and after filtration.

Entropy Test-time Adaptation (ETTA)

Figure 4. Effect of Entropy Test-time Adaptation (ETTA). We show the benefit of using our proposed ETTA to improve the performance of fine-tuned models on BTCV (Xi Fang and Pingkun Yan, 2020), BRATS (Ujjwal Baid et al.; Spyridon Bakas et al.; Bjoern H. Menze et al.), and AMOS (Yuanfeng Ji et al., 2022). The best results are obtained by fine-tuning our baseline Swin-BOB pretrained on the UKBOB dataset. ETTA improves different networks on different downstream tasks and outperforms the TTA baseline.
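This page does not spell out how ETTA works internally. As a generic illustration of the quantity the method is named after, and not the authors' procedure, the sketch below computes the voxel-wise entropy of a segmentation network's softmax output with NumPy. The function name `voxel_entropy` and the class-first array layout are assumptions made for this example.

```python
import numpy as np

def voxel_entropy(logits, axis=0):
    """Shannon entropy (in nats) of the softmax over the class axis.

    logits: array whose `axis` dimension indexes classes (assumed layout),
            e.g. shape (num_classes, D, H, W) for one 3D volume.
    Returns an array with the class axis removed; high values mark voxels
    where the prediction is uncertain.
    """
    logits = np.asarray(logits, dtype=float)
    # Subtract the per-voxel maximum for a numerically stable softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=axis, keepdims=True)
    # Small constant avoids log(0) for fully confident predictions.
    return -(probs * np.log(probs + 1e-12)).sum(axis=axis)
```

Entropy-based test-time adaptation schemes in general use a map like this as a signal to reduce at inference time; which parameters are adapted and how is specific to each method.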


Zero-Shot Segmentation on Out-of-Domain (OOD) Data

Swin-BOB achieves state-of-the-art results, with an improvement of up to 0.02 in Dice score and a reduction of 2.4 in mean Hausdorff distance.
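For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of how it is commonly computed for one class from two integer label volumes. The function name `dice_score` and the choice to return 1.0 when the class is absent from both volumes are assumptions of this example, not details taken from the paper's evaluation code.

```python
import numpy as np

def dice_score(pred, target, label):
    """Dice overlap for a single class between two integer label arrays."""
    pred_mask = np.asarray(pred) == label
    target_mask = np.asarray(target) == label
    denom = pred_mask.sum() + target_mask.sum()
    if denom == 0:
        # Assumed convention: class absent from both volumes counts as a match.
        return 1.0
    intersection = np.logical_and(pred_mask, target_mask).sum()
    return 2.0 * intersection / denom
```

A score of 1 means perfect overlap and 0 means none, so a gain of 0.02 corresponds to two percentage points of overlap.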

Table 1: 3D Segmentation Performance of Swin-BoB on BTCV dataset.

Figure 5: Qualitative comparison of Swin-BoB and baseline models on BTCV dataset.

Effect of Filtering

To collect the labels of UKBOB, we leverage automatic labeling based on TotalVibeSegmentator (Graf R. et al., 2024). However, automatic labeling of such a vast dataset introduces the possibility of noisy and erroneous labels. To tackle this issue, we propose a novel statistical filtering mechanism for organ labels, the Specialized Organ Labels Filter (SOLF). This raises the question of how to distinguish segmentation failures from common patient abnormalities. It is known that human organs follow typical geometric properties that arise from the body’s need to optimize function while minimizing energy expenditure and structural stress. To this end, we propose to use three features jointly: normalized volume, eccentricity, and sphericity. A sample is flagged as inaccurate if at least two of the three features fall outside their respective acceptable ranges. Setting Ε for SOLF effectively discards samples with anomalous organ characteristics while retaining valid labels.
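The two-of-three voting rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the acceptable range of each feature is taken here to be the interquartile-range fence `[Q1 - k*IQR, Q3 + k*IQR]` computed over all samples of one organ, and the names `iqr_range`, `flag_samples`, `k`, and `min_violations` are hypothetical.

```python
import numpy as np

def iqr_range(values, k=1.5):
    """Assumed acceptable range: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_samples(features, k=1.5, min_violations=2):
    """Flag samples whose organ label looks anomalous.

    features: array of shape (n_samples, 3) for one organ, with columns
              normalized volume, eccentricity, sphericity.
    Returns a boolean array, True where at least `min_violations`
    features fall outside their acceptable range.
    """
    features = np.asarray(features, dtype=float)
    violations = np.zeros(features.shape[0], dtype=int)
    for j in range(features.shape[1]):
        low, high = iqr_range(features[:, j], k)
        column = features[:, j]
        # Count one violation per feature that lies outside its range.
        violations += ((column < low) | (column > high)).astype(int)
    return violations >= min_violations
```

Requiring two features to disagree, rather than one, is what lets a sample with a single unusual property (for example an enlarged but normally shaped organ) pass the filter.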

Figure 6. UKBOB Filtration Effect on Zero-shot 3D Segmentation Performance. We show the zero-shot performance of Swin-BoB on external MRI data (AMOS) and CT data (BTCV) for the same organ classes. We use this zero-shot performance to calibrate the quality of the collected labels in UKBOB. IQR refers to the interquartile range, a measure of variability based on dividing a data set into quartiles.

ETTA for Refined Segmentation on OOD Data

Figure 7. We show the benefit of using our proposed ETTA to improve the performance of fine-tuned models on BTCV, BRATS, and AMOS. The best results are obtained by fine-tuning our baseline Swin-BOB pretrained on the UKBOB dataset. ETTA improves different networks on different downstream tasks and outperforms the TTA baseline.

References

Bjoern H. Menze et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE Transactions on Medical Imaging, 2015.

Ling Zhang, Xiaosong Wang, Dong Yang, Thomas Sanford, Stephanie A. Harmon, Baris I Turkbey, Bradford J. Wood, Holger R. Roth, Andriy Myronenko, Daguang Xu, and Ziyue Xu. Generalizing deep learning for medical image segmentation to unseen domains via deep stacked transformation. IEEE Transactions on Medical Imaging, 39:2531–2540, 2020.

Robert Graf, Paul-Sören Platzek, Evamaria Olga Riedel, Constanze Ramschütz, Sophie Starck, Hendrik Kristian Möller, Matan Atad, Henry Völzke, Robin Bülow, Carsten Oliver Schmidt, et al. Totalvibesegmentator: Full body mri segmentation for the nako and uk biobank. arXiv preprint arXiv:2406.00125, 2024.

Spyridon Bakas et al. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific Data, 4, Sept. 2017.

Ujjwal Baid et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification.

Xi Fang and Pingkun Yan. Multi-organ segmentation over partially labeled datasets with multi-scale feature abstraction, 2020.

Yuanfeng Ji, Haotian Bai, Chongjian GE, Jie Yang, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhanng, Wanling Ma, Xiang Wan, and Ping Luo. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 36722–36732. Curran Associates, Inc., 2022.


Citation

@misc{bourigault2025ukbob,
  title = {UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation},
  author = {Emmanuelle Bourigault and Amir Jamaludin and Abdullah Hamdi},
  year = {2024},
  eprint = {2404.19604},
  archivePrefix = {arXiv},
  primaryClass = {eess.IV},
}