A recent study, published in Nature Machine Intelligence, demonstrates how self-supervised learning (SSL) can enhance the analysis of single-cell genomics data. After training models on over 20 million single-cell RNA sequencing profiles, the researchers found that SSL outperforms traditional supervised methods in cell-type prediction, gene-expression reconstruction, and cross-study data integration.
SSL is a machine learning approach that extracts patterns from large collections of unlabeled data, typically by training a model on an auxiliary task such as reconstructing masked portions of its own input. In this study, the researchers adapted SSL techniques to single-cell genomics and evaluated them across a range of benchmarks.
One major finding was that SSL pre-training on large datasets significantly improved cell-type prediction on smaller or unseen datasets. In peripheral blood mononuclear cell (PBMC) data from COVID-19 patients, for example, SSL raised the macro F1 score, the unweighted average of per-class F1 scores, from 0.70 to 0.75. In another dataset, SSL correctly classified over 89 percent of type II pneumocytes, a specific lung cell type, versus just 32 percent under supervised learning.
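Macro F1 averages the F1 score of every class with equal weight, so rare cell types count as much as abundant ones; this makes it a stricter metric than plain accuracy on imbalanced cell-type data. A from-scratch computation on invented toy labels (the cell-type names are illustrative, not from the study):

```python
# Macro F1: the unweighted mean of per-class F1 scores.
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: three cell types, two errors out of six cells.
y_true = ["B", "B", "T", "T", "T", "NK"]
y_pred = ["B", "T", "T", "T", "NK", "NK"]
print(round(macro_f1(y_true, y_pred), 3))  # -> 0.667
```

A misclassified cell from a rare type drags macro F1 down far more than the same error would drag down overall accuracy, which is why gains like 0.70 to 0.75 are meaningful on heterogeneous tissue data.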
SSL also excelled in "zero-shot" settings, where models analyze new datasets without additional fine-tuning. For example, the models achieved high accuracy in predicting cell types across datasets published after the initial training.
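One common way to get zero-shot predictions from a pre-trained encoder is to embed a labeled reference atlas, compute a centroid per cell type, and assign new cells to the nearest centroid, touching no model weights. The sketch below is a hedged illustration of that pattern, not the study's pipeline: PCA stands in for the pre-trained SSL encoder, and all data and type names are synthetic.

```python
# Zero-shot cell-type assignment by nearest centroid in embedding space.
# PCA is a stand-in for a pre-trained SSL encoder; data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_genes = 50

# Reference atlas: two synthetic cell types with shifted expression.
ref_a = rng.normal(loc=0.0, size=(100, n_genes))
ref_b = rng.normal(loc=3.0, size=(100, n_genes))
ref = np.vstack([ref_a, ref_b])

# "Pre-training": learn an 8-dimensional embedding from the reference
# alone (top principal components via SVD).
mu = ref.mean(0)
_, _, Vt = np.linalg.svd(ref - mu, full_matrices=False)
encode = lambda X: (X - mu) @ Vt[:8].T

centroids = {"type_a": encode(ref_a).mean(0),
             "type_b": encode(ref_b).mean(0)}

# Zero-shot step: cells from a later "study" (here, type_b-like) are
# labeled with no fine-tuning, only encoding and a nearest-centroid lookup.
new_cells = rng.normal(loc=3.0, size=(10, n_genes))
labels = [min(centroids, key=lambda k: np.linalg.norm(e - centroids[k]))
          for e in encode(new_cells)]
print(labels)
```

The appeal of this setting is operational: once the encoder is trained, labeling a newly published dataset costs one forward pass rather than a new training run.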
The study further demonstrated SSL's potential for integrating datasets from different experimental conditions and for multimodal prediction; for instance, SSL improved the prediction of protein expression from RNA data.
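The cross-modality task amounts to learning a map from a cell's RNA profile to its protein abundances. A deliberately simple sketch on synthetic data, with plain least squares standing in for the SSL-based model (the study's gains come from better RNA representations feeding such a predictor, not from the regression itself):

```python
# Toy RNA -> protein prediction: fit a linear map on one cohort of cells,
# evaluate on held-out cells. Synthetic data; least squares stands in for
# the representation-based model used in the study.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_proteins = 300, 40, 10

rna = rng.normal(size=(n_cells, n_genes))
true_map = 0.3 * rng.normal(size=(n_genes, n_proteins))
protein = rna @ true_map + 0.1 * rng.normal(size=(n_cells, n_proteins))

# Fit on 200 cells, evaluate on the remaining 100 (a held-out "study").
train, test = slice(0, 200), slice(200, None)
B, *_ = np.linalg.lstsq(rna[train], protein[train], rcond=None)
pred = rna[test] @ B

mse = np.mean((pred - protein[test]) ** 2)
baseline = np.mean((protein[test] - protein[train].mean(0)) ** 2)
print(f"held-out MSE {mse:.3f} vs mean-baseline {baseline:.3f}")
```

In practice the input to such a predictor would be the SSL embedding of each cell rather than raw counts, which is where the reported improvement enters.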
The researchers emphasized that SSL's success depends on pre-training with diverse, high-quality datasets. By addressing typical challenges of single-cell data, such as sparsity and batch effects, SSL offers a scalable and powerful approach that could advance precision medicine research.