model icon

scgpt

Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

scGPT

This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.

Preprint

Documentation

!UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT checkpoints section for more details.

Installation

scGPT is available on PyPI. To install scGPT, run the following command:

$ pip install scgpt

[Optional] We recommend using wandb for logging and visualization.

$ pip install wandb

For developing, we are using the Poetry package manager. To install Poetry, follow the instructions here.

$ git clone this-repo-url
$ cd scGPT
$ poetry install

Note: The flash-attn dependency usually requires specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. For now, May 2023, we recommend using CUDA 11.7 and flash-attn<1.0.5 due to various issues reported about installing new versions of flash-attn.

Pretrained scGPT Model Zoo

Here is the list of pretrained models. Please find the links for downloading the checkpoint folders. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares similar cell type context with the training data of the organ-specific models, these models can usually demonstrate competitive performance as well.

Model name Description Download
whole-human (recommended) Pretrained on 33 million normal human cells. link
brain Pretrained on 13.2 million brain cells. link
blood Pretrained on 10.3 million blood and bone marrow cells. link
heart Pretrained on 1.8 million heart cells link
lung Pretrained on 2.1 million lung cells link
kidney Pretrained on 814 thousand kidney cells link
pan-cancer Pretrained on 5.7 million cells of various cancer types link

Fine-tune scGPT for scRNA-seq integration

Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder stored in the examples/save directory.

To-do-list

  • Upload the pretrained model checkpoint
  • Publish to pypi
  • Provide the pretraining code with generative attention masking
  • Finetuning examples for multi-omics integration, cell type annotation, perturbation prediction, cell generation
  • Example code for Gene Regulatory Network analysis
  • Documentation website with readthedocs
  • Bump up to pytorch 2.0
  • New pretraining on larger datasets
  • Reference mapping example
  • Publish to huggingface model hub

Contributing

We greatly welcome contributions to scGPT. Please submit a pull request if you have any ideas or bug fixes. We also welcome any issues you encounter while using scGPT.

Acknowledgements

We sincerely thank the authors of following open-source projects:

Citing scGPT

@article{cui2023scGPT,
title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}

Related notebook

BioTuring

Spatial charting of single-cell transcriptomes in tissues - celltrek

Single-cell RNA sequencing methods can profile the transcriptomes of single cells but cannot preserve spatial information. Conversely, spatial transcriptomics assays can profile spatial regions in tissue sections but do not have single-cell resolution. Here, Runmin Wei (Siyuan He, Shanshan Bai, Emi Sei, Min Hu, Alastair Thompson, Ken Chen, Savitri Krishnamurthy & Nicholas E. Navin) developed a computational method called CellTrek that combines these two datasets to achieve single-cell spatial mapping through coembedding and metric learning approaches. They benchmarked CellTrek using simulation and in situ hybridization datasets, which demonstrated its accuracy and robustness. They then applied CellTrek to existing mouse brain and kidney datasets and showed that CellTrek can detect topological patterns of different cell types and cell states. They performed single-cell RNA sequencing and spatial transcriptomics experiments on two ductal carcinoma in situ tissues and applied CellTrek to identify tumor subclones that were restricted to different ducts, and specific T-cell states adjacent to the tumor areas.

T-cells

More

BioTuring

DoubletFinder: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors

Single-cell RNA sequencing (scRNA-seq) data often encountered technical artifacts called "doublets" which are two cells that are sequenced under the same cellular barcode. Doublets formed from different cell types or states are called heterotypic and homotypic otherwise. These factors constrain cell throughput and may result in misleading biological interpretations. DoubletFinder (McGinnis, Murrow, and Gartner 2019) is one of the methods proposed for doublet detection. In this notebook, we will illustrate an example workflow of DoubletFinder. We use a 10x Genomics dataset which captures peripheral blood mononuclear cells (PBMCs) from a healthy donor stained with a panel of 31 TotalSeq™-B antibodies (BioLegend).

Mucus glandular cells

Breast myoepithelial cells

More

BioTuring

SpaCET: Cell type deconvolution and interaction analysis

Spatial transcriptomics (ST) technology has allowed to capture of topographical gene expression profiling of tumor tissues, but single-cell resolution is potentially lost. Identifying cell identities in ST datasets from tumors or other samples remains challenging for existing cell-type deconvolution methods. Spatial Cellular Estimator for Tumors (SpaCET) is an R package for analyzing cancer ST datasets to estimate cell lineages and intercellular interactions in the tumor microenvironment. Generally, SpaCET infers the malignant cell fraction through a gene pattern dictionary, then calibrates local cell densities and determines immune and stromal cell lineage fractions using a constrained regression model. Finally, the method can reveal putative cell-cell interactions in the tumor microenvironment. In this notebook, we will illustrate an example workflow for cell type deconvolution and interaction analysis on breast cancer ST data from 10X Visium. The notebook is inspired by SpaCET's vignettes and modified to demonstrate how the tool works on BioTuring's platform.

Ionocytes

Respiratory ciliated cells

More