library(knitr)
opts_chunk$set(cache = TRUE, out.width='100%', fig.align = 'center')
The code below will install some packages if you don't have them already.
library(Seurat)
library(SeuratWrappers)
library(DropletUtils)
library(ggplot2)
library(RColorBrewer)
library(knitr)
library(dplyr)
library(harmony)
library(rliger)
library(glmGamPoi)
library(patchwork)
library(cowplot)
Warning message: “package ‘Seurat’ was built under R version 4.2.2” Attaching SeuratObject Warning message: “package ‘DropletUtils’ was built under R version 4.2.2” Loading required package: SingleCellExperiment Warning message: “package ‘SingleCellExperiment’ was built under R version 4.2.2” Loading required package: SummarizedExperiment Warning message: “package ‘SummarizedExperiment’ was built under R version 4.2.2” Loading required package: MatrixGenerics Loading required package: matrixStats Warning message: “package ‘matrixStats’ was built under R version 4.2.2” Attaching package: ‘MatrixGenerics’ The following objects are masked from ‘package:matrixStats’: colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars Loading required package: GenomicRanges Warning message: “package ‘GenomicRanges’ was built under R version 4.2.2” Loading required package: stats4 Loading required package: BiocGenerics Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following objects are masked from ‘package:base’: expand.grid, I, unname Loading required package: IRanges Loading required package: GenomeInfoDb Warning message: “package ‘GenomeInfoDb’ was built under R version 4.2.2” Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Attaching package: ‘Biobase’ The following object is masked from ‘package:MatrixGenerics’: rowMedians The following objects are masked from ‘package:matrixStats’: anyMissing, rowMedians Attaching package: ‘SummarizedExperiment’ The following object is masked from ‘package:SeuratObject’: Assays The following object is masked from ‘package:Seurat’: Assays Warning message: “package ‘ggplot2’ was built under R version 4.2.2” Warning message: “package ‘knitr’ was built under R version 4.2.2” Warning message: “package ‘dplyr’ was built under R version 4.2.2” Attaching package: ‘dplyr’ The following object is masked from ‘package:Biobase’: combine The following objects are masked from ‘package:GenomicRanges’: intersect, setdiff, union The following object is masked from ‘package:GenomeInfoDb’: intersect The following objects are masked from ‘package:IRanges’: collapse, desc, intersect, setdiff, slice, union The following objects are masked from ‘package:S4Vectors’: first, intersect, rename, setdiff, setequal, union The following objects are masked from ‘package:BiocGenerics’: combine, intersect, setdiff, union The following object is masked from ‘package:matrixStats’: count The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: Rcpp Warning message: “package ‘Rcpp’ was built under R version 4.2.2” Loading required package: cowplot Loading required package: Matrix Warning message: “package ‘Matrix’ was built under R version 4.2.2” Attaching package: ‘Matrix’ The following object is masked from ‘package:S4Vectors’: expand Loading required package: patchwork Attaching package: ‘patchwork’ The following object is masked from ‘package:cowplot’: align_plots Attaching package: ‘rliger’ The following object is masked from ‘package:BiocGenerics’: normalize Warning message: “package ‘glmGamPoi’ was built under R version 4.2.2” Attaching package: ‘glmGamPoi’ The following object is masked from ‘package:dplyr’: vars The following object is masked from ‘package:ggplot2’: vars
Let's also load a custom function, plot_integrated_clusters, that is stored in /data as custom_seurat_functions.R
:
source("./data/cellgeni/custom_seurat_functions.R")
As more and more scRNA-seq datasets become available, carrying merged_seurat comparisons between them is key. There are two main approaches to comparing scRNASeq datasets. The first approach is "label-centric" which is focused on trying to identify equivalent cell-types/states across datasets by comparing individual cells or groups of cells. The other approach is "cross-dataset normalization" which attempts to computationally remove experiment-specific technical/biological effects so that data from multiple experiments can be combined and jointly analyzed.
We will illustrate this process using two very different datasets: 1) 10k PBMC (3' v3 chemistry) downloaded from 10x Genomics website; 2) major cell populations isolated from whole human blood using FACS, and sequenced using STRT-seq (GSE149938).
Although we already have all the necessary files in our /data
folder, we can download the necessary files from GEO database:
Let's read the processed file available via GEO, and the 10x file we've processed using SoupX (see another notebook in this repository):
umi_gz <- gzfile("./data/cellgeni/GSE149938_umi_matrix.csv.gz",'rt')
umi <- read.csv(umi_gz,check.names = F,quote = "")
matrix_3p <- Read10X("./data/cellgeni/soupX_pbmc10k_filt")
Next, let's make Seurat
objects and re-define some of the metadata columns (GEO dataset simply puts the cell type into the orig.ident
slot, which will interfere with what we want to do next):
srat_wb <- CreateSeuratObject(t(umi),project = "whole_blood")
srat_3p <- CreateSeuratObject(matrix_3p,project = "pbmc10k_3p")
rm(umi_gz)
rm(umi)
rm(matrix_3p)
Warning message: “Non-unique features (rownames) present in the input matrix, making unique”
colnames(srat_wb@meta.data)[1] <- "cell_type"
srat_wb@meta.data$orig.ident <- "whole_blood"
srat_wb@meta.data$orig.ident <- as.factor(srat_wb@meta.data$orig.ident)
head(srat_wb[[]])
cell_type | nCount_RNA | nFeature_RNA | orig.ident | |
---|---|---|---|---|
<fct> | <dbl> | <int> | <fct> | |
BNK_spBM1_L1_bar25 | BNK | 24494 | 1869 | whole_blood |
BNK_spBM1_L1_bar26 | BNK | 61980 | 2051 | whole_blood |
BNK_spBM1_L1_bar27 | BNK | 124382 | 3872 | whole_blood |
BNK_spBM1_L1_bar28 | BNK | 8144 | 1475 | whole_blood |
BNK_spBM1_L1_bar29 | BNK | 53612 | 2086 | whole_blood |
BNK_spBM1_L1_bar30 | BNK | 33582 | 2038 | whole_blood |
Do basic quality control. STRT-Seq is quite different from 10x and has a lot more detected genes per cell. Also, for some reason we don't have the MT genes in the quantified matrix of the whole blood dataset. That's unfortunate, but not critical.
srat_wb <- SetIdent(srat_wb,value = "orig.ident")
srat_wb[["percent.mt"]] <- PercentageFeatureSet(srat_wb, pattern = "^MT-")
srat_wb[["percent.rbp"]] <- PercentageFeatureSet(srat_wb, pattern = "^RP[SL]")
srat_3p[["percent.mt"]] <- PercentageFeatureSet(srat_3p, pattern = "^MT-")
srat_3p[["percent.rbp"]] <- PercentageFeatureSet(srat_3p, pattern = "^RP[SL]")
VlnPlot(srat_wb, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
VlnPlot(srat_3p, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
Warning message in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents = idents, : “All cells have the same value of percent.mt.”
The annotation that was used to process the GEO whole blood dataset is quite different from the Cell Ranger GRCh38-2020A. Let's see how many common genes are there:
table(rownames(srat_3p) %in% rownames(srat_wb))
common_genes <- rownames(srat_3p)[rownames(srat_3p) %in% rownames(srat_wb)]
FALSE TRUE 18050 18551
Let's filter the cells with too high or too low number of genes, or too high MT gene content. Generally speaking, cleaner datasets (both in terms of ambient RNA and in terms of low-quality cells) integrate better.
Also, let's limit the individual matrices to common genes only:
srat_3p <- subset(srat_3p, subset = nFeature_RNA > 1000 & nFeature_RNA < 6000 & percent.mt < 15)
srat_wb <- subset(srat_wb, subset = nFeature_RNA > 1000)
srat_3p <- srat_3p[rownames(srat_3p) %in% common_genes,]
srat_wb <- srat_wb[rownames(srat_wb) %in% common_genes,]
As previously for Seurat
v3, let's make a list and normalize/find HVG for each object:
wb_list <- list()
wb_list[["pbmc10k_3p"]] <- srat_3p
wb_list[["whole_blood"]] <- srat_wb
for (i in 1:length(wb_list)) {
wb_list[[i]] <- NormalizeData(wb_list[[i]], verbose = F)
wb_list[[i]] <- FindVariableFeatures(wb_list[[i]], selection.method = "vst", nfeatures = 2000, verbose = F)
}
Here we actually do the integration. Seurat 3 does it in two steps.
wb_anchors <- FindIntegrationAnchors(object.list = wb_list, dims = 1:30)
wb_seurat <- IntegrateData(anchorset = wb_anchors, dims = 1:30)
Computing 2000 integration features Scaling features for provided objects Finding all pairwise anchors Running CCA Merging objects Finding neighborhoods Finding anchors Found 11508 anchors Filtering anchors Retained 3175 anchors Merging dataset 2 into 1 Extracting anchors for merged samples Finding integration vectors Finding integration vector weights Integrating data
rm(wb_list)
rm(wb_anchors)
Let's do the basic processing and visualization of the uncorrected dataset:
DefaultAssay(wb_seurat) <- "RNA"
wb_seurat <- NormalizeData(wb_seurat, verbose = F)
wb_seurat <- FindVariableFeatures(wb_seurat, selection.method = "vst", nfeatures = 2000, verbose = F)
wb_seurat <- ScaleData(wb_seurat, verbose = F)
wb_seurat <- RunPCA(wb_seurat, npcs = 30, verbose = F)
wb_seurat <- RunUMAP(wb_seurat, reduction = "pca", dims = 1:30, verbose = F)
DimPlot(wb_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, before integration")
DimPlot(wb_seurat, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Warning message: “The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation' This message will be shown once per session”
Now, let's take a look at the integrated dataset (it's already normalized and HVGs are selected):
DefaultAssay(wb_seurat) <- "integrated"
wb_seurat <- ScaleData(wb_seurat, verbose = F)
wb_seurat <- RunPCA(wb_seurat, npcs = 30, verbose = F)
wb_seurat <- RunUMAP(wb_seurat, reduction = "pca", dims = 1:30, verbose = F)
DimPlot(wb_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and white blood cells, after integration (Seurat 3)")
Let's look at some markers:
FeaturePlot(wb_seurat,c("MS4A1","LYZ","NKG7","PPBP","LTF","HBA1","FCER1A","IL7R","FCGR3B")) & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale.
From the plot we can see that there are some significant cell types that are absent from PBMC dataset, but exist in the whole blood dataset. LTF gene is the most prominent marker of neutrophils, and HBA1 is a hemoglobin gene expressed in erythrocytes.
Now let's cluster the integrated matrix and look how clusters are distributed between the two sets:
wb_seurat <- FindNeighbors(wb_seurat, dims = 1:30, k.param = 10, verbose = F)
wb_seurat <- FindClusters(wb_seurat, verbose = F)
DimPlot(wb_seurat,label = T) + NoLegend()
Cluster composition shows many clusters unique to the whole blood dataset:
count_table <- table(wb_seurat@meta.data$seurat_clusters, wb_seurat@meta.data$orig.ident)
count_table
pbmc10k_3p whole_blood 0 1564 90 1 1387 211 2 1222 113 3 1217 111 4 1066 107 5 342 442 6 9 772 7 448 232 8 185 410 9 1 560 10 362 193 11 0 521 12 0 490 13 324 158 14 362 83 15 0 445 16 290 124 17 12 367 18 0 371 19 8 355 20 0 334 21 11 313 22 312 11 23 0 209 24 19 179 25 0 188 26 99 19 27 0 107 28 0 70 29 0 58
plot_integrated_clusters(wb_seurat)
Using cluster as id variables
We can take advantage of the metadata that was present in GSE149938:
meta <- wb_seurat[[]]
table(meta[meta$seurat_clusters == '5',]$cell_type) ## erythrocytes
table(meta[meta$seurat_clusters == '20',]$cell_type) ## neutrophils
table(meta[meta$seurat_clusters == '24',]$cell_type) ## plasma B cells
ery immB memB naiB preB regB 2 14 2 315 1 108
CLP immB kineNK matureN metaN MLP myeN plasma proN toxiNK 1 1 5 98 101 1 104 2 16 5
ery naiB plasma 12 1 166
rm(wb_seurat)
Similarly to the previous approaches, let's make a merged Seurat
dataset, normalize and process it:
wb_harmony <- merge(srat_3p,srat_wb)
wb_harmony <- NormalizeData(wb_harmony, verbose = F)
wb_harmony <- FindVariableFeatures(wb_harmony, selection.method = "vst", nfeatures = 2000, verbose = F)
wb_harmony <- ScaleData(wb_harmony, verbose = F)
wb_harmony <- RunPCA(wb_harmony, npcs = 30, verbose = F)
wb_harmony <- RunUMAP(wb_harmony, reduction = "pca", dims = 1:30, verbose = F)
We can take a look at the PCA plot for a change, as well as distributions along the first principal component:
p1 <- DimPlot(object = wb_harmony, reduction = "pca", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = wb_harmony, features = "PC_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)
UMAP also shows clear differences between the datasets (we saw this plot above already):
DimPlot(wb_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, before integration")
Let's run harmony
using a simple wrapper named RunHarmony
from SeuratWrappers
library:
wb_harmony <- wb_harmony %>% RunHarmony("orig.ident", plot_convergence = T)
Harmony 1/10 Harmony 2/10 Harmony 3/10 Harmony converged after 3 iterations Warning message: “Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity”
This generates the embeddings that we shall later use for all downstream analysis.
harmony_embeddings <- Embeddings(wb_harmony, 'harmony')
harmony_embeddings[1:5, 1:5]
harmony_1 | harmony_2 | harmony_3 | harmony_4 | harmony_5 | |
---|---|---|---|---|---|
AAACCCACATAACTCG-1 | 3.2402720 | 0.5517086 | -4.014592 | -0.2490492 | -0.4555420 |
AAACCCACATGTAACC-1 | 4.3992278 | -0.2653914 | 8.905441 | 5.7085193 | -2.6650973 |
AAACCCAGTGAGTCAG-1 | 3.6924188 | 1.4199533 | -2.323807 | -1.7352863 | 1.2648235 |
AAACGAACAGTCAGTT-1 | -0.8957237 | -3.2318868 | -2.398822 | -0.1342351 | -1.9238917 |
AAACGAACATTCGGGC-1 | 4.6218704 | 1.6066620 | 1.914114 | -1.7354947 | 0.8168312 |
Corrected PCA and distribution:
p1 <- DimPlot(object = wb_harmony, reduction = "harmony", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = wb_harmony, features = "harmony_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)
Run UMAP and perform Louvain clustering:
wb_harmony <- wb_harmony %>%
RunUMAP(reduction = "harmony", dims = 1:30, verbose = F) %>%
FindNeighbors(reduction = "harmony", k.param = 10, dims = 1:30) %>%
FindClusters() %>%
identity()
Computing nearest neighbor graph Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck Number of nodes: 16883 Number of edges: 275342 Running Louvain algorithm... Maximum modularity in 10 random starts: 0.9231 Number of communities: 29 Elapsed time: 1 seconds
wb_harmony <- SetIdent(wb_harmony,value = "orig.ident")
DimPlot(wb_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, after integration (Harmony)")
DimPlot(wb_harmony, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Corrected results for this dataset appear to be very similar to Seurat v3:
wb_harmony <- SetIdent(wb_harmony,value = "seurat_clusters")
DimPlot(wb_harmony,label = T) + NoLegend()
More detailed cluster examination also seems to confirm this:
plot_integrated_clusters(wb_harmony)
Using cluster as id variables
rm(wb_harmony)
Finally, let's do data integration with LIGER
. This step takes several minutes to run:
wb_liger <- merge(srat_3p,srat_wb)
wb_liger <- NormalizeData(wb_liger)
wb_liger <- FindVariableFeatures(wb_liger)
wb_liger <- ScaleData(wb_liger, split.by = "orig.ident", do.center = F)
wb_liger <- RunOptimizeALS(wb_liger, k = 30, lambda = 5, split.by = "orig.ident")
wb_liger <- RunQuantileNorm(wb_liger, split.by = "orig.ident")
Scaling data matrix Scaling data from split pbmc10k_3p Scaling data from split whole_blood
|======================================================================| 100% Finished in 3.81193 mins, 30 iterations. Max iterations set: 30. Final objective delta: 4.18028e-05. Best results with seed 1.
Warning message: “No columnames present in cell embeddings, setting to 'riNMF_1:30'”
We will then perform Louvain clustering (FindNeighbors
and FindClusters
) with the settings similar to what we have been using before:
wb_liger <- FindNeighbors(wb_liger,reduction = "iNMF",k.param = 10,dims = 1:30)
wb_liger <- FindClusters(wb_liger)
Computing nearest neighbor graph Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck Number of nodes: 16883 Number of edges: 276592 Running Louvain algorithm... Maximum modularity in 10 random starts: 0.9233 Number of communities: 38 Elapsed time: 0 seconds
Let's look at the corrected UMAP plot in a couple of different ways:
wb_liger <- RunUMAP(wb_liger, dims = 1:ncol(wb_liger[["iNMF"]]), reduction = "iNMF",verbose = F)
wb_liger <- SetIdent(wb_liger,value = "orig.ident")
DimPlot(wb_liger,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, after integration (LIGER)")
DimPlot(wb_liger, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Finally, a look at distribution of datasets per cluster:
plot_integrated_clusters(wb_liger)
Using cluster as id variables
rm(wb_liger)
rm(srat_wb)
rm(srat_3p)
sessionInfo()