library(knitr)
opts_chunk$set(cache = TRUE, out.width='100%', fig.align = 'center')
The code below will install some packages if you don't have them already.
library(Seurat)
library(SeuratWrappers)
library(DropletUtils)
library(ggplot2)
library(RColorBrewer)
library(knitr)
library(dplyr)
library(harmony)
library(rliger)
library(glmGamPoi)
library(patchwork)
library(cowplot)
Warning message:
“package ‘Seurat’ was built under R version 4.2.2”
Attaching SeuratObject
Warning message:
“package ‘DropletUtils’ was built under R version 4.2.2”
Loading required package: SingleCellExperiment
Warning message:
“package ‘SingleCellExperiment’ was built under R version 4.2.2”
Loading required package: SummarizedExperiment
Warning message:
“package ‘SummarizedExperiment’ was built under R version 4.2.2”
Loading required package: MatrixGenerics
Loading required package: matrixStats
Warning message:
“package ‘matrixStats’ was built under R version 4.2.2”
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Warning message:
“package ‘GenomicRanges’ was built under R version 4.2.2”
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Warning message:
“package ‘GenomeInfoDb’ was built under R version 4.2.2”
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
Attaching package: ‘SummarizedExperiment’
The following object is masked from ‘package:SeuratObject’:
Assays
The following object is masked from ‘package:Seurat’:
Assays
Warning message:
“package ‘ggplot2’ was built under R version 4.2.2”
Warning message:
“package ‘knitr’ was built under R version 4.2.2”
Warning message:
“package ‘dplyr’ was built under R version 4.2.2”
Attaching package: ‘dplyr’
The following object is masked from ‘package:Biobase’:
combine
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice, union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff, setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following object is masked from ‘package:matrixStats’:
count
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Loading required package: Rcpp
Warning message:
“package ‘Rcpp’ was built under R version 4.2.2”
Loading required package: cowplot
Loading required package: Matrix
Warning message:
“package ‘Matrix’ was built under R version 4.2.2”
Attaching package: ‘Matrix’
The following object is masked from ‘package:S4Vectors’:
expand
Loading required package: patchwork
Attaching package: ‘patchwork’
The following object is masked from ‘package:cowplot’:
align_plots
Attaching package: ‘rliger’
The following object is masked from ‘package:BiocGenerics’:
normalize
Warning message:
“package ‘glmGamPoi’ was built under R version 4.2.2”
Attaching package: ‘glmGamPoi’
The following object is masked from ‘package:dplyr’:
vars
The following object is masked from ‘package:ggplot2’:
vars
Let's also load a custom function, plot_integrated_clusters, that is stored in /data as custom_seurat_functions.R:
source("./data/cellgeni/custom_seurat_functions.R")
As more and more scRNA-seq datasets become available, carrying merged_seurat comparisons between them is key. There are two main approaches to comparing scRNASeq datasets. The first approach is "label-centric" which is focused on trying to identify equivalent cell-types/states across datasets by comparing individual cells or groups of cells. The other approach is "cross-dataset normalization" which attempts to computationally remove experiment-specific technical/biological effects so that data from multiple experiments can be combined and jointly analyzed.
We will illustrate this process using two very different datasets: 1) 10k PBMC (3' v3 chemistry) downloaded from 10x Genomics website; 2) major cell populations isolated from whole human blood using FACS, and sequenced using STRT-seq (GSE149938).
Although we already have all the necessary files in our /data folder, we can download the necessary files from GEO database:
Let's read the processed file available via GEO, and the 10x file we've processed using SoupX (see another notebook in this repository):
umi_gz <- gzfile("./data/cellgeni/GSE149938_umi_matrix.csv.gz",'rt')
umi <- read.csv(umi_gz,check.names = F,quote = "")
matrix_3p <- Read10X("./data/cellgeni/soupX_pbmc10k_filt")
Next, let's make Seurat objects and re-define some of the metadata columns (GEO dataset simply puts the cell type into the orig.ident slot, which will interfere with what we want to do next):
srat_wb <- CreateSeuratObject(t(umi),project = "whole_blood")
srat_3p <- CreateSeuratObject(matrix_3p,project = "pbmc10k_3p")
rm(umi_gz)
rm(umi)
rm(matrix_3p)
Warning message: “Non-unique features (rownames) present in the input matrix, making unique”
colnames(srat_wb@meta.data)[1] <- "cell_type"
srat_wb@meta.data$orig.ident <- "whole_blood"
srat_wb@meta.data$orig.ident <- as.factor(srat_wb@meta.data$orig.ident)
head(srat_wb[[]])
| cell_type | nCount_RNA | nFeature_RNA | orig.ident | |
|---|---|---|---|---|
| <fct> | <dbl> | <int> | <fct> | |
| BNK_spBM1_L1_bar25 | BNK | 24494 | 1869 | whole_blood |
| BNK_spBM1_L1_bar26 | BNK | 61980 | 2051 | whole_blood |
| BNK_spBM1_L1_bar27 | BNK | 124382 | 3872 | whole_blood |
| BNK_spBM1_L1_bar28 | BNK | 8144 | 1475 | whole_blood |
| BNK_spBM1_L1_bar29 | BNK | 53612 | 2086 | whole_blood |
| BNK_spBM1_L1_bar30 | BNK | 33582 | 2038 | whole_blood |
Do basic quality control. STRT-Seq is quite different from 10x and has a lot more detected genes per cell. Also, for some reason we don't have the MT genes in the quantified matrix of the whole blood dataset. That's unfortunate, but not critical.
srat_wb <- SetIdent(srat_wb,value = "orig.ident")
srat_wb[["percent.mt"]] <- PercentageFeatureSet(srat_wb, pattern = "^MT-")
srat_wb[["percent.rbp"]] <- PercentageFeatureSet(srat_wb, pattern = "^RP[SL]")
srat_3p[["percent.mt"]] <- PercentageFeatureSet(srat_3p, pattern = "^MT-")
srat_3p[["percent.rbp"]] <- PercentageFeatureSet(srat_3p, pattern = "^RP[SL]")
VlnPlot(srat_wb, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
VlnPlot(srat_3p, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
Warning message in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents = idents, : “All cells have the same value of percent.mt.”
The annotation that was used to process the GEO whole blood dataset is quite different from the Cell Ranger GRCh38-2020A. Let's see how many common genes are there:
table(rownames(srat_3p) %in% rownames(srat_wb))
common_genes <- rownames(srat_3p)[rownames(srat_3p) %in% rownames(srat_wb)]
FALSE TRUE 18050 18551
Let's filter the cells with too high or too low number of genes, or too high MT gene content. Generally speaking, cleaner datasets (both in terms of ambient RNA and in terms of low-quality cells) integrate better.
Also, let's limit the individual matrices to common genes only:
srat_3p <- subset(srat_3p, subset = nFeature_RNA > 1000 & nFeature_RNA < 6000 & percent.mt < 15)
srat_wb <- subset(srat_wb, subset = nFeature_RNA > 1000)
srat_3p <- srat_3p[rownames(srat_3p) %in% common_genes,]
srat_wb <- srat_wb[rownames(srat_wb) %in% common_genes,]
As previously for Seurat v3, let's make a list and normalize/find HVG for each object:
wb_list <- list()
wb_list[["pbmc10k_3p"]] <- srat_3p
wb_list[["whole_blood"]] <- srat_wb
for (i in 1:length(wb_list)) {
wb_list[[i]] <- NormalizeData(wb_list[[i]], verbose = F)
wb_list[[i]] <- FindVariableFeatures(wb_list[[i]], selection.method = "vst", nfeatures = 2000, verbose = F)
}
Here we actually do the integration. Seurat 3 does it in two steps.
wb_anchors <- FindIntegrationAnchors(object.list = wb_list, dims = 1:30)
wb_seurat <- IntegrateData(anchorset = wb_anchors, dims = 1:30)
Computing 2000 integration features Scaling features for provided objects Finding all pairwise anchors Running CCA Merging objects Finding neighborhoods Finding anchors Found 11508 anchors Filtering anchors Retained 3175 anchors Merging dataset 2 into 1 Extracting anchors for merged samples Finding integration vectors Finding integration vector weights Integrating data
rm(wb_list)
rm(wb_anchors)
Let's do the basic processing and visualization of the uncorrected dataset:
DefaultAssay(wb_seurat) <- "RNA"
wb_seurat <- NormalizeData(wb_seurat, verbose = F)
wb_seurat <- FindVariableFeatures(wb_seurat, selection.method = "vst", nfeatures = 2000, verbose = F)
wb_seurat <- ScaleData(wb_seurat, verbose = F)
wb_seurat <- RunPCA(wb_seurat, npcs = 30, verbose = F)
wb_seurat <- RunUMAP(wb_seurat, reduction = "pca", dims = 1:30, verbose = F)
DimPlot(wb_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, before integration")
DimPlot(wb_seurat, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Warning message: “The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation' This message will be shown once per session”
Now, let's take a look at the integrated dataset (it's already normalized and HVGs are selected):
DefaultAssay(wb_seurat) <- "integrated"
wb_seurat <- ScaleData(wb_seurat, verbose = F)
wb_seurat <- RunPCA(wb_seurat, npcs = 30, verbose = F)
wb_seurat <- RunUMAP(wb_seurat, reduction = "pca", dims = 1:30, verbose = F)
DimPlot(wb_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and white blood cells, after integration (Seurat 3)")
Let's look at some markers:
FeaturePlot(wb_seurat,c("MS4A1","LYZ","NKG7","PPBP","LTF","HBA1","FCER1A","IL7R","FCGR3B")) & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale. Scale for colour is already present. Adding another scale for colour, which will replace the existing scale.
From the plot we can see that there are some significant cell types that are absent from PBMC dataset, but exist in the whole blood dataset. LTF gene is the most prominent marker of neutrophils, and HBA1 is a hemoglobin gene expressed in erythrocytes.
Now let's cluster the integrated matrix and look how clusters are distributed between the two sets:
wb_seurat <- FindNeighbors(wb_seurat, dims = 1:30, k.param = 10, verbose = F)
wb_seurat <- FindClusters(wb_seurat, verbose = F)
DimPlot(wb_seurat,label = T) + NoLegend()
Cluster composition shows many clusters unique to the whole blood dataset:
count_table <- table(wb_seurat@meta.data$seurat_clusters, wb_seurat@meta.data$orig.ident)
count_table
pbmc10k_3p whole_blood
0 1564 90
1 1387 211
2 1222 113
3 1217 111
4 1066 107
5 342 442
6 9 772
7 448 232
8 185 410
9 1 560
10 362 193
11 0 521
12 0 490
13 324 158
14 362 83
15 0 445
16 290 124
17 12 367
18 0 371
19 8 355
20 0 334
21 11 313
22 312 11
23 0 209
24 19 179
25 0 188
26 99 19
27 0 107
28 0 70
29 0 58
plot_integrated_clusters(wb_seurat)
Using cluster as id variables
We can take advantage of the metadata that was present in GSE149938:
meta <- wb_seurat[[]]
table(meta[meta$seurat_clusters == '5',]$cell_type) ## erythrocytes
table(meta[meta$seurat_clusters == '20',]$cell_type) ## neutrophils
table(meta[meta$seurat_clusters == '24',]$cell_type) ## plasma B cells
ery immB memB naiB preB regB 2 14 2 315 1 108
CLP immB kineNK matureN metaN MLP myeN plasma proN toxiNK
1 1 5 98 101 1 104 2 16 5
ery naiB plasma
12 1 166
rm(wb_seurat)
Similarly to the previous approaches, let's make a merged Seurat dataset, normalize and process it:
wb_harmony <- merge(srat_3p,srat_wb)
wb_harmony <- NormalizeData(wb_harmony, verbose = F)
wb_harmony <- FindVariableFeatures(wb_harmony, selection.method = "vst", nfeatures = 2000, verbose = F)
wb_harmony <- ScaleData(wb_harmony, verbose = F)
wb_harmony <- RunPCA(wb_harmony, npcs = 30, verbose = F)
wb_harmony <- RunUMAP(wb_harmony, reduction = "pca", dims = 1:30, verbose = F)
We can take a look at the PCA plot for a change, as well as distributions along the first principal component:
p1 <- DimPlot(object = wb_harmony, reduction = "pca", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = wb_harmony, features = "PC_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)
UMAP also shows clear differences between the datasets (we saw this plot above already):
DimPlot(wb_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, before integration")
Let's run harmony using a simple wrapper named RunHarmony from SeuratWrappers library:
wb_harmony <- wb_harmony %>% RunHarmony("orig.ident", plot_convergence = T)
Harmony 1/10 Harmony 2/10 Harmony 3/10 Harmony converged after 3 iterations Warning message: “Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity”
This generates the embeddings that we shall later use for all downstream analysis.
harmony_embeddings <- Embeddings(wb_harmony, 'harmony')
harmony_embeddings[1:5, 1:5]
| harmony_1 | harmony_2 | harmony_3 | harmony_4 | harmony_5 | |
|---|---|---|---|---|---|
| AAACCCACATAACTCG-1 | 3.2402720 | 0.5517086 | -4.014592 | -0.2490492 | -0.4555420 |
| AAACCCACATGTAACC-1 | 4.3992278 | -0.2653914 | 8.905441 | 5.7085193 | -2.6650973 |
| AAACCCAGTGAGTCAG-1 | 3.6924188 | 1.4199533 | -2.323807 | -1.7352863 | 1.2648235 |
| AAACGAACAGTCAGTT-1 | -0.8957237 | -3.2318868 | -2.398822 | -0.1342351 | -1.9238917 |
| AAACGAACATTCGGGC-1 | 4.6218704 | 1.6066620 | 1.914114 | -1.7354947 | 0.8168312 |
Corrected PCA and distribution:
p1 <- DimPlot(object = wb_harmony, reduction = "harmony", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = wb_harmony, features = "harmony_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)
Run UMAP and perform Louvain clustering:
wb_harmony <- wb_harmony %>%
RunUMAP(reduction = "harmony", dims = 1:30, verbose = F) %>%
FindNeighbors(reduction = "harmony", k.param = 10, dims = 1:30) %>%
FindClusters() %>%
identity()
Computing nearest neighbor graph Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck Number of nodes: 16883 Number of edges: 275342 Running Louvain algorithm... Maximum modularity in 10 random starts: 0.9231 Number of communities: 29 Elapsed time: 1 seconds
wb_harmony <- SetIdent(wb_harmony,value = "orig.ident")
DimPlot(wb_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, after integration (Harmony)")
DimPlot(wb_harmony, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Corrected results for this dataset appear to be very similar to Seurat v3:
wb_harmony <- SetIdent(wb_harmony,value = "seurat_clusters")
DimPlot(wb_harmony,label = T) + NoLegend()
More detailed cluster examination also seems to confirm this:
plot_integrated_clusters(wb_harmony)
Using cluster as id variables
rm(wb_harmony)
Finally, let's do data integration with LIGER. This step takes several minutes to run:
wb_liger <- merge(srat_3p,srat_wb)
wb_liger <- NormalizeData(wb_liger)
wb_liger <- FindVariableFeatures(wb_liger)
wb_liger <- ScaleData(wb_liger, split.by = "orig.ident", do.center = F)
wb_liger <- RunOptimizeALS(wb_liger, k = 30, lambda = 5, split.by = "orig.ident")
wb_liger <- RunQuantileNorm(wb_liger, split.by = "orig.ident")
Scaling data matrix Scaling data from split pbmc10k_3p Scaling data from split whole_blood
|======================================================================| 100% Finished in 3.81193 mins, 30 iterations. Max iterations set: 30. Final objective delta: 4.18028e-05. Best results with seed 1.
Warning message: “No columnames present in cell embeddings, setting to 'riNMF_1:30'”
We will then perform Louvain clustering (FindNeighbors and FindClusters) with the settings similar to what we have been using before:
wb_liger <- FindNeighbors(wb_liger,reduction = "iNMF",k.param = 10,dims = 1:30)
wb_liger <- FindClusters(wb_liger)
Computing nearest neighbor graph Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck Number of nodes: 16883 Number of edges: 276592 Running Louvain algorithm... Maximum modularity in 10 random starts: 0.9233 Number of communities: 38 Elapsed time: 0 seconds
Let's look at the corrected UMAP plot in a couple of different ways:
wb_liger <- RunUMAP(wb_liger, dims = 1:ncol(wb_liger[["iNMF"]]), reduction = "iNMF",verbose = F)
wb_liger <- SetIdent(wb_liger,value = "orig.ident")
DimPlot(wb_liger,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, after integration (LIGER)")
DimPlot(wb_liger, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Finally, a look at distribution of datasets per cluster:
plot_integrated_clusters(wb_liger)
Using cluster as id variables
rm(wb_liger)
rm(srat_wb)
rm(srat_3p)
sessionInfo()