scRNA-seq Dataset Integration using R-based tools¶

In [1]:
library(knitr)
opts_chunk$set(cache = TRUE, out.width='100%', fig.align = 'center')

The code below will install some packages if you don't have them already.

In [2]:
library(Seurat)
library(SeuratWrappers)
library(DropletUtils)
library(ggplot2)
library(RColorBrewer)
library(knitr)
library(dplyr)
library(harmony)
library(rliger)
library(glmGamPoi)
library(patchwork)
library(cowplot)
Warning message:
“package ‘Seurat’ was built under R version 4.2.2”
Attaching SeuratObject

Warning message:
“package ‘DropletUtils’ was built under R version 4.2.2”
Loading required package: SingleCellExperiment

Warning message:
“package ‘SingleCellExperiment’ was built under R version 4.2.2”
Loading required package: SummarizedExperiment

Warning message:
“package ‘SummarizedExperiment’ was built under R version 4.2.2”
Loading required package: MatrixGenerics

Loading required package: matrixStats

Warning message:
“package ‘matrixStats’ was built under R version 4.2.2”

Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars


Loading required package: GenomicRanges

Warning message:
“package ‘GenomicRanges’ was built under R version 4.2.2”
Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following objects are masked from ‘package:base’:

    expand.grid, I, unname


Loading required package: IRanges

Loading required package: GenomeInfoDb

Warning message:
“package ‘GenomeInfoDb’ was built under R version 4.2.2”
Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘Biobase’


The following object is masked from ‘package:MatrixGenerics’:

    rowMedians


The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians



Attaching package: ‘SummarizedExperiment’


The following object is masked from ‘package:SeuratObject’:

    Assays


The following object is masked from ‘package:Seurat’:

    Assays


Warning message:
“package ‘ggplot2’ was built under R version 4.2.2”
Warning message:
“package ‘knitr’ was built under R version 4.2.2”
Warning message:
“package ‘dplyr’ was built under R version 4.2.2”

Attaching package: ‘dplyr’


The following object is masked from ‘package:Biobase’:

    combine


The following objects are masked from ‘package:GenomicRanges’:

    intersect, setdiff, union


The following object is masked from ‘package:GenomeInfoDb’:

    intersect


The following objects are masked from ‘package:IRanges’:

    collapse, desc, intersect, setdiff, slice, union


The following objects are masked from ‘package:S4Vectors’:

    first, intersect, rename, setdiff, setequal, union


The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union


The following object is masked from ‘package:matrixStats’:

    count


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: Rcpp

Warning message:
“package ‘Rcpp’ was built under R version 4.2.2”
Loading required package: cowplot

Loading required package: Matrix

Warning message:
“package ‘Matrix’ was built under R version 4.2.2”

Attaching package: ‘Matrix’


The following object is masked from ‘package:S4Vectors’:

    expand


Loading required package: patchwork


Attaching package: ‘patchwork’


The following object is masked from ‘package:cowplot’:

    align_plots



Attaching package: ‘rliger’


The following object is masked from ‘package:BiocGenerics’:

    normalize


Warning message:
“package ‘glmGamPoi’ was built under R version 4.2.2”

Attaching package: ‘glmGamPoi’


The following object is masked from ‘package:dplyr’:

    vars


The following object is masked from ‘package:ggplot2’:

    vars


Let's also load a custom function, plot_integrated_clusters, that is stored in /data as custom_seurat_functions.R:

In [3]:
source("./data/cellgeni/custom_seurat_functions.R")

Introduction¶

As more and more scRNA-seq datasets become available, carrying merged_seurat comparisons between them is key. There are two main approaches to comparing scRNASeq datasets. The first approach is "label-centric" which is focused on trying to identify equivalent cell-types/states across datasets by comparing individual cells or groups of cells. The other approach is "cross-dataset normalization" which attempts to computationally remove experiment-specific technical/biological effects so that data from multiple experiments can be combined and jointly analyzed.

We will illustrate this process using two very different datasets: 1) 10k PBMC (3' v3 chemistry) downloaded from 10x Genomics website; 2) major cell populations isolated from whole human blood using FACS, and sequenced using STRT-seq (GSE149938).

Although we already have all the necessary files in our /data folder, we can download the necessary files from GEO database:

Let's read the processed file available via GEO, and the 10x file we've processed using SoupX (see another notebook in this repository):

In [4]:
umi_gz <- gzfile("./data/cellgeni/GSE149938_umi_matrix.csv.gz",'rt')
umi <- read.csv(umi_gz,check.names = F,quote = "")
matrix_3p    <- Read10X("./data/cellgeni/soupX_pbmc10k_filt")

Next, let's make Seurat objects and re-define some of the metadata columns (GEO dataset simply puts the cell type into the orig.ident slot, which will interfere with what we want to do next):

In [5]:
srat_wb <- CreateSeuratObject(t(umi),project = "whole_blood")
srat_3p <- CreateSeuratObject(matrix_3p,project = "pbmc10k_3p")
rm(umi_gz)
rm(umi)
rm(matrix_3p)
Warning message:
“Non-unique features (rownames) present in the input matrix, making unique”
In [6]:
colnames(srat_wb@meta.data)[1] <- "cell_type"
srat_wb@meta.data$orig.ident <- "whole_blood"
srat_wb@meta.data$orig.ident <- as.factor(srat_wb@meta.data$orig.ident)
head(srat_wb[[]])
A data.frame: 6 × 4
cell_typenCount_RNAnFeature_RNAorig.ident
<fct><dbl><int><fct>
BNK_spBM1_L1_bar25BNK 244941869whole_blood
BNK_spBM1_L1_bar26BNK 619802051whole_blood
BNK_spBM1_L1_bar27BNK1243823872whole_blood
BNK_spBM1_L1_bar28BNK 81441475whole_blood
BNK_spBM1_L1_bar29BNK 536122086whole_blood
BNK_spBM1_L1_bar30BNK 335822038whole_blood

Do basic quality control. STRT-Seq is quite different from 10x and has a lot more detected genes per cell. Also, for some reason we don't have the MT genes in the quantified matrix of the whole blood dataset. That's unfortunate, but not critical.

In [7]:
srat_wb <- SetIdent(srat_wb,value = "orig.ident")

srat_wb[["percent.mt"]] <- PercentageFeatureSet(srat_wb, pattern = "^MT-")
srat_wb[["percent.rbp"]] <- PercentageFeatureSet(srat_wb, pattern = "^RP[SL]")
srat_3p[["percent.mt"]] <- PercentageFeatureSet(srat_3p, pattern = "^MT-")
srat_3p[["percent.rbp"]] <- PercentageFeatureSet(srat_3p, pattern = "^RP[SL]")

VlnPlot(srat_wb, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
VlnPlot(srat_3p, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
Warning message in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents = idents, :
“All cells have the same value of percent.mt.”

The annotation that was used to process the GEO whole blood dataset is quite different from the Cell Ranger GRCh38-2020A. Let's see how many common genes are there:

In [8]:
table(rownames(srat_3p) %in% rownames(srat_wb))
common_genes <- rownames(srat_3p)[rownames(srat_3p) %in% rownames(srat_wb)]
FALSE  TRUE 
18050 18551 

Let's filter the cells with too high or too low number of genes, or too high MT gene content. Generally speaking, cleaner datasets (both in terms of ambient RNA and in terms of low-quality cells) integrate better.

Also, let's limit the individual matrices to common genes only:

In [9]:
srat_3p <- subset(srat_3p, subset = nFeature_RNA > 1000 & nFeature_RNA < 6000 & percent.mt < 15)
srat_wb <- subset(srat_wb, subset = nFeature_RNA > 1000)

srat_3p <- srat_3p[rownames(srat_3p) %in% common_genes,]
srat_wb <- srat_wb[rownames(srat_wb) %in% common_genes,]

As previously for Seurat v3, let's make a list and normalize/find HVG for each object:

In [10]:
wb_list <- list()
wb_list[["pbmc10k_3p"]]   <- srat_3p
wb_list[["whole_blood"]]  <- srat_wb

for (i in 1:length(wb_list)) {
  wb_list[[i]] <- NormalizeData(wb_list[[i]], verbose = F)
  wb_list[[i]] <- FindVariableFeatures(wb_list[[i]], selection.method = "vst", nfeatures = 2000, verbose = F)
}

Here we actually do the integration. Seurat 3 does it in two steps.

In [11]:
wb_anchors <- FindIntegrationAnchors(object.list = wb_list, dims = 1:30)
wb_seurat  <- IntegrateData(anchorset = wb_anchors, dims = 1:30)
Computing 2000 integration features

Scaling features for provided objects

Finding all pairwise anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

	Found 11508 anchors

Filtering anchors

	Retained 3175 anchors

Merging dataset 2 into 1

Extracting anchors for merged samples

Finding integration vectors

Finding integration vector weights

Integrating data

In [12]:
rm(wb_list)
rm(wb_anchors)

Let's do the basic processing and visualization of the uncorrected dataset:

In [13]:
DefaultAssay(wb_seurat) <- "RNA"
wb_seurat <- NormalizeData(wb_seurat, verbose = F)
wb_seurat <- FindVariableFeatures(wb_seurat, selection.method = "vst", nfeatures = 2000, verbose = F)
wb_seurat <- ScaleData(wb_seurat, verbose = F)
wb_seurat <- RunPCA(wb_seurat, npcs = 30, verbose = F)
wb_seurat <- RunUMAP(wb_seurat, reduction = "pca", dims = 1:30, verbose = F)

DimPlot(wb_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, before integration")
DimPlot(wb_seurat, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”

Now, let's take a look at the integrated dataset (it's already normalized and HVGs are selected):

In [14]:
DefaultAssay(wb_seurat) <- "integrated"
wb_seurat <- ScaleData(wb_seurat, verbose = F)
wb_seurat <- RunPCA(wb_seurat, npcs = 30, verbose = F)
wb_seurat <- RunUMAP(wb_seurat, reduction = "pca", dims = 1:30, verbose = F)

DimPlot(wb_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and white blood cells, after integration (Seurat 3)")

Let's look at some markers:

In [15]:
FeaturePlot(wb_seurat,c("MS4A1","LYZ","NKG7","PPBP","LTF","HBA1","FCER1A","IL7R","FCGR3B")) & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

From the plot we can see that there are some significant cell types that are absent from PBMC dataset, but exist in the whole blood dataset. LTF gene is the most prominent marker of neutrophils, and HBA1 is a hemoglobin gene expressed in erythrocytes.

Now let's cluster the integrated matrix and look how clusters are distributed between the two sets:

In [16]:
wb_seurat <- FindNeighbors(wb_seurat, dims = 1:30, k.param = 10, verbose = F)
wb_seurat <- FindClusters(wb_seurat, verbose = F)
DimPlot(wb_seurat,label = T) + NoLegend()

Cluster composition shows many clusters unique to the whole blood dataset:

In [17]:
count_table <- table(wb_seurat@meta.data$seurat_clusters, wb_seurat@meta.data$orig.ident)
count_table
    
     pbmc10k_3p whole_blood
  0        1564          90
  1        1387         211
  2        1222         113
  3        1217         111
  4        1066         107
  5         342         442
  6           9         772
  7         448         232
  8         185         410
  9           1         560
  10        362         193
  11          0         521
  12          0         490
  13        324         158
  14        362          83
  15          0         445
  16        290         124
  17         12         367
  18          0         371
  19          8         355
  20          0         334
  21         11         313
  22        312          11
  23          0         209
  24         19         179
  25          0         188
  26         99          19
  27          0         107
  28          0          70
  29          0          58
In [18]:
plot_integrated_clusters(wb_seurat)
Using cluster as id variables

We can take advantage of the metadata that was present in GSE149938:

In [19]:
meta <- wb_seurat[[]]
table(meta[meta$seurat_clusters == '5',]$cell_type) ## erythrocytes
table(meta[meta$seurat_clusters == '20',]$cell_type) ## neutrophils
table(meta[meta$seurat_clusters == '24',]$cell_type) ## plasma B cells
 ery immB memB naiB preB regB 
   2   14    2  315    1  108 
    CLP    immB  kineNK matureN   metaN     MLP    myeN  plasma    proN  toxiNK 
      1       1       5      98     101       1     104       2      16       5 
   ery   naiB plasma 
    12      1    166 
In [20]:
rm(wb_seurat)

Harmony, 3' 10k PBMC cells and whole blood STRT-Seq¶

Similarly to the previous approaches, let's make a merged Seurat dataset, normalize and process it:

In [21]:
wb_harmony    <- merge(srat_3p,srat_wb)

wb_harmony <- NormalizeData(wb_harmony, verbose = F)
wb_harmony <- FindVariableFeatures(wb_harmony, selection.method = "vst", nfeatures = 2000, verbose = F)
wb_harmony <- ScaleData(wb_harmony, verbose = F)
wb_harmony <- RunPCA(wb_harmony, npcs = 30, verbose = F)
wb_harmony <- RunUMAP(wb_harmony, reduction = "pca", dims = 1:30, verbose = F)

We can take a look at the PCA plot for a change, as well as distributions along the first principal component:

In [22]:
p1 <- DimPlot(object = wb_harmony, reduction = "pca", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = wb_harmony, features = "PC_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)

UMAP also shows clear differences between the datasets (we saw this plot above already):

In [25]:
DimPlot(wb_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, before integration")

Let's run harmony using a simple wrapper named RunHarmony from SeuratWrappers library:

In [26]:
wb_harmony <- wb_harmony %>% RunHarmony("orig.ident", plot_convergence = T)
Harmony 1/10

Harmony 2/10

Harmony 3/10

Harmony converged after 3 iterations

Warning message:
“Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity”

This generates the embeddings that we shall later use for all downstream analysis.

In [27]:
harmony_embeddings <- Embeddings(wb_harmony, 'harmony')
harmony_embeddings[1:5, 1:5]
A matrix: 5 × 5 of type dbl
harmony_1harmony_2harmony_3harmony_4harmony_5
AAACCCACATAACTCG-1 3.2402720 0.5517086-4.014592-0.2490492-0.4555420
AAACCCACATGTAACC-1 4.3992278-0.2653914 8.905441 5.7085193-2.6650973
AAACCCAGTGAGTCAG-1 3.6924188 1.4199533-2.323807-1.7352863 1.2648235
AAACGAACAGTCAGTT-1-0.8957237-3.2318868-2.398822-0.1342351-1.9238917
AAACGAACATTCGGGC-1 4.6218704 1.6066620 1.914114-1.7354947 0.8168312

Corrected PCA and distribution:

In [28]:
p1 <- DimPlot(object = wb_harmony, reduction = "harmony", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = wb_harmony, features = "harmony_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)

Run UMAP and perform Louvain clustering:

In [29]:
wb_harmony <- wb_harmony %>%
  RunUMAP(reduction = "harmony", dims = 1:30, verbose = F) %>%
  FindNeighbors(reduction = "harmony", k.param = 10, dims = 1:30) %>%
  FindClusters() %>%
  identity()
Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 16883
Number of edges: 275342

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9231
Number of communities: 29
Elapsed time: 1 seconds
In [30]:
wb_harmony <- SetIdent(wb_harmony,value = "orig.ident")
DimPlot(wb_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and whole blood, after integration (Harmony)")
DimPlot(wb_harmony, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()

Corrected results for this dataset appear to be very similar to Seurat v3:

In [31]:
wb_harmony <- SetIdent(wb_harmony,value = "seurat_clusters")
DimPlot(wb_harmony,label = T) + NoLegend()

More detailed cluster examination also seems to confirm this:

In [32]:
plot_integrated_clusters(wb_harmony)
Using cluster as id variables

In [33]:
rm(wb_harmony)

LIGER, 3' 10k PBMC cells and whole blood STRT-Seq¶

Finally, let's do data integration with LIGER. This step takes several minutes to run:

In [34]:
wb_liger    <- merge(srat_3p,srat_wb)

wb_liger    <- NormalizeData(wb_liger)
wb_liger    <- FindVariableFeatures(wb_liger)
wb_liger    <- ScaleData(wb_liger, split.by = "orig.ident", do.center = F)
wb_liger    <- RunOptimizeALS(wb_liger, k = 30, lambda = 5, split.by = "orig.ident")
wb_liger    <- RunQuantileNorm(wb_liger, split.by = "orig.ident")
Scaling data matrix

Scaling data from split pbmc10k_3p

Scaling data from split whole_blood

  |======================================================================| 100%
Finished in 3.81193 mins, 30 iterations.
Max iterations set: 30.
Final objective delta: 4.18028e-05.
Best results with seed 1.
Warning message:
“No columnames present in cell embeddings, setting to 'riNMF_1:30'”

We will then perform Louvain clustering (FindNeighbors and FindClusters) with the settings similar to what we have been using before:

In [35]:
wb_liger    <- FindNeighbors(wb_liger,reduction = "iNMF",k.param = 10,dims = 1:30)
wb_liger    <- FindClusters(wb_liger)
Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 16883
Number of edges: 276592

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9233
Number of communities: 38
Elapsed time: 0 seconds

Let's look at the corrected UMAP plot in a couple of different ways:

In [36]:
wb_liger    <- RunUMAP(wb_liger, dims = 1:ncol(wb_liger[["iNMF"]]), reduction = "iNMF",verbose = F)
wb_liger <- SetIdent(wb_liger,value = "orig.ident")
DimPlot(wb_liger,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, after integration (LIGER)")
DimPlot(wb_liger, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()

Finally, a look at distribution of datasets per cluster:

In [37]:
plot_integrated_clusters(wb_liger)
Using cluster as id variables

In [38]:
rm(wb_liger)
rm(srat_wb)
rm(srat_3p)

sessionInfo()¶

View session info
In [39]:
sessionInfo()