This article is the second part of our hands-on tutorial with Seurat. If you haven’t worked through the first part, please start there.
In this session, we’ll engage in a hands-on experience with Seurat, an R package for scRNA-seq, and set the stage for initiating scRNA-seq. Since this post is based on Seurat’s hands-on guide, you’ll be able to grasp an overview of its usage. I encourage you to dive in!
macOS Monterey (12.4), Quad-Core Intel Core i7, Memory 32GB
Preparing for the Hands-On
We’ll proceed based on the assumption that you’ve executed the code introduced in the first half. If you haven’t done so, attempting the latter half of the hands-on may result in errors, so please be cautious.
install.packages("dplyr")
install.packages("Seurat")
install.packages("patchwork")
library(dplyr)
library(Seurat)
library(patchwork)
pbmc.data <- Read10X(data.dir = "./filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
Before Executing PCA
Before employing dimensionality reduction techniques like PCA, we’ll start with a preprocessing step using linear transformation. Please execute the following command.
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
The command to run PCA on the scaled data is as follows.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
If you obtain output similar to the following for the various components, it was successful.
PC_ 1
Positive:
CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
Negative:
MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
PC_ 2
Positive:
CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
Negative:
NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
PC_ 3
Positive:
(Output continues below...)
The results might appear a bit cluttered, so let’s narrow them down to five with the next command.
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
If you achieve an output like the one below, you’ve been successful. Ensure that PC_1 through PC_5 are displayed.
> print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
PC_ 1
Positive: FTL, FTH1, COTL1, CST3, OAZ1
Negative: MALAT1, IL32, LTB, CCL5, CTSW
PC_ 2
Positive: FTL, TYROBP, S100A8, S100A9, FCN1
Negative: ACTG1, STMN1, TUBA1B, TYMS, ZWINT
PC_ 3
Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DQB1, HLA-DQA1
Negative: PPBP, GNG11, SPARC, PF4, AP001189.4
PC_ 4
Positive: CD74, HLA-DQB1, HLA-DQA1, HLA-DRA, HLA-DQA2
Negative: NKG7, GZMA, GNLY, PRF1, FGFBP2
PC_ 5
Positive: ZWINT, KIAA0101, RRM2, HMGB2, AQP3
Negative: NKG7, CD74, HLA-DQA1, GNLY, SPON2
With this, the preparations for visualizing PCA are complete. Next, we’ll delve into the methods for visualizing the components.
Visualizing PCA
First, let’s determine which genes contribute to each component. Execute the following command.
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
You can see contributions starting from the genes with the highest contribution rates.
For a typical scatter plot using PC_1 and PC_2, you’ll employ the DimPlot
.
DimPlot(pbmc, reduction = "pca")
The output will be as follows.
With the DimHeatmap
function, you can inspect the main components in a heatmap style. The dims
argument allows you to control the principal component you want to display. The following is a code to display PC1.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
Determining the ‘Dimensionality’ of the Dataset
Using JackStrawPlot
, you can calculate and visualize the p-values for each principal component.
*Note: Executing this code might take some time.
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
For ranking the contribution of each component, ElbowPlot
is available. It’s handy for visualizing how much each component contributes.
ElbowPlot(pbmc)
From the output, you can observe an ‘elbow’ around PC9-10. This indicates that the majority of the true signal is explained by the first 10 PCs.
Cell Clustering (UMAP/tSNE)
We’re now on to cell clustering. Many might associate scRNA-seq analysis with nonlinear dimensionality reduction using UMAP/tSNE. Let’s execute the following command to proceed with UMAP.
# Cell Clustering
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
# Executing UMAP
pbmc <- RunUMAP(pbmc, dims = 1:10)
The following is a detailed discussion of each function. It might come off as complex, so feel free to skip over it if you wish. A basic understanding that the module is being optimized for clustering is sufficient.
FindNeighbors Function
This function constructs a KNN graph based on the Euclidean distance in PCA space. It refines the edge weights between any two cells based on shared overlap in their local neighborhood, which is measured using the Jaccard similarity. It takes the previously defined dimensions of the dataset as input (the first 10 PCs).
FindClusters Function
For cell clustering, we then apply modularity optimization techniques like the Louvain algorithm (by default) or SLM. This function iteratively groups cells with the aim of optimizing the standard modularity function. It includes a resolution
parameter to set the “granularity” of downstream clustering – a larger value will result in more clusters.
The results can be visualized using the DimPlot
.
DimPlot(pbmc, reduction = "umap")
And with that, you’ve successfully conducted a UMAP analysis!
If you’ve reached this far, you’ve essentially grasped the basics of scRNA-seq analysis using Seurat. The hands-on guide introduces further analysis techniques, so if you’re interested, I’d highly recommend giving it a shot.
If you encounter errors even after executing the commands as instructed…
Should you face errors despite following the command instructions, it’s likely that the input value to pbmc
is incorrect. Do not panic if errors arise; simply start over with quality checking, normalization, and re-execute the commands in order. By inputting in the correct sequence, you should proceed without errors.
In Conclusion
How did you find the tutorial? Hopefully, you’ve found that using Seurat for scRNA-seq analysis is simpler than anticipated. Next, I’m planning to pen an article on how to actually find and work with scRNA-seq data. Stay tuned!