This article is the second part of our hands-on tutorial with Seurat. If you haven’t worked through the first part, please start there.
In this session, we’ll engage in a hands-on experience with Seurat, an R package for scRNA-seq, and set the stage for initiating scRNA-seq. Since this post is based on Seurat’s hands-on guide, you’ll be able to grasp an overview of its usage. I encourage you to dive in!
macOS Monterey (12.4), Quad-Core Intel Core i7, Memory 32GB
We’ll proceed based on the assumption that you’ve executed the code introduced in the first half. If you haven’t done so, attempting the latter half of the hands-on may result in errors, so please be cautious.
install.packages("dplyr") install.packages("Seurat") install.packages("patchwork") library(dplyr) library(Seurat) library(patchwork) pbmc.data <- Read10X(data.dir = "./filtered_gene_bc_matrices/hg19/") pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 + plot2 pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
Before Executing PCA
Before employing dimensionality reduction techniques like PCA, we’ll start with a preprocessing step using linear transformation. Please execute the following command.
all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes)
The command to run PCA on the scaled data is as follows.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
If you obtain output similar to the following for the various components, it was successful.
PC_ 1 Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 PC_ 2 Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC PC_ 3 Positive: (Output continues below...)
The results might appear a bit cluttered, so let’s narrow them down to five with the next command.
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
If you achieve an output like the one below, you’ve been successful. Ensure that PC_1 through PC_5 are displayed.
> print(pbmc[["pca"]], dims = 1:5, nfeatures = 5) PC_ 1 Positive: FTL, FTH1, COTL1, CST3, OAZ1 Negative: MALAT1, IL32, LTB, CCL5, CTSW PC_ 2 Positive: FTL, TYROBP, S100A8, S100A9, FCN1 Negative: ACTG1, STMN1, TUBA1B, TYMS, ZWINT PC_ 3 Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DQB1, HLA-DQA1 Negative: PPBP, GNG11, SPARC, PF4, AP001189.4 PC_ 4 Positive: CD74, HLA-DQB1, HLA-DQA1, HLA-DRA, HLA-DQA2 Negative: NKG7, GZMA, GNLY, PRF1, FGFBP2 PC_ 5 Positive: ZWINT, KIAA0101, RRM2, HMGB2, AQP3 Negative: NKG7, CD74, HLA-DQA1, GNLY, SPON2
With this, the preparations for visualizing PCA are complete. Next, we’ll delve into the methods for visualizing the components.
First, let’s determine which genes contribute to each component. Execute the following command.
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
You can see contributions starting from the genes with the highest contribution rates.
For a typical scatter plot using PC_1 and PC_2, you’ll employ the
DimPlot(pbmc, reduction = "pca")
The output will be as follows.
DimHeatmap function, you can inspect the main components in a heatmap style. The
dims argument allows you to control the principal component you want to display. The following is a code to display PC1.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
Determining the ‘Dimensionality’ of the Dataset
JackStrawPlot, you can calculate and visualize the p-values for each principal component.
*Note: Executing this code might take some time.
pbmc <- JackStraw(pbmc, num.replicate = 100) pbmc <- ScoreJackStraw(pbmc, dims = 1:20) JackStrawPlot(pbmc, dims = 1:15)
For ranking the contribution of each component,
ElbowPlot is available. It’s handy for visualizing how much each component contributes.
From the output, you can observe an ‘elbow’ around PC9-10. This indicates that the majority of the true signal is explained by the first 10 PCs.
Cell Clustering (UMAP/tSNE)
We’re now on to cell clustering. Many might associate scRNA-seq analysis with nonlinear dimensionality reduction using UMAP/tSNE. Let’s execute the following command to proceed with UMAP.
# Cell Clustering pbmc <- FindNeighbors(pbmc, dims = 1:10) pbmc <- FindClusters(pbmc, resolution = 0.5) # Executing UMAP pbmc <- RunUMAP(pbmc, dims = 1:10)
The following is a detailed discussion of each function. It might come off as complex, so feel free to skip over it if you wish. A basic understanding that the module is being optimized for clustering is sufficient.
This function constructs a KNN graph based on the Euclidean distance in PCA space. It refines the edge weights between any two cells based on shared overlap in their local neighborhood, which is measured using the Jaccard similarity. It takes the previously defined dimensions of the dataset as input (the first 10 PCs).
For cell clustering, we then apply modularity optimization techniques like the Louvain algorithm (by default) or SLM. This function iteratively groups cells with the aim of optimizing the standard modularity function. It includes a
resolution parameter to set the “granularity” of downstream clustering – a larger value will result in more clusters.
The results can be visualized using the
DimPlot(pbmc, reduction = "umap")
And with that, you’ve successfully conducted a UMAP analysis!
If you’ve reached this far, you’ve essentially grasped the basics of scRNA-seq analysis using Seurat. The hands-on guide introduces further analysis techniques, so if you’re interested, I’d highly recommend giving it a shot.
If you encounter errors even after executing the commands as instructed…
Should you face errors despite following the command instructions, it’s likely that the input value to
pbmc is incorrect. Do not panic if errors arise; simply start over with quality checking, normalization, and re-execute the commands in order. By inputting in the correct sequence, you should proceed without errors.
How did you find the tutorial? Hopefully, you’ve found that using Seurat for scRNA-seq analysis is simpler than anticipated. Next, I’m planning to pen an article on how to actually find and work with scRNA-seq data. Stay tuned!