【RNA-seq】Rを使ってGEOデータベースからRNA-seqデータを取得する方法【バイオインフォマティクス】

【RNA-seq】遺伝子アノテーション情報の取得と活用方法【バイオインフォマティクス】

GEOデータベースからRNA-seqデータを取得してみたいと思いませんか?

GEOデータベースからRNA-seqデータを取得し、サンプル情報を整理する方法について解説します。公共データベースから既存の実験データを取得して再解析したい時や、自身の研究データと比較したい時に使用します。GEOデータベースの基本的な使い方のうち、データの検索方法、カウントデータの直接取得、メタデータの取得と整理までを扱います。GEOから効率的にRNA-seqデータを取得できるようになり、既存の実験データを活用した研究が可能になります。また、Rを使ったメタデータの取得・整理方法を学ぶことで、大規模なデータセットの管理が容易になります。ぜひ挑戦しましょう。

Tested Environment

macOS Monterey (12.4), Quad-Core Intel Core i7, Memory 32GB GEOquery 2.74.0

GEOデータベースとは

Gene Expression Omnibus(GEO)は、米国国立生物工学情報センター(NCBI)が運営する公共データベースであり、遺伝子発現データや関連する生物学的データの保存と提供を行っています。GEOは、マイクロアレイ、RNA-Seq、ChIP-Seqなど、さまざまな高スループット実験技術によって得られた遺伝子発現プロファイルを収集・保存しています。

GEOを使えばほしいRNA-seqデータを探すことができるのでやってみましょう。

ブラウザからGEOを使って好きなキーワードでRNA-seqデータを探す方法

GEOのトップページの右の検索バーに探したいキーワードで探します。今回は試しに「lung cancer」と打ち込んでください。

すると、検索バーのすぐ下にresultsが出てきます。GEO DataSets Databaseの方をつかいたいため、上の方の数値をクリックしましょう。

検索結果が出てきたら、左サイドバーのCustomize…をクリックし、Study Typeを出現させます。その中で「Expression profiling high throughtput sequencing」にチェックをいれて「show」を押しましょう。

この時点でNGSのデータだけに絞られます。しかしながら今の状態だとシングルセルRNA-seqデータやChip-seqデータなども一緒に出てきてしまうため、検索ワードを以下に変えてみましょう。

# 検索ワード
lung cancer and "rna-seq"

すると、RNA-seqデータだけがソートできます。今回は練習として Identification of resistance mechanisms to small-molecule inhibition of YAP/TEAD-regulated transcriptionをクリックしてみてください。

少し下にスクロールすると「SRA Run Selector」というリンクがあるのでクリックしてください。

下にスクロールするとテーブルがあり、Runという列にSRRで始まる番号が記載されています。こちらのSRR番号をダウンロードに使います。

実際のダウンロードは後ほど詳しく解説していきます。この章では、RNA-seqデータがどのように保存されていて利用できるのかを覚えておいてください。

Rを使ってGEOから直接カウントデータを取得

Rを使ってGEOから直接カウントデータを取得する方法を説明します。多くの場合、GEOには解析済みのカウントデータが公開されています。今回、GSE141413をダウンロードしてみます。

https://www.omicsdi.org/dataset/geo/GSE141413

# GEOからカウントデータを取得
gse <- getGEO("GSE141413", GSEMatrix = FALSE)

# カウントデータを含む補足ファイルをダウンロード
supp_dir <- "data/geo_suppl"
dir.create(supp_dir, recursive = TRUE, showWarnings = FALSE)
count_file <- getGEOSuppFiles("GSE141413", baseDir = supp_dir, makeDirectory = FALSE)

# ダウンロードしたカウントデータを読み込む
counts_path <- file.path(supp_dir, "GSE141413_gene_count.txt.gz")
counts_data <- read.delim(counts_path, header = TRUE)
head(counts_data)

# 出力

                   Control  NC_LM st2_LM   IFNG st2_IFNG gene_name gene_chr gene_start
ENSMUSG00000069516 2127383 504536 557353 528938    61322      Lyz2       10  117277331
ENSMUSG00000024661 2036671 561460 301334 328896   135545      Fth1       19    9982703
ENSMUSG00000029816  161502 245941 454979 437381   195531     Gpnmb        6   49036546
ENSMUSG00000007891  641791 107244 326051 347932    81647      Ctsd        7  142375911
ENSMUSG00000021939  210418 165715 397984 389062   148862      Ctsb       14   63122462
ENSMUSG00000004207  157493 121549 389289 373691   108034      Psap       10   60277627
                    gene_end gene_strand gene_length   gene_biotype
ENSMUSG00000069516 117282321           -        1316 protein_coding
ENSMUSG00000024661   9985092           +         866 protein_coding
ENSMUSG00000029816  49070929           +        6514 protein_coding
ENSMUSG00000007891 142388038           -        2351 protein_coding
ENSMUSG00000021939  63145923           +        6051 protein_coding
ENSMUSG00000004207  60302597           +        2744 protein_coding
                                                                       gene_description
ENSMUSG00000069516                         lysozyme 2 [Source:MGI Symbol;Acc:MGI:96897]
ENSMUSG00000024661       ferritin heavy polypeptide 1 [Source:MGI Symbol;Acc:MGI:95588]
ENSMUSG00000029816 glycoprotein (transmembrane) nmb [Source:MGI Symbol;Acc:MGI:1934765]
ENSMUSG00000007891                        cathepsin D [Source:MGI Symbol;Acc:MGI:88562]
ENSMUSG00000021939                        cathepsin B [Source:MGI Symbol;Acc:MGI:88561]
ENSMUSG00000004207                         prosaposin [Source:MGI Symbol;Acc:MGI:97783]
                   tf_family
ENSMUSG00000069516         -
ENSMUSG00000024661         -
ENSMUSG00000029816         -
ENSMUSG00000007891         -
ENSMUSG00000021939         -
ENSMUSG00000004207         -

Rを使ってGEOからのメタデータ取得

Rを使ってGEOからメタデータを取得して、データセットの概要を把握してみましょう。

# GEOqueryパッケージを読み込む
library(GEOquery)

# GSE141413のデータセット情報を取得
gse <- getGEO("GSE141413", GSEMatrix = TRUE)
gse
# 出力

$GSE141413_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 5 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM4202491 GSM4202492 ... GSM4202495 (5 total)
  varLabels: title geo_accession ... treatment:ch1 (47 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL24247

取得したデータセット情報から、実験の詳細やサンプル情報を確認できます。

# 実験の詳細情報
exp_info <- experimentData(gse[[1]])
exp_info@name
exp_info@title
exp_info@abstract

# サンプル情報の確認
sample_info <- pData(phenoData(gse[[1]]))
head(sample_info)

# 出力

                              title geo_accession                status submission_date
GSM4202491     macrophages, control    GSM4202491 Public on Jan 01 2022     Dec 04 2019
GSM4202492       macrophages, NC_LM    GSM4202492 Public on Jan 01 2022     Dec 04 2019
GSM4202493   macrophages, siTet2_LM    GSM4202493 Public on Jan 01 2022     Dec 04 2019
GSM4202494        macrophages, IFNγ    GSM4202494 Public on Jan 01 2022     Dec 04 2019
GSM4202495 macrophages, siTet2_IFNγ    GSM4202495 Public on Jan 01 2022     Dec 04 2019
           last_update_date type channel_count source_name_ch1 organism_ch1
GSM4202491      Jan 01 2022  SRA             1      macrophage Mus musculus
GSM4202492      Jan 01 2022  SRA             1      macrophage Mus musculus
GSM4202493      Jan 01 2022  SRA             1      macrophage Mus musculus
GSM4202494      Jan 01 2022  SRA             1      macrophage Mus musculus
GSM4202495      Jan 01 2022  SRA             1      macrophage Mus musculus
             characteristics_ch1 characteristics_ch1.1 characteristics_ch1.2
GSM4202491 cell type: macrophage        age: 6-8 weeks       strain: C57BL/6
GSM4202492 cell type: macrophage        age: 6-8 weeks       strain: C57BL/6
GSM4202493 cell type: macrophage        age: 6-8 weeks       strain: C57BL/6
GSM4202494 cell type: macrophage        age: 6-8 weeks       strain: C57BL/6
GSM4202495 cell type: macrophage        age: 6-8 weeks       strain: C57BL/6
           characteristics_ch1.3                                 characteristics_ch1.4
GSM4202491   genotype: wild type                                    treatment: control
GSM4202492   genotype: wild type        treatment: si-NC followed by 12 h LM infection
GSM4202493   genotype: wild type      treatment: si-Tet2 followed by 12 h LM infection
GSM4202494   genotype: wild type   treatment: si-NC followed by 12 h IFN-γ stimulation
GSM4202495   genotype: wild type treatment: si-Tet2 followed by 12 h IFN-γ stimulation
                                                                                      growth_protocol_ch1
GSM4202491 PMs were generated on DMEM medium supplemented with 10% Gibico FBS and Penicillin-Streptomycin
GSM4202492 PMs were generated on DMEM medium supplemented with 10% Gibico FBS and Penicillin-Streptomycin
GSM4202493 PMs were generated on DMEM medium supplemented with 10% Gibico FBS and Penicillin-Streptomycin
GSM4202494 PMs were generated on DMEM medium supplemented with 10% Gibico FBS and Penicillin-Streptomycin
GSM4202495 PMs were generated on DMEM medium supplemented with 10% Gibico FBS and Penicillin-Streptomycin
           molecule_ch1
GSM4202491    total RNA
GSM4202492    total RNA
GSM4202493    total RNA
GSM4202494    total RNA
GSM4202495    total RNA
                                                                             extract_protocol_ch1
GSM4202491 Peritoneal perfusion to collect mice PMs,and RNA was harvested by using Trizol reagent
GSM4202492 Peritoneal perfusion to collect mice PMs,and RNA was harvested by using Trizol reagent
GSM4202493 Peritoneal perfusion to collect mice PMs,and RNA was harvested by using Trizol reagent
GSM4202494 Peritoneal perfusion to collect mice PMs,and RNA was harvested by using Trizol reagent
GSM4202495 Peritoneal perfusion to collect mice PMs,and RNA was harvested by using Trizol reagent
                                                                 extract_protocol_ch1.1
GSM4202491 RNA libraries were prepared for sequencing using standard Illumina protocols
GSM4202492 RNA libraries were prepared for sequencing using standard Illumina protocols
GSM4202493 RNA libraries were prepared for sequencing using standard Illumina protocols
GSM4202494 RNA libraries were prepared for sequencing using standard Illumina protocols
GSM4202495 RNA libraries were prepared for sequencing using standard Illumina protocols
           taxid_ch1 description                  data_processing
GSM4202491     10090     Control Basecalls performed using CASAVA
GSM4202492     10090       NC_LM Basecalls performed using CASAVA
GSM4202493     10090      st2_LM Basecalls performed using CASAVA
GSM4202494     10090        IFNG Basecalls performed using CASAVA
GSM4202495     10090    st2_IFNG Basecalls performed using CASAVA
                                                                                                                                                                                                                                            data_processing.1
GSM4202491 Paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5 with reference genome and gene model annotation files were downloaded from genome website directly and index of the reference genome was built using Hisat2 v2.0.5
GSM4202492 Paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5 with reference genome and gene model annotation files were downloaded from genome website directly and index of the reference genome was built using Hisat2 v2.0.5
GSM4202493 Paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5 with reference genome and gene model annotation files were downloaded from genome website directly and index of the reference genome was built using Hisat2 v2.0.5
GSM4202494 Paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5 with reference genome and gene model annotation files were downloaded from genome website directly and index of the reference genome was built using Hisat2 v2.0.5
GSM4202495 Paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5 with reference genome and gene model annotation files were downloaded from genome website directly and index of the reference genome was built using Hisat2 v2.0.5
                                                                                                                                                                                    data_processing.2
GSM4202491 The reads numbers mapped to each gene were counted using featureCounts v1.5.0-p3, and FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene
GSM4202492 The reads numbers mapped to each gene were counted using featureCounts v1.5.0-p3, and FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene
GSM4202493 The reads numbers mapped to each gene were counted using featureCounts v1.5.0-p3, and FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene
GSM4202494 The reads numbers mapped to each gene were counted using featureCounts v1.5.0-p3, and FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene
GSM4202495 The reads numbers mapped to each gene were counted using featureCounts v1.5.0-p3, and FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene
            data_processing.3
GSM4202491 Genome_build: mm10
GSM4202492 Genome_build: mm10
GSM4202493 Genome_build: mm10
GSM4202494 Genome_build: mm10
GSM4202495 Genome_build: mm10
                                                                                                            data_processing.4
GSM4202491 Supplementary_files_format_and_content: The processed data file contains abundance measurements of different genes
GSM4202492 Supplementary_files_format_and_content: The processed data file contains abundance measurements of different genes
GSM4202493 Supplementary_files_format_and_content: The processed data file contains abundance measurements of different genes
GSM4202494 Supplementary_files_format_and_content: The processed data file contains abundance measurements of different genes
GSM4202495 Supplementary_files_format_and_content: The processed data file contains abundance measurements of different genes
           platform_id contact_name        contact_email   contact_institute
GSM4202491    GPL24247       xu,,li lixu@westlake.edu.cn westlake university
GSM4202492    GPL24247       xu,,li lixu@westlake.edu.cn westlake university
GSM4202493    GPL24247       xu,,li lixu@westlake.edu.cn westlake university
GSM4202494    GPL24247       xu,,li lixu@westlake.edu.cn westlake university
GSM4202495    GPL24247       xu,,li lixu@westlake.edu.cn westlake university
                  contact_address contact_city     contact_state contact_zip/postal_code
GSM4202491 shilongshan street 18#     hangzhou zhejiang province                  310000
GSM4202492 shilongshan street 18#     hangzhou zhejiang province                  310000
GSM4202493 shilongshan street 18#     hangzhou zhejiang province                  310000
GSM4202494 shilongshan street 18#     hangzhou zhejiang province                  310000
GSM4202495 shilongshan street 18#     hangzhou zhejiang province                  310000
           contact_country data_row_count      instrument_model library_selection
GSM4202491           China              0 Illumina NovaSeq 6000              cDNA
GSM4202492           China              0 Illumina NovaSeq 6000              cDNA
GSM4202493           China              0 Illumina NovaSeq 6000              cDNA
GSM4202494           China              0 Illumina NovaSeq 6000              cDNA
GSM4202495           China              0 Illumina NovaSeq 6000              cDNA
           library_source library_strategy
GSM4202491 transcriptomic          RNA-Seq
GSM4202492 transcriptomic          RNA-Seq
GSM4202493 transcriptomic          RNA-Seq
GSM4202494 transcriptomic          RNA-Seq
GSM4202495 transcriptomic          RNA-Seq
                                                                 relation
GSM4202491 BioSample: <https://www.ncbi.nlm.nih.gov/biosample/SAMN13476496>
GSM4202492 BioSample: <https://www.ncbi.nlm.nih.gov/biosample/SAMN13476495>
GSM4202493 BioSample: <https://www.ncbi.nlm.nih.gov/biosample/SAMN13476494>
GSM4202494 BioSample: <https://www.ncbi.nlm.nih.gov/biosample/SAMN13476493>
GSM4202495 BioSample: <https://www.ncbi.nlm.nih.gov/biosample/SAMN13476492>
                                                      relation.1 supplementary_file_1
GSM4202491 SRA: <https://www.ncbi.nlm.nih.gov/sra?term=SRX7264639>                 NONE
GSM4202492 SRA: <https://www.ncbi.nlm.nih.gov/sra?term=SRX7264640>                 NONE
GSM4202493 SRA: <https://www.ncbi.nlm.nih.gov/sra?term=SRX7264641>                 NONE
GSM4202494 SRA: <https://www.ncbi.nlm.nih.gov/sra?term=SRX7264642>                 NONE
GSM4202495 SRA: <https://www.ncbi.nlm.nih.gov/sra?term=SRX7264643>                 NONE
             age:ch1 cell type:ch1 genotype:ch1 strain:ch1
GSM4202491 6-8 weeks    macrophage    wild type    C57BL/6
GSM4202492 6-8 weeks    macrophage    wild type    C57BL/6
GSM4202493 6-8 weeks    macrophage    wild type    C57BL/6
GSM4202494 6-8 weeks    macrophage    wild type    C57BL/6
GSM4202495 6-8 weeks    macrophage    wild type    C57BL/6
                                        treatment:ch1
GSM4202491                                    control
GSM4202492        si-NC followed by 12 h LM infection
GSM4202493      si-Tet2 followed by 12 h LM infection
GSM4202494   si-NC followed by 12 h IFN-γ stimulation
GSM4202495 si-Tet2 followed by 12 h IFN-γ stimulation

このままだと見づらいため、解析に必要なサンプル情報を整理します。

# 必要な列のみを抽出
samples <- data.frame(
  sample_id = sample_info$geo_accession,
  title = sample_info$title,
  group = ifelse(grepl("siTET2", sample_info$title), "siTET2", "siControl"),
  sra_id = gsub(".*(SR[AX]\\\\\\\\d+).*", "\\\\\\\\1", sample_info$relation.1),
  row.names = sample_info$geo_accession
)

samples$sra_id <- gsub("SRA: ", "", samples$sra_id)

# サンプル情報の確認
print(samples)

# 出力

            sample_id                    title     group     sra_id
GSM4202491 GSM4202491     macrophages, control siControl SRX7264639
GSM4202492 GSM4202492       macrophages, NC_LM siControl SRX7264640
GSM4202493 GSM4202493   macrophages, siTet2_LM siControl SRX7264641
GSM4202494 GSM4202494        macrophages, IFNγ siControl SRX7264642
GSM4202495 GSM4202495 macrophages, siTet2_IFNγ siControl SRX7264643

これでダウンロードができました。

最後に


いかがだったでしょうか。GEOを使って好きなRNA-seqデータを探して解析する準備をしましょう!

次回、詳しい解析方法を執筆予定なので、ぜひ見ていただけますと幸いです。

2 COMMENTS

大野彰久

GSE152418のデータセット情報を取得 と記載がありますが、このGSE番号は間違いでしょうか?

返信する
koreeda

大野彰久様

ご連絡ありがとうございます。コメントアウトの文章はGSE152418ではなくGSE141413でしたので修正いたしました。
ご指摘いただき誠にありがとうございます。

返信する

コメントを残す

メールアドレスが公開されることはありません。 が付いている欄は必須項目です