vignettes/RNAseqAnalysis.Rmd
RNAseqAnalysis.Rmd
This vignette walks through analysis of the B cell lymphoma RNA-seq data included in the future Andrews et al. paper. It uses the convenience functions included in this package to streamline the process.
Several command line tools as well as python are also utilized - each code block should indicate whether it is bash
, R
, or python
code. If your system is set up properly, you can still run these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.
The large majority of this vignette can be skipped if you want, the raw gene counts (or FASTQs if you want the whole shebang) is available in GSE145842. That accession also contains a table with the normalized gene counts as output from DESeq2, if you just want to spot-check your gene of interest in this dataset.
Required Software:
That is, if you ask me a question and haven’t read these first, I’m going to tell you to read them.
I prefer the salmon
-> DESeq2
route, as it’s fast and accurate. This requires R, salmon v0.14, and DESeq2 (along with a few other packages). salmon
needs mashmap
and bedtools
as well if generating “decoy-aware” transcriptomes. The Google can show you how to install those.
Newer versions of salmon recommend using a decoy-aware transcriptome to generate the index file to be used for selective alignment. This process is a bit annoying, so the authors have pre-computed some decoy transcriptomes for hg19 and hg38 with different annotations. In this case, I used the GENCODE v31 annotations that had been lifted over to hg19, so I had to generate them myself.
This had to be run on a cluster with 128 GB RAM. If you can avoid this step by using their pre-generated decoys, I recommend doing so.
You will likely have a different directory stucture, so paths will have to be edited as appropriate.
First, we’ll download all the annotation info and sequence data we need.
# bash
mkdir -p RNA_seq/ref
# annotations
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz
mv gencode.v31lift37.annotation.gtf.gz ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf.gz
gunzip ./RNA_seq/ref/gencode.v31lift37.annotation.gtf.gz
# transcriptome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.transcripts.fa.gz
mv gencode.v31lift37.transcripts.fa.gz ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa.gz
gunzip ./RNA_seq/ref/gencode.v31lift37.transcripts.fa.gz
# genome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz
mv GRCh37.primary_assembly.genome.fa.gz ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa.gz
gunzip ./RNA_seq/ref/GRCh37.primary_assembly.genome.fa.gz
Download the script to generate the decoys.
# bash
# Script for generating decoy genome.
wget https://raw.githubusercontent.com/COMBINE-lab/SalmonTools/master/scripts/generateDecoyTranscriptome.sh
This script crashed a lot for me, so I ended up editing it to just snag the last 10 commands or so (those with the step comments) and removing the variable parsing at the beginning, as it uses some deprecated linux commands (readpath
) that I couldn’t install on our computing cluster.
Actually generate the gene counts with salmon
now.
# bash
for f in ./RNA_Seq/FASTQs/*; do
samp="${f##*/}"
count=$(ls -1 "$f" | wc -l)
echo "$count"
if [ $count -gt 1 ]
then
salmon quant --validateMappings --threads 2 --seqBias -l A -1 "$f"/"$samp"_1.fq.gz -2 "$f"/"$samp"_2.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index
else
salmon quant --validateMappings --threads 2 --seqBias -l A -r "$f"/"$samp".RNA.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index
fi
done
For gene-level summarization and readability.
# python
outter = open("./RNA_Seq/ref/tx2gene.txt", "w")
with open("./RNA_Seq/ref/gencode.v31lift37.transcripts.fa") as f:
for line in f:
if not line.startswith(">"):
continue
line = line.strip().strip(">").split("|")
tx = line[0]
symb = line[5]
out = "\t".join([tx, symb])
print(out, file = outter)
outter.close()
An easy diagnostic to ensure everything worked properly is just to check the mapping rates for each sample. Depending on the experimental setup and annotations, a mapping rate of 50-90% may be expected. Really low rates (<40% or so) are worrying, and such samples should be removed from the analysis if possible.
Install this package if you haven’t already with devtools::install_github("j-andrews7/AndrewsBCellLymphoma")
. If you don’t have devtools
, it can be installed with install.packages("devtools")
.
The only thing you need now is a sample sheet, which is just a table with sample names and metadata columns. It can include as many metadata columns as you’d like.
DESeq2 is a very popular R package for performing differential expression analysis. Raw RNA-seq counts follow a negative binomial distribution, which is how DESeq2 models the data. It is quick, flexible, and generally works quite well with little to no parameter tweaking. It can also help to mitigate the technical or batch effects on the analysis.
# R
bcl.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/BCL.v.Tonsils", quants.path = "./RNA_Seq/quants",
samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
level = "status", padj.thresh = c(0.05, 0.01, 0.001), plot.box = TRUE,
plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)
subtypes.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/Subtypes.v.Tonsils", quants.path = "./RNA_Seq/quants",
samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
level = "disease", padj.thresh = c(0.05, 0.01, 0.001), plot.box = FALSE,
plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] OneLinerOmics_0.1.0.9000 DiffBind_2.14.0
## [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
## [5] BiocParallel_1.20.0 matrixStats_0.55.0
## [7] Biobase_2.46.0 GenomicRanges_1.38.0
## [9] GenomeInfoDb_1.22.0 IRanges_2.20.0
## [11] S4Vectors_0.24.0 BiocGenerics_0.32.0
## [13] enrichR_2.1 BiocStyle_2.14.2
##
## loaded via a namespace (and not attached):
## [1] rsvd_1.0.2
## [2] Hmisc_4.2-0
## [3] ica_1.0-2
## [4] Rsamtools_2.2.1
## [5] lmtest_0.9-37
## [6] rprojroot_1.3-2
## [7] crayon_1.3.4
## [8] MASS_7.3-51.4
## [9] nlme_3.1-140
## [10] backports_1.1.5
## [11] GOSemSim_2.12.0
## [12] rlang_0.4.1
## [13] XVector_0.26.0
## [14] ROCR_1.0-7
## [15] ChIPQC_1.21.0
## [16] irlba_2.3.3
## [17] limma_3.42.0
## [18] scater_1.14.0
## [19] GOstats_2.52.0
## [20] rjson_0.2.20
## [21] chipseq_1.36.0
## [22] bit64_0.9-7
## [23] glue_1.3.1
## [24] pheatmap_1.0.12
## [25] sctransform_0.2.1
## [26] vipor_0.4.5
## [27] AnnotationDbi_1.48.0
## [28] vsn_3.54.0
## [29] base64url_1.4
## [30] DOSE_3.12.0
## [31] tidyselect_0.2.5
## [32] fitdistrplus_1.0-14
## [33] XML_3.98-1.20
## [34] tidyr_1.0.0
## [35] zoo_1.8-6
## [36] GenomicAlignments_1.22.1
## [37] xtable_1.8-4
## [38] magrittr_1.5
## [39] evaluate_0.14
## [40] bibtex_0.4.2.2
## [41] ggplot2_3.2.1
## [42] Rdpack_0.11-1
## [43] zlibbioc_1.32.0
## [44] sn_1.5-4
## [45] hwriter_1.3.2
## [46] rstudioapi_0.10
## [47] EZscRNA_0.1.0.9000
## [48] EnhancedVolcano_1.4.0
## [49] rpart_4.1-15
## [50] fastmatch_1.1-0
## [51] BiocSingular_1.2.1
## [52] xfun_0.11
## [53] askpass_1.1
## [54] multtest_2.42.0
## [55] cluster_2.1.0
## [56] caTools_1.17.1.3
## [57] tidygraph_1.1.2
## [58] tibble_2.1.3
## [59] ggrepel_0.8.1
## [60] ape_5.3
## [61] listenv_0.8.0
## [62] Biostrings_2.54.0
## [63] png_0.1-7
## [64] future_1.15.1
## [65] zeallot_0.1.0
## [66] withr_2.1.2
## [67] bitops_1.0-6
## [68] ggforce_0.3.1
## [69] RBGL_1.62.1
## [70] plyr_1.8.5
## [71] GSEABase_1.48.0
## [72] pillar_1.4.3
## [73] RcppParallel_4.4.4
## [74] gplots_3.0.1.1
## [75] GenomicFeatures_1.38.0
## [76] multcomp_1.4-11
## [77] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
## [78] fs_1.3.1
## [79] europepmc_0.3
## [80] DelayedMatrixStats_1.8.0
## [81] vctrs_0.2.0
## [82] urltools_1.7.3
## [83] metap_1.2
## [84] tools_3.6.1
## [85] foreign_0.8-71
## [86] beeswarm_0.2.3
## [87] munsell_0.5.0
## [88] tweenr_1.0.1
## [89] fgsea_1.12.0
## [90] compiler_3.6.1
## [91] rtracklayer_1.46.0
## [92] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [93] plotly_4.9.1
## [94] GenomeInfoDbData_1.2.2
## [95] gridExtra_2.3
## [96] edgeR_3.28.0
## [97] lattice_0.20-38
## [98] mutoss_0.1-12
## [99] AnnotationForge_1.28.0
## [100] dplyr_0.8.3
## [101] BiocFileCache_1.10.2
## [102] affy_1.64.0
## [103] jsonlite_1.6
## [104] scales_1.0.0
## [105] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2
## [106] graph_1.64.0
## [107] pbapply_1.4-2
## [108] genefilter_1.68.0
## [109] lazyeval_0.2.2
## [110] latticeExtra_0.6-29
## [111] R.utils_2.9.2
## [112] reticulate_1.14
## [113] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
## [114] brew_1.0-6
## [115] checkmate_1.9.4
## [116] rmarkdown_2.0
## [117] pkgdown_1.4.1
## [118] sandwich_2.5-1
## [119] cowplot_1.0.0
## [120] Rtsne_0.15
## [121] BSgenome_1.54.0
## [122] uwot_0.1.5
## [123] igraph_1.2.4.2
## [124] survival_2.44-1.1
## [125] numDeriv_2016.8-1.1
## [126] yaml_2.2.0
## [127] plotrix_3.7-7
## [128] htmltools_0.4.0
## [129] memoise_1.1.0
## [130] VariantAnnotation_1.32.0
## [131] Seurat_3.1.2
## [132] locfit_1.5-9.1
## [133] graphlayouts_0.5.0
## [134] batchtools_0.9.11
## [135] viridisLite_0.3.0
## [136] digest_0.6.22
## [137] assertthat_0.2.1
## [138] rappdirs_0.3.1
## [139] npsurv_0.4-0
## [140] RSQLite_2.1.2
## [141] amap_0.8-18
## [142] future.apply_1.4.0
## [143] lsei_1.2-0
## [144] data.table_1.12.6
## [145] blob_1.2.0
## [146] R.oo_1.23.0
## [147] preprocessCore_1.48.0
## [148] splines_3.6.1
## [149] Formula_1.2-3
## [150] RCurl_1.95-4.12
## [151] hms_0.5.2
## [152] colorspace_1.4-1
## [153] base64enc_0.1-3
## [154] BiocManager_1.30.10
## [155] mnormt_1.5-5
## [156] ggbeeswarm_0.6.0
## [157] SDMTools_1.1-221.2
## [158] tximport_1.14.0
## [159] nnet_7.3-12
## [160] Rcpp_1.0.2
## [161] bookdown_0.16
## [162] RANN_2.6.1
## [163] mvtnorm_1.0-11
## [164] enrichplot_1.6.1
## [165] Nozzle.R1_1.1-1
## [166] ChIPseeker_1.22.1
## [167] R6_2.4.1
## [168] grid_3.6.1
## [169] ggridges_0.5.1
## [170] lifecycle_0.1.0
## [171] acepack_1.4.1
## [172] ShortRead_1.44.1
## [173] TFisher_0.2.0
## [174] curl_4.3
## [175] affyio_1.56.0
## [176] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
## [177] gdata_2.18.0
## [178] leiden_0.3.1
## [179] DO.db_2.9
## [180] Matrix_1.2-17
## [181] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2
## [182] qvalue_2.18.0
## [183] desc_1.2.0
## [184] RcppAnnoy_0.0.14
## [185] TH.data_1.0-10
## [186] org.Hs.eg.db_3.10.0
## [187] RColorBrewer_1.1-2
## [188] stringr_1.4.0
## [189] htmlwidgets_1.5.1
## [190] polyclip_1.10-0
## [191] triebeard_0.3.0
## [192] biomaRt_2.42.0
## [193] purrr_0.3.3
## [194] gridGraphics_0.4-1
## [195] globals_0.12.5
## [196] openssl_1.4.1
## [197] htmlTable_1.13.3
## [198] codetools_0.2-16
## [199] GO.db_3.10.0
## [200] gtools_3.8.1
## [201] prettyunits_1.0.2
## [202] SingleCellExperiment_1.8.0
## [203] dbplyr_1.4.2
## [204] R.methodsS3_1.7.1
## [205] gtable_0.3.0
## [206] tsne_0.1-3
## [207] DBI_1.1.0
## [208] httr_1.4.1
## [209] KernSmooth_2.23-15
## [210] stringi_1.4.3
## [211] progress_1.2.2
## [212] reshape2_1.4.3
## [213] farver_2.0.1
## [214] annotate_1.64.0
## [215] viridis_0.5.1
## [216] Rgraphviz_2.30.0
## [217] xml2_1.2.2
## [218] rvcheck_0.1.7
## [219] systemPipeR_1.20.0
## [220] boot_1.3-22
## [221] BiocNeighbors_1.4.1
## [222] geneplotter_1.64.0
## [223] ggplotify_0.0.4
## [224] Category_2.52.1
## [225] DESeq2_1.26.0
## [226] bit_1.1-14
## [227] jpeg_0.1-8.1
## [228] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [229] ggraph_2.0.0
## [230] pkgconfig_2.0.3
## [231] gbRd_0.4-11
## [232] knitr_1.26