Introduction

This vignette walks through analysis of the B cell lymphoma RNA-seq data included in the future Andrews et al. paper. It uses the convenience functions included in this package to streamline the process.

Several command line tools as well as python are also utilized - each code block should indicate whether it is bash, R, or python code. If your system is set up properly, you can still run these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.

The large majority of this vignette can be skipped if you want, the raw gene counts (or FASTQs if you want the whole shebang) is available in GSE145842. That accession also contains a table with the normalized gene counts as output from DESeq2, if you just want to spot-check your gene of interest in this dataset.

Required Software:

R (3.6+)
python
salmon (v0.14+)
bedtools - only needed if generating salmon decoy transcriptome
mashmap - only needed if generating salmon decoy transcriptome

Suggested Reading

That is, if you ask me a question and haven’t read these first, I’m going to tell you to read them.

The DESeq2 vignette.
The rnaseqGene workflow, which walks through a DESeq2 analysis in great detail.

Generating gene counts

I prefer the salmon -> DESeq2 route, as it’s fast and accurate. This requires R, salmon v0.14, and DESeq2 (along with a few other packages). salmon needs mashmap and bedtools as well if generating “decoy-aware” transcriptomes. The Google can show you how to install those.

Create decoy reference transcriptome

Newer versions of salmon recommend using a decoy-aware transcriptome to generate the index file to be used for selective alignment. This process is a bit annoying, so the authors have pre-computed some decoy transcriptomes for hg19 and hg38 with different annotations. In this case, I used the GENCODE v31 annotations that had been lifted over to hg19, so I had to generate them myself.

This had to be run on a cluster with 128 GB RAM. If you can avoid this step by using their pre-generated decoys, I recommend doing so.

You will likely have a different directory stucture, so paths will have to be edited as appropriate.

First, we’ll download all the annotation info and sequence data we need.

# bash
mkdir -p RNA_seq/ref

# annotations
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz
mv gencode.v31lift37.annotation.gtf.gz ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf.gz
gunzip ./RNA_seq/ref/gencode.v31lift37.annotation.gtf.gz

# transcriptome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.transcripts.fa.gz
mv gencode.v31lift37.transcripts.fa.gz ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa.gz
gunzip ./RNA_seq/ref/gencode.v31lift37.transcripts.fa.gz

# genome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz
mv GRCh37.primary_assembly.genome.fa.gz ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa.gz
gunzip ./RNA_seq/ref/GRCh37.primary_assembly.genome.fa.gz

Download the script to generate the decoys.

# bash
# Script for generating decoy genome.
wget https://raw.githubusercontent.com/COMBINE-lab/SalmonTools/master/scripts/generateDecoyTranscriptome.sh

This script crashed a lot for me, so I ended up editing it to just snag the last 10 commands or so (those with the step comments) and removing the variable parsing at the beginning, as it uses some deprecated linux commands (readpath) that I couldn’t install on our computing cluster.

# bash
bash generateDecoyTranscriptome.sh -j 2 -a ./RNA_Seq/ref/gencode.v31lift37.annotation.gtf -g ./RNA_Seq/ref/GRCh37.primary_assembly.genome.fa -t ./RNA_Seq/ref/gencode.v31lift37.transcripts.fa -o ./RNA_Seq/ref/decoys/

Index the transcriptome

This will generate the decoy-aware transcriptome index for our annotations.

# Bash
# Now actually create the index for salmon. We lower k as some of our samples use 50 bp reads.
salmon index -t ./RNA_Seq/ref/decoys/gentrome.fa -i ./RNA_Seq/ref/gencodev31_grch37_index --decoys ./RNA_Seq/ref/decoys/decoys.txt -k 25 --gencode

Quantify

Actually generate the gene counts with salmon now.

# bash
for f in ./RNA_Seq/FASTQs/*; do
    samp="${f##*/}"
    count=$(ls -1 "$f" | wc -l)
    echo "$count"
    if [ $count -gt 1 ]
    then
        salmon quant --validateMappings --threads 2 --seqBias -l A -1 "$f"/"$samp"_1.fq.gz -2 "$f"/"$samp"_2.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index
    else
        salmon quant --validateMappings --threads 2 --seqBias -l A -r "$f"/"$samp".RNA.fq.gz -o ./RNA_Seq/quants/"$samp" --index ./RNA_Seq/ref/gencodev31_grch37_index
    fi
done

Make Transcript-to-Gene Mappings

For gene-level summarization and readability.

# python
outter = open("./RNA_Seq/ref/tx2gene.txt", "w")
with open("./RNA_Seq/ref/gencode.v31lift37.transcripts.fa") as f:
    for line in f:
        if not line.startswith(">"):
            continue
        line = line.strip().strip(">").split("|")
        tx = line[0]
        symb = line[5]
        out = "\t".join([tx, symb])
        print(out, file = outter)

outter.close()

Check Mapping Rates

An easy diagnostic to ensure everything worked properly is just to check the mapping rates for each sample. Depending on the experimental setup and annotations, a mapping rate of 50-90% may be expected. Really low rates (<40% or so) are worrying, and such samples should be removed from the analysis if possible.

for f in ./RNA_Seq/quants/*; do
  echo "${f##*/}";
  grep "Mapping Rate" "$f"/logs/salmon_quant.log;
done

Analysis with R

Install this package if you haven’t already with devtools::install_github("j-andrews7/AndrewsBCellLymphoma"). If you don’t have devtools, it can be installed with install.packages("devtools").

# R
suppressPackageStartupMessages(library(OneLinerOmics))

The only thing you need now is a sample sheet, which is just a table with sample names and metadata columns. It can include as many metadata columns as you’d like.

Run DESeq2

DESeq2 is a very popular R package for performing differential expression analysis. Raw RNA-seq counts follow a negative binomial distribution, which is how DESeq2 models the data. It is quick, flexible, and generally works quite well with little to no parameter tweaking. It can also help to mitigate the technical or batch effects on the analysis.

# R
bcl.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/BCL.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "status", padj.thresh = c(0.05, 0.01, 0.001), plot.box = TRUE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

subtypes.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/Subtypes.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "disease", padj.thresh = c(0.05, 0.01, 0.001), plot.box = FALSE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

Next Steps

Yes, that’s it. Explore the plots, tables, enrichments, etc. Re-run if necessary, tweaking parameters as you see fit.

Session Information

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] OneLinerOmics_0.1.0.9000    DiffBind_2.14.0            
##  [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
##  [5] BiocParallel_1.20.0         matrixStats_0.55.0         
##  [7] Biobase_2.46.0              GenomicRanges_1.38.0       
##  [9] GenomeInfoDb_1.22.0         IRanges_2.20.0             
## [11] S4Vectors_0.24.0            BiocGenerics_0.32.0        
## [13] enrichR_2.1                 BiocStyle_2.14.2           
## 
## loaded via a namespace (and not attached):
##   [1] rsvd_1.0.2                               
##   [2] Hmisc_4.2-0                              
##   [3] ica_1.0-2                                
##   [4] Rsamtools_2.2.1                          
##   [5] lmtest_0.9-37                            
##   [6] rprojroot_1.3-2                          
##   [7] crayon_1.3.4                             
##   [8] MASS_7.3-51.4                            
##   [9] nlme_3.1-140                             
##  [10] backports_1.1.5                          
##  [11] GOSemSim_2.12.0                          
##  [12] rlang_0.4.1                              
##  [13] XVector_0.26.0                           
##  [14] ROCR_1.0-7                               
##  [15] ChIPQC_1.21.0                            
##  [16] irlba_2.3.3                              
##  [17] limma_3.42.0                             
##  [18] scater_1.14.0                            
##  [19] GOstats_2.52.0                           
##  [20] rjson_0.2.20                             
##  [21] chipseq_1.36.0                           
##  [22] bit64_0.9-7                              
##  [23] glue_1.3.1                               
##  [24] pheatmap_1.0.12                          
##  [25] sctransform_0.2.1                        
##  [26] vipor_0.4.5                              
##  [27] AnnotationDbi_1.48.0                     
##  [28] vsn_3.54.0                               
##  [29] base64url_1.4                            
##  [30] DOSE_3.12.0                              
##  [31] tidyselect_0.2.5                         
##  [32] fitdistrplus_1.0-14                      
##  [33] XML_3.98-1.20                            
##  [34] tidyr_1.0.0                              
##  [35] zoo_1.8-6                                
##  [36] GenomicAlignments_1.22.1                 
##  [37] xtable_1.8-4                             
##  [38] magrittr_1.5                             
##  [39] evaluate_0.14                            
##  [40] bibtex_0.4.2.2                           
##  [41] ggplot2_3.2.1                            
##  [42] Rdpack_0.11-1                            
##  [43] zlibbioc_1.32.0                          
##  [44] sn_1.5-4                                 
##  [45] hwriter_1.3.2                            
##  [46] rstudioapi_0.10                          
##  [47] EZscRNA_0.1.0.9000                       
##  [48] EnhancedVolcano_1.4.0                    
##  [49] rpart_4.1-15                             
##  [50] fastmatch_1.1-0                          
##  [51] BiocSingular_1.2.1                       
##  [52] xfun_0.11                                
##  [53] askpass_1.1                              
##  [54] multtest_2.42.0                          
##  [55] cluster_2.1.0                            
##  [56] caTools_1.17.1.3                         
##  [57] tidygraph_1.1.2                          
##  [58] tibble_2.1.3                             
##  [59] ggrepel_0.8.1                            
##  [60] ape_5.3                                  
##  [61] listenv_0.8.0                            
##  [62] Biostrings_2.54.0                        
##  [63] png_0.1-7                                
##  [64] future_1.15.1                            
##  [65] zeallot_0.1.0                            
##  [66] withr_2.1.2                              
##  [67] bitops_1.0-6                             
##  [68] ggforce_0.3.1                            
##  [69] RBGL_1.62.1                              
##  [70] plyr_1.8.5                               
##  [71] GSEABase_1.48.0                          
##  [72] pillar_1.4.3                             
##  [73] RcppParallel_4.4.4                       
##  [74] gplots_3.0.1.1                           
##  [75] GenomicFeatures_1.38.0                   
##  [76] multcomp_1.4-11                          
##  [77] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
##  [78] fs_1.3.1                                 
##  [79] europepmc_0.3                            
##  [80] DelayedMatrixStats_1.8.0                 
##  [81] vctrs_0.2.0                              
##  [82] urltools_1.7.3                           
##  [83] metap_1.2                                
##  [84] tools_3.6.1                              
##  [85] foreign_0.8-71                           
##  [86] beeswarm_0.2.3                           
##  [87] munsell_0.5.0                            
##  [88] tweenr_1.0.1                             
##  [89] fgsea_1.12.0                             
##  [90] compiler_3.6.1                           
##  [91] rtracklayer_1.46.0                       
##  [92] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
##  [93] plotly_4.9.1                             
##  [94] GenomeInfoDbData_1.2.2                   
##  [95] gridExtra_2.3                            
##  [96] edgeR_3.28.0                             
##  [97] lattice_0.20-38                          
##  [98] mutoss_0.1-12                            
##  [99] AnnotationForge_1.28.0                   
## [100] dplyr_0.8.3                              
## [101] BiocFileCache_1.10.2                     
## [102] affy_1.64.0                              
## [103] jsonlite_1.6                             
## [104] scales_1.0.0                             
## [105] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2  
## [106] graph_1.64.0                             
## [107] pbapply_1.4-2                            
## [108] genefilter_1.68.0                        
## [109] lazyeval_0.2.2                           
## [110] latticeExtra_0.6-29                      
## [111] R.utils_2.9.2                            
## [112] reticulate_1.14                          
## [113] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  
## [114] brew_1.0-6                               
## [115] checkmate_1.9.4                          
## [116] rmarkdown_2.0                            
## [117] pkgdown_1.4.1                            
## [118] sandwich_2.5-1                           
## [119] cowplot_1.0.0                            
## [120] Rtsne_0.15                               
## [121] BSgenome_1.54.0                          
## [122] uwot_0.1.5                               
## [123] igraph_1.2.4.2                           
## [124] survival_2.44-1.1                        
## [125] numDeriv_2016.8-1.1                      
## [126] yaml_2.2.0                               
## [127] plotrix_3.7-7                            
## [128] htmltools_0.4.0                          
## [129] memoise_1.1.0                            
## [130] VariantAnnotation_1.32.0                 
## [131] Seurat_3.1.2                             
## [132] locfit_1.5-9.1                           
## [133] graphlayouts_0.5.0                       
## [134] batchtools_0.9.11                        
## [135] viridisLite_0.3.0                        
## [136] digest_0.6.22                            
## [137] assertthat_0.2.1                         
## [138] rappdirs_0.3.1                           
## [139] npsurv_0.4-0                             
## [140] RSQLite_2.1.2                            
## [141] amap_0.8-18                              
## [142] future.apply_1.4.0                       
## [143] lsei_1.2-0                               
## [144] data.table_1.12.6                        
## [145] blob_1.2.0                               
## [146] R.oo_1.23.0                              
## [147] preprocessCore_1.48.0                    
## [148] splines_3.6.1                            
## [149] Formula_1.2-3                            
## [150] RCurl_1.95-4.12                          
## [151] hms_0.5.2                                
## [152] colorspace_1.4-1                         
## [153] base64enc_0.1-3                          
## [154] BiocManager_1.30.10                      
## [155] mnormt_1.5-5                             
## [156] ggbeeswarm_0.6.0                         
## [157] SDMTools_1.1-221.2                       
## [158] tximport_1.14.0                          
## [159] nnet_7.3-12                              
## [160] Rcpp_1.0.2                               
## [161] bookdown_0.16                            
## [162] RANN_2.6.1                               
## [163] mvtnorm_1.0-11                           
## [164] enrichplot_1.6.1                         
## [165] Nozzle.R1_1.1-1                          
## [166] ChIPseeker_1.22.1                        
## [167] R6_2.4.1                                 
## [168] grid_3.6.1                               
## [169] ggridges_0.5.1                           
## [170] lifecycle_0.1.0                          
## [171] acepack_1.4.1                            
## [172] ShortRead_1.44.1                         
## [173] TFisher_0.2.0                            
## [174] curl_4.3                                 
## [175] affyio_1.56.0                            
## [176] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2  
## [177] gdata_2.18.0                             
## [178] leiden_0.3.1                             
## [179] DO.db_2.9                                
## [180] Matrix_1.2-17                            
## [181] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2     
## [182] qvalue_2.18.0                            
## [183] desc_1.2.0                               
## [184] RcppAnnoy_0.0.14                         
## [185] TH.data_1.0-10                           
## [186] org.Hs.eg.db_3.10.0                      
## [187] RColorBrewer_1.1-2                       
## [188] stringr_1.4.0                            
## [189] htmlwidgets_1.5.1                        
## [190] polyclip_1.10-0                          
## [191] triebeard_0.3.0                          
## [192] biomaRt_2.42.0                           
## [193] purrr_0.3.3                              
## [194] gridGraphics_0.4-1                       
## [195] globals_0.12.5                           
## [196] openssl_1.4.1                            
## [197] htmlTable_1.13.3                         
## [198] codetools_0.2-16                         
## [199] GO.db_3.10.0                             
## [200] gtools_3.8.1                             
## [201] prettyunits_1.0.2                        
## [202] SingleCellExperiment_1.8.0               
## [203] dbplyr_1.4.2                             
## [204] R.methodsS3_1.7.1                        
## [205] gtable_0.3.0                             
## [206] tsne_0.1-3                               
## [207] DBI_1.1.0                                
## [208] httr_1.4.1                               
## [209] KernSmooth_2.23-15                       
## [210] stringi_1.4.3                            
## [211] progress_1.2.2                           
## [212] reshape2_1.4.3                           
## [213] farver_2.0.1                             
## [214] annotate_1.64.0                          
## [215] viridis_0.5.1                            
## [216] Rgraphviz_2.30.0                         
## [217] xml2_1.2.2                               
## [218] rvcheck_0.1.7                            
## [219] systemPipeR_1.20.0                       
## [220] boot_1.3-22                              
## [221] BiocNeighbors_1.4.1                      
## [222] geneplotter_1.64.0                       
## [223] ggplotify_0.0.4                          
## [224] Category_2.52.1                          
## [225] DESeq2_1.26.0                            
## [226] bit_1.1-14                               
## [227] jpeg_0.1-8.1                             
## [228] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [229] ggraph_2.0.0                             
## [230] pkgconfig_2.0.3                          
## [231] gbRd_0.4-11                              
## [232] knitr_1.26

Analyzing RNA-seq data

Jared Andrews

Revised: February 25th, 2020