Introduction

This vignette walks through analysis of the B cell lymphoma RNA-seq data included in the future Andrews et al. paper. It uses the convenience functions included in this package to streamline the process.

Several command line tools as well as python are also utilized - each code block should indicate whether it is bash, R, or python code. If your system is set up properly, you can still run these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.

The large majority of this vignette can be skipped if you want, the raw gene counts (or FASTQs if you want the whole shebang) is available in GSE145842. That accession also contains a table with the normalized gene counts as output from DESeq2, if you just want to spot-check your gene of interest in this dataset.

Required Software:

  • R (3.6+)
  • python
  • salmon (v0.14+)
  • bedtools - only needed if generating salmon decoy transcriptome
  • mashmap - only needed if generating salmon decoy transcriptome

Suggested Reading

That is, if you ask me a question and haven’t read these first, I’m going to tell you to read them.

Generating gene counts

I prefer the salmon -> DESeq2 route, as it’s fast and accurate. This requires R, salmon v0.14, and DESeq2 (along with a few other packages). salmon needs mashmap and bedtools as well if generating “decoy-aware” transcriptomes. The Google can show you how to install those.

Create decoy reference transcriptome

Newer versions of salmon recommend using a decoy-aware transcriptome to generate the index file to be used for selective alignment. This process is a bit annoying, so the authors have pre-computed some decoy transcriptomes for hg19 and hg38 with different annotations. In this case, I used the GENCODE v31 annotations that had been lifted over to hg19, so I had to generate them myself.

This had to be run on a cluster with 128 GB RAM. If you can avoid this step by using their pre-generated decoys, I recommend doing so.

You will likely have a different directory stucture, so paths will have to be edited as appropriate.

First, we’ll download all the annotation info and sequence data we need.

Download the script to generate the decoys.

This script crashed a lot for me, so I ended up editing it to just snag the last 10 commands or so (those with the step comments) and removing the variable parsing at the beginning, as it uses some deprecated linux commands (readpath) that I couldn’t install on our computing cluster.

Check Mapping Rates

An easy diagnostic to ensure everything worked properly is just to check the mapping rates for each sample. Depending on the experimental setup and annotations, a mapping rate of 50-90% may be expected. Really low rates (<40% or so) are worrying, and such samples should be removed from the analysis if possible.

Analysis with R

Install this package if you haven’t already with devtools::install_github("j-andrews7/AndrewsBCellLymphoma"). If you don’t have devtools, it can be installed with install.packages("devtools").

The only thing you need now is a sample sheet, which is just a table with sample names and metadata columns. It can include as many metadata columns as you’d like.

Run DESeq2

DESeq2 is a very popular R package for performing differential expression analysis. Raw RNA-seq counts follow a negative binomial distribution, which is how DESeq2 models the data. It is quick, flexible, and generally works quite well with little to no parameter tweaking. It can also help to mitigate the technical or batch effects on the analysis.

# R
bcl.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/BCL.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "status", padj.thresh = c(0.05, 0.01, 0.001), plot.box = TRUE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

subtypes.v.tonsils <- RunDESeq2(outpath = "./RNA_Seq/Final_Analyses/Subtypes.v.Tonsils", quants.path = "./RNA_Seq/quants", 
          samplesheet = "./RNA_Seq/Samples.txt", tx2gene = "./RNA_Seq/ref/tx2gene.txt",
          level = "disease", padj.thresh = c(0.05, 0.01, 0.001), plot.box = FALSE,
          plot.annos = c("disease", "disease.sub", "status"), count.filt = 50)

Next Steps

Yes, that’s it. Explore the plots, tables, enrichments, etc. Re-run if necessary, tweaking parameters as you see fit.

Session Information

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] OneLinerOmics_0.1.0.9000    DiffBind_2.14.0            
##  [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
##  [5] BiocParallel_1.20.0         matrixStats_0.55.0         
##  [7] Biobase_2.46.0              GenomicRanges_1.38.0       
##  [9] GenomeInfoDb_1.22.0         IRanges_2.20.0             
## [11] S4Vectors_0.24.0            BiocGenerics_0.32.0        
## [13] enrichR_2.1                 BiocStyle_2.14.2           
## 
## loaded via a namespace (and not attached):
##   [1] rsvd_1.0.2                               
##   [2] Hmisc_4.2-0                              
##   [3] ica_1.0-2                                
##   [4] Rsamtools_2.2.1                          
##   [5] lmtest_0.9-37                            
##   [6] rprojroot_1.3-2                          
##   [7] crayon_1.3.4                             
##   [8] MASS_7.3-51.4                            
##   [9] nlme_3.1-140                             
##  [10] backports_1.1.5                          
##  [11] GOSemSim_2.12.0                          
##  [12] rlang_0.4.1                              
##  [13] XVector_0.26.0                           
##  [14] ROCR_1.0-7                               
##  [15] ChIPQC_1.21.0                            
##  [16] irlba_2.3.3                              
##  [17] limma_3.42.0                             
##  [18] scater_1.14.0                            
##  [19] GOstats_2.52.0                           
##  [20] rjson_0.2.20                             
##  [21] chipseq_1.36.0                           
##  [22] bit64_0.9-7                              
##  [23] glue_1.3.1                               
##  [24] pheatmap_1.0.12                          
##  [25] sctransform_0.2.1                        
##  [26] vipor_0.4.5                              
##  [27] AnnotationDbi_1.48.0                     
##  [28] vsn_3.54.0                               
##  [29] base64url_1.4                            
##  [30] DOSE_3.12.0                              
##  [31] tidyselect_0.2.5                         
##  [32] fitdistrplus_1.0-14                      
##  [33] XML_3.98-1.20                            
##  [34] tidyr_1.0.0                              
##  [35] zoo_1.8-6                                
##  [36] GenomicAlignments_1.22.1                 
##  [37] xtable_1.8-4                             
##  [38] magrittr_1.5                             
##  [39] evaluate_0.14                            
##  [40] bibtex_0.4.2.2                           
##  [41] ggplot2_3.2.1                            
##  [42] Rdpack_0.11-1                            
##  [43] zlibbioc_1.32.0                          
##  [44] sn_1.5-4                                 
##  [45] hwriter_1.3.2                            
##  [46] rstudioapi_0.10                          
##  [47] EZscRNA_0.1.0.9000                       
##  [48] EnhancedVolcano_1.4.0                    
##  [49] rpart_4.1-15                             
##  [50] fastmatch_1.1-0                          
##  [51] BiocSingular_1.2.1                       
##  [52] xfun_0.11                                
##  [53] askpass_1.1                              
##  [54] multtest_2.42.0                          
##  [55] cluster_2.1.0                            
##  [56] caTools_1.17.1.3                         
##  [57] tidygraph_1.1.2                          
##  [58] tibble_2.1.3                             
##  [59] ggrepel_0.8.1                            
##  [60] ape_5.3                                  
##  [61] listenv_0.8.0                            
##  [62] Biostrings_2.54.0                        
##  [63] png_0.1-7                                
##  [64] future_1.15.1                            
##  [65] zeallot_0.1.0                            
##  [66] withr_2.1.2                              
##  [67] bitops_1.0-6                             
##  [68] ggforce_0.3.1                            
##  [69] RBGL_1.62.1                              
##  [70] plyr_1.8.5                               
##  [71] GSEABase_1.48.0                          
##  [72] pillar_1.4.3                             
##  [73] RcppParallel_4.4.4                       
##  [74] gplots_3.0.1.1                           
##  [75] GenomicFeatures_1.38.0                   
##  [76] multcomp_1.4-11                          
##  [77] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
##  [78] fs_1.3.1                                 
##  [79] europepmc_0.3                            
##  [80] DelayedMatrixStats_1.8.0                 
##  [81] vctrs_0.2.0                              
##  [82] urltools_1.7.3                           
##  [83] metap_1.2                                
##  [84] tools_3.6.1                              
##  [85] foreign_0.8-71                           
##  [86] beeswarm_0.2.3                           
##  [87] munsell_0.5.0                            
##  [88] tweenr_1.0.1                             
##  [89] fgsea_1.12.0                             
##  [90] compiler_3.6.1                           
##  [91] rtracklayer_1.46.0                       
##  [92] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
##  [93] plotly_4.9.1                             
##  [94] GenomeInfoDbData_1.2.2                   
##  [95] gridExtra_2.3                            
##  [96] edgeR_3.28.0                             
##  [97] lattice_0.20-38                          
##  [98] mutoss_0.1-12                            
##  [99] AnnotationForge_1.28.0                   
## [100] dplyr_0.8.3                              
## [101] BiocFileCache_1.10.2                     
## [102] affy_1.64.0                              
## [103] jsonlite_1.6                             
## [104] scales_1.0.0                             
## [105] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2  
## [106] graph_1.64.0                             
## [107] pbapply_1.4-2                            
## [108] genefilter_1.68.0                        
## [109] lazyeval_0.2.2                           
## [110] latticeExtra_0.6-29                      
## [111] R.utils_2.9.2                            
## [112] reticulate_1.14                          
## [113] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2  
## [114] brew_1.0-6                               
## [115] checkmate_1.9.4                          
## [116] rmarkdown_2.0                            
## [117] pkgdown_1.4.1                            
## [118] sandwich_2.5-1                           
## [119] cowplot_1.0.0                            
## [120] Rtsne_0.15                               
## [121] BSgenome_1.54.0                          
## [122] uwot_0.1.5                               
## [123] igraph_1.2.4.2                           
## [124] survival_2.44-1.1                        
## [125] numDeriv_2016.8-1.1                      
## [126] yaml_2.2.0                               
## [127] plotrix_3.7-7                            
## [128] htmltools_0.4.0                          
## [129] memoise_1.1.0                            
## [130] VariantAnnotation_1.32.0                 
## [131] Seurat_3.1.2                             
## [132] locfit_1.5-9.1                           
## [133] graphlayouts_0.5.0                       
## [134] batchtools_0.9.11                        
## [135] viridisLite_0.3.0                        
## [136] digest_0.6.22                            
## [137] assertthat_0.2.1                         
## [138] rappdirs_0.3.1                           
## [139] npsurv_0.4-0                             
## [140] RSQLite_2.1.2                            
## [141] amap_0.8-18                              
## [142] future.apply_1.4.0                       
## [143] lsei_1.2-0                               
## [144] data.table_1.12.6                        
## [145] blob_1.2.0                               
## [146] R.oo_1.23.0                              
## [147] preprocessCore_1.48.0                    
## [148] splines_3.6.1                            
## [149] Formula_1.2-3                            
## [150] RCurl_1.95-4.12                          
## [151] hms_0.5.2                                
## [152] colorspace_1.4-1                         
## [153] base64enc_0.1-3                          
## [154] BiocManager_1.30.10                      
## [155] mnormt_1.5-5                             
## [156] ggbeeswarm_0.6.0                         
## [157] SDMTools_1.1-221.2                       
## [158] tximport_1.14.0                          
## [159] nnet_7.3-12                              
## [160] Rcpp_1.0.2                               
## [161] bookdown_0.16                            
## [162] RANN_2.6.1                               
## [163] mvtnorm_1.0-11                           
## [164] enrichplot_1.6.1                         
## [165] Nozzle.R1_1.1-1                          
## [166] ChIPseeker_1.22.1                        
## [167] R6_2.4.1                                 
## [168] grid_3.6.1                               
## [169] ggridges_0.5.1                           
## [170] lifecycle_0.1.0                          
## [171] acepack_1.4.1                            
## [172] ShortRead_1.44.1                         
## [173] TFisher_0.2.0                            
## [174] curl_4.3                                 
## [175] affyio_1.56.0                            
## [176] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2  
## [177] gdata_2.18.0                             
## [178] leiden_0.3.1                             
## [179] DO.db_2.9                                
## [180] Matrix_1.2-17                            
## [181] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2     
## [182] qvalue_2.18.0                            
## [183] desc_1.2.0                               
## [184] RcppAnnoy_0.0.14                         
## [185] TH.data_1.0-10                           
## [186] org.Hs.eg.db_3.10.0                      
## [187] RColorBrewer_1.1-2                       
## [188] stringr_1.4.0                            
## [189] htmlwidgets_1.5.1                        
## [190] polyclip_1.10-0                          
## [191] triebeard_0.3.0                          
## [192] biomaRt_2.42.0                           
## [193] purrr_0.3.3                              
## [194] gridGraphics_0.4-1                       
## [195] globals_0.12.5                           
## [196] openssl_1.4.1                            
## [197] htmlTable_1.13.3                         
## [198] codetools_0.2-16                         
## [199] GO.db_3.10.0                             
## [200] gtools_3.8.1                             
## [201] prettyunits_1.0.2                        
## [202] SingleCellExperiment_1.8.0               
## [203] dbplyr_1.4.2                             
## [204] R.methodsS3_1.7.1                        
## [205] gtable_0.3.0                             
## [206] tsne_0.1-3                               
## [207] DBI_1.1.0                                
## [208] httr_1.4.1                               
## [209] KernSmooth_2.23-15                       
## [210] stringi_1.4.3                            
## [211] progress_1.2.2                           
## [212] reshape2_1.4.3                           
## [213] farver_2.0.1                             
## [214] annotate_1.64.0                          
## [215] viridis_0.5.1                            
## [216] Rgraphviz_2.30.0                         
## [217] xml2_1.2.2                               
## [218] rvcheck_0.1.7                            
## [219] systemPipeR_1.20.0                       
## [220] boot_1.3-22                              
## [221] BiocNeighbors_1.4.1                      
## [222] geneplotter_1.64.0                       
## [223] ggplotify_0.0.4                          
## [224] Category_2.52.1                          
## [225] DESeq2_1.26.0                            
## [226] bit_1.1-14                               
## [227] jpeg_0.1-8.1                             
## [228] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [229] ggraph_2.0.0                             
## [230] pkgconfig_2.0.3                          
## [231] gbRd_0.4-11                              
## [232] knitr_1.26