halfBaked - RNA-seq - Pin Generation Template

Study - Experiment - RNA-seq Data - Pin Generation

This notebook details how to create reusable, easily shared pins for RNA-seq data. Each pin holds a SummarizedExperiment object containing gene counts and TPMs along with sample metadata, which is pretty much everything one would need to perform downstream analyses.

For convenience, we’ll also store an experiment/dataset description and the code used to generate the pin (i.e. this notebook).

It’s recommended to detail how gene counts were generated here.

We typically use the nf-core RNA-seq pipeline to generate BAMs and salmon quantifications, which are then used to generate gene counts appropriate for downstream use. This is convenient as the samplesheets can double as our sample metadata tables here.

Noting the pipeline version, genome information, and any altered parameters here is recommended.

As an example, this notebook details pin generation for data from GSE135880, which was processed with the nf-core RNA-seq pipeline:

nextflow run nf-core/rnaseq -r 3.12.0 -profile singularity \
-c "$BAKER_REF"/nf_configs/rnaseq.config \
-w /scratch_space/jandrews/"$LSB_JOBNAME" \
--outdir ./nfcore_mm10 --email jared.andrews@stjude.org \
--input nfcore_rnaseq.samplesheet.csv --gencode --genome MM10 \
--aligner star_salmon \
--pseudo_aligner salmon --max_memory 128.GB --skip_stringtie \
--max_multiqc_email_size 15.MB -resume

Params Usage

This notebook uses YAML params in the header to specify the pin name, the pin description, the species, and the board to upload the pin to. This makes it easy to re-use the notebook for different datasets and boards. The species value ("mouse" or "human") selects the annotation package used for gene ID mapping.

params:
  board: "your_server_info"
  pin_name: "GSE135880_SummarizedExperiment_mm10"
  pin_description: "A SummarizedExperiment containing O4+ immunopanned oligodendrocyte precursor cells (OPCs) from cortices of P5 or P6 Eed KO or control mice."
  species: "mouse"

Read more about the pins package if you don’t know what a “board” is or how to use one.
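
As a rough illustration of how these values are consumed, the sketch below mimics the `params` list that is available while the notebook renders. The stand-in list and the `pin_file` variable are assumptions for illustration only; in a real render the values come from the YAML header or from the `params` argument of `rmarkdown::render()`.

```r
# Outside of a render, `params` does not exist, so build a stand-in list.
# These values are placeholders for illustration only.
if (!exists("params")) {
    params <- list(
        board = "your_server_info",
        pin_name = "GSE135880_SummarizedExperiment_mm10",
        species = "mouse"  # read by the gene ID mapping code in this notebook
    )
}

# Values are accessed like any other list element.
pin_file <- paste0(params$pin_name, ".rds")
print(pin_file)

# To re-use the notebook for another dataset, values declared in the YAML
# header can be overridden at render time, e.g. (not run here):
# rmarkdown::render("0.RNAseq_PinGen_Template.Rmd",
#                   params = list(pin_name = "another_dataset_pin"))
```

Overriding at render time keeps the notebook itself unchanged between datasets, at the cost of the YAML header no longer recording the values that were actually used.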

Experiment Info

description <- "
  **Wang J. et al SciAdv 2020 - GSE135880 - Mouse OPCs with _Eed_ KO - RNA-seq Data**

  This dataset contains O4+ immunopanned oligodendrocyte precursor cells (OPCs) from 
  cortices of P5 or P6 Eed KO or control mice.
  See the [associated publication](https://www.science.org/doi/10.1126/sciadv.aaz6477) 
  for more details.

  Please see the Pin generation code to view how this data was processed.
  "

Load Data & Create SummarizedExperiment

library(SummarizedExperiment)
library(readr)
library(pins)
library(DESeq2)
library(edgeR)

# For gene ID mapping
if (params$species == "mouse") {
    library(org.Mm.eg.db)
    org.db <- org.Mm.eg.db
} else if (params$species == "human") {
    library(org.Hs.eg.db)
    org.db <- org.Hs.eg.db
} else {
    stop("Organism not supported.")
}

# Load sample metadata.
meta <- read.csv("nfcore_rnaseq.samplesheet.csv", header = TRUE,
                 stringsAsFactors = TRUE)

# Drop FASTQ file locations.
meta <- meta[, !colnames(meta) %in% c("fastq_1", "fastq_2")]

# Load counts. The nf-core RNA-seq pipeline generates this table from the salmon
# quantifications via tximport; it is appropriate for pretty much all downstream
# DE packages (DESeq2, edgeR, limma).
# check.names = FALSE keeps column names exactly as they appear in the file.
cts <- read.table("salmon.merged.gene_counts_length_scaled.tsv", header = TRUE, 
                  sep = "\t", stringsAsFactors = FALSE, check.names = FALSE)

# Counts table has first two columns as gene IDs and gene symbols.
genes <- cts[, 1:2]
names(genes) <- c("ENSEMBL", "SYMBOL")
rownames(cts) <- cts[, 1]

# Remove the gene version info from the ENSEMBL IDs
genes$ENSEMBL <- gsub("\\..*", "", genes$ENSEMBL)

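# Map ENSEMBL IDs to ENTREZ IDs with mapIds(), which returns a named vector.
# multiVals = "first" keeps the first match when an ID maps to several ENTREZ IDs.
genes$ENTREZ <- mapIds(org.db,
                       keys = genes$ENSEMBL,
                       column = "ENTREZID",
                       keytype = "ENSEMBL",
                       multiVals = "first")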

# Set metadata rownames and ensure they match count column names.
rownames(meta) <- meta$sample
stopifnot(all(rownames(meta) %in% colnames(cts)))

# Also carry along TPMs as an additional assay.
tpms <- read.table("salmon.merged.gene_tpm.tsv", header = TRUE, 
                   sep = "\t", stringsAsFactors = FALSE, check.names = FALSE)
rownames(tpms) <- tpms[, 1]
tpms <- tpms[, rownames(meta)]

cts <- cts[, rownames(meta)]

# Create a SummarizedExperiment object.
se <- SummarizedExperiment(
    assays = list(counts = round(as.matrix(cts)),
                  tpm = as.matrix(tpms),
                  log2tpm = log2(as.matrix(tpms) + 1)),
    colData = meta,
    rowData = genes
)

# Limit to reasonably expressed genes, adjust design or use `group` as needed.
design <- model.matrix(~0 + Group, data = colData(se))
keep <- filterByExpr(se, design = design)
se <- se[keep, ]

# Add various normalized counts
assay(se, "vst") <- vst(assay(se, "counts"))
assay(se, "cpm") <- cpm(se)
assay(se, "log2cpm") <- cpm(se, log = TRUE)
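
Two steps in the code above are easy to get subtly wrong: stripping version suffixes from ENSEMBL IDs and lining up count columns with the sample metadata. The sketch below walks through both in base R on a made-up table with two genes and two samples. The sample names, counts, ID version suffixes, and the `toy_*` object names are invented for illustration and are not taken from GSE135880.

```r
# Toy counts table shaped like the pipeline output: gene ID, symbol, then samples.
# All values here are made up for illustration.
toy_cts <- data.frame(
    gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000028.15"),
    gene_name = c("Gnai3", "Cdc45"),
    KO_1 = c(10.4, 0.6),
    WT_1 = c(8.2, 3.6)
)
toy_meta <- data.frame(sample = c("WT_1", "KO_1"), Group = c("WT", "KO"))
rownames(toy_meta) <- toy_meta$sample

# Strip the version suffix (everything from the first "." onward).
ensembl <- gsub("\\..*", "", toy_cts$gene_id)

# Fail early if any sample in the metadata lacks a counts column.
stopifnot(all(rownames(toy_meta) %in% colnames(toy_cts)))

# Reorder count columns to follow the metadata row order, dropping the
# gene ID and symbol columns, then round to whole counts.
rownames(toy_cts) <- toy_cts$gene_id
toy_mat <- round(as.matrix(toy_cts[, rownames(toy_meta)]))

print(ensembl)
print(toy_mat)
```

Indexing the counts by `rownames(toy_meta)` both drops the annotation columns and fixes the sample order, so the assay columns and the metadata rows end up in the same order.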

Add Metadata to Summarized Experiment and Upload Pin

# Render the notebook, read the rendered output in as a string, and store it in
# the object's metadata along with the experiment description.
pin_code <- read_file(rmarkdown::render("0.RNAseq_PinGen_Template.Rmd", quiet = TRUE))
metadata(se) <- list(pin_code = pin_code, description = description)

# Save the object locally as well.
saveRDS(se, file = paste0(params$pin_name, ".rds"))

# Here you would connect to your board and write the pin to it.
board <- board_connect(server = params$board)
pin_write(board, se, type = "rds", name = params$pin_name, title = params$pin_name, 
          description = params$pin_description)
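
Once written, the pin can be read back by anyone with access to the board. The sketch below shows the round trip using a temporary local board (`pins::board_temp()`) and a small stand-in list in place of the real SummarizedExperiment. The pin name `toy_pin` and the object contents are hypothetical. Against a real board, `board_connect()` would take the place of `board_temp()`.

```r
library(pins)

# A throwaway board that lives in a temporary directory.
board <- board_temp()

# Stand-in object; in practice this would be the SummarizedExperiment.
toy_obj <- list(counts = matrix(1:4, nrow = 2), description = "toy example")

pin_write(board, toy_obj, name = "toy_pin", type = "rds",
          title = "toy_pin", description = "A toy pin for illustration.")

# Read the object back and look up the stored pin metadata.
restored <- pin_read(board, "toy_pin")
info <- pin_meta(board, "toy_pin")
print(info$description)
```

For the real pin, the rendered notebook and experiment description stored earlier would then be available via `metadata(se)$pin_code` and `metadata(se)$description` on the object returned by `pin_read()`.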

Session Info

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.36.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37       desc_1.4.3          R6_2.6.1           
##  [4] bookdown_0.43       fastmap_1.2.0       xfun_0.52          
##  [7] cachem_1.1.0        knitr_1.50          htmltools_0.5.8.1  
## [10] rmarkdown_2.29      lifecycle_1.0.4     cli_3.6.5          
## [13] pkgdown_2.1.3       sass_0.4.10         textshaping_1.0.1  
## [16] jquerylib_0.1.4     systemfonts_1.2.3   compiler_4.5.1     
## [19] tools_4.5.1         ragg_1.4.0          evaluate_1.0.3     
## [22] bslib_0.9.0         yaml_2.3.10         BiocManager_1.30.26
## [25] jsonlite_2.0.0      rlang_1.1.6         fs_1.6.6

Jared Andrews