Introduction

This vignette walks through ChIP-seq analysis of the B cell lymphoma ChIP-seq data included in the future Andrews et al. paper. It uses the convenience functions included in this package to streamline the process.

Several command line tools as well as python are also utilized - each code block should indicate whether it is bash, R, or python code. If your system is set up properly, you can still run many of these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.

All data used here is available in GSE145841.

Some of this code is best ran on a computing cluster, as it takes a significant amount of time locally. Such instances will have - cluster appended to the language statement at the beginning of the code block.

Required Software:

R (3.6+)
python (ideally one conda environment with python2.7, another with python3)
ROSE
samtools
bedtools2
MACS - I used v2.1.0.20150420 but more recent ones are available now.
kentUtils
bowtie2

FAQ/Suggested Reading

That is, if you ask me one of the following questions and haven’t read/tried these first, I’m going to tell you to read/try them.

“What does this error during alignment mean?”
- Read the bowtie2 manual.
“Why do I have so few peaks after peakcalling?”
- Your data is probably low quality. This can be confirmed visually in a genome browser (no clear peaks == bad) or by viewing the fraction of reads in peaks/cross-correlation values during QC. You can play with the -m and -q parameters for MACS2 to adjust the sensitivity, but usually it just means your IP didn’t work all that well.
“Why won’t ROSE work?”
- Do you have R, samtools, and python 2.7 installed?
- Are you calling the command from the directory where you downloaded ROSE? Adding it to PATH does not seem to work.
“What are these mcfork errors about from ChIPQC/DiffBind?”
- Parallelization failure, especially if you’re on Windows/WSL. Running register(SerialParam()) will usually fix this.
“Your paths don’t make any sense!”
- Well, yeah, but much of the alignment, peak calling, etc., code is copy-pasted with minor editing merely to demonstrate the process and serve as a record for which parameters were used. It may require a smidgen of effort to adapt to your directory structure. The main purpose of this vignette is to show how the package works, which should be much more clear.

Alignment, Indexing, and Blacklisted Read Removal

In order to generate tracks, perform peak calling, and compare signal, we need to align all of our files to the genome. This process will also sort and removed blacklisted reads from the resulting BAM file, then index it. It requires your genome sequence in FASTA format, which you can easily download. Using a different genome version (or organism) is easy - just swap out the version (hg19 -> hg38) or look for your organism.

# bash
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz

# Grab chromosome sizes, need these for blacklisting.
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes
mkdir ./ChIP_Seq/ref
mv hg19.chrom.sizes ./ChIP_Seq/ref/hg19.chrom.sizes

bowtie2 also has to build an index for the genome prior to use.

# bash
bowtie-build hg19.fa hg19

Blacklisted regions are those that have high artificial signal in pretty much every ChIP experiment - usually due to being extremely repetitive or near centromeres. ENCODE identified these regions, so we can remove all such reads so they don’t bias our normalization/differential binding analyses later on. A (probably quicker) alternative whould be to remove peaks that overlap these regions.

# bash
wget https://www.encodeproject.org/files/ENCFF001TDO/@@download/ENCFF001TDO.bed.gz
gunzip ENCFF001TDO.bed.gz
mv ENCFF001TDO.bed ./ChIP_Seq/ref/ENCODE_blacklists.bed

# Due to how samtools view works, we create a file of regions we WANT to keep rather than exclude.
bedtools complement -i ./ChIP_Seq/ref/ENCODE_blacklists.bed \
-g ./ChIP_Seq/ref/hg19.chrom.sizes > ./ChIP_Seq/ref/hg19_blacklist_regions_removed.bed

Now we can perform the actual alignment.

# bash - cluster

# Directory containing bowtie indices, blacklist, etc.
genome="./ChIP_Seq/ref/hg19"
# Base directory containing reads and where additional directories will be created.
base_dir="./ChIP_Seq/READS/"

cd "$base_dir"
mkdir BAMS

# Alignment.
for f in ./ChIP_Seq/READS/*.fq.gz; do
  echo Aligning "$f"
  bowtie2 -p 6 -x ./ChIP_Seq/ref/hg19 -U "$f" | samtools view -@ 8 -b -S -u - \
  | samtools sort -@ 8 -m 5G -o ./ChIP_Seq/BAMS/"${f%.*}".sorted.bam -;
  samtools index ./ChIP_Seq/BAMS/"${f%.*}".sorted.bam
done

# Removal of blacklist reads.
for f in ./ChIP_Seq/BAMS/*.bam; do
    echo Removing blacklisted reads from "$f"
    base=${f##*/}
    samtools view -@ 8 -b -t ./ChIP_Seq/Ref/hg19.fa \
    -L ./ChIP_Seq/Ref/hg19_blacklist_regions_removed.bed \
    -o ./ChIP_Seq/BAMS/"${base%.*}".BL_removed.bam "$f";
    samtools index ./ChIP_Seq/BAMS/"${base%.*}".BL_removed.bam;
done

Peak Calling

Another essential step is peak calling. Tons of peak callers exist, some are better at broad peak calling, some are made for TFs, etc etc. MACS is one of the most common and most well-regarded peak callers out there. It generally does pretty well even if you botch the options, and the author has recently began developing it again. The script below is meant to run on a batch of files that have a corresponding input/IgG sample within the same folder.

Here I use a q-value of 0.01, as this should help reduce peaks called in samples with high background. You may adjust them as you see fit.

# bash - cluster

# This example is for H3K27AC samples, hence the wildcard use to capture that.

for f in ./ChIP_Seq/BAMs/*C.sorted.BL_removed.bam; do
    base=${f##*/}
    macs2 callpeak -t "$f" -c ./ChIP_Seq/BAMs/*INPUT.sorted.BL_removed.bam -n "$base" \
    --outdir ./ChIP_Seq/MACS/H3K27AC --tempdir ./ChIP_Seq/scratch/MACS2 -q 0.01 \
    --nomodel --shiftsize=150;
done

Pre-filter

I also only care about autosomal peaks, so I filter those out. I also have several marks, so I organized the resulting files into directories.

# bash
marks=(FAIRE H3AC H3K27AC H3K4ME1)
for m in "${marks[@]}"; do
    for f in ./ChIP_Seq/MACS2/"$m"/PEAKS_NARROW/*.narrowPeak; do
        sed '/_g\|chrM\|uN\|rand\|chrY\|chrX\|chr23/d' "$f" > "$f".clean
        rename.ul .narrowPeak.clean .clean.narrowPeak "$f".clean
    done
done

Super Enhancer Calling with ROSE

ROSE was the original tool made for SE calling, and it’s still used today. Some other tools do the same thing, but they all take the same approach, so might as well stick with the OG, yeah?

First, we need our peaks in a gff format.

# python

import glob
# Store file names
marks = ["FAIRE", "H3AC", "H3K27AC", "H3K4ME1"]

for m in marks:
  in_files = glob.glob("./ChIP_Seq/MACS2/" + m + "/PEAKS_NARROW/*.narrowPeak")
  
  for n in in_files:
    base = ".".join(input_file.split(".")[0:-1])
    output_file = "./ChIP_Seq/MACS2/" + m + "/PEAKS_NARROW/" + base + ".gff"
    
    with open(n) as f:
    
        # Open output file, "w" to make it writable
        output_f = open(output_file, "w")
        new_ID = 1
    
        for line in f:
            line = line.strip().split()
    
            # Handle if bed file already has an ID that should be retained.
            if len(line) > 3:
                if line[3] is not None and line[3] is not ".":
                    ID = line[3]
                else:
                    ID = new_ID
            else:
                ID = new_ID
    
            print(line[0], ID, "PEAK", str(int(line[1]) + 1), line[2],
                  ".", ".", ".", ID, sep="\t", file=output_f)
    
            new_ID += 1
    
        output_f.close()

Easy enough to use, just have to feed it the BAM file for both your sample and the input along with the peaks. Requires samtools and R on your PATH, so be sure both of those can be run. It also has to be run from the rose folder wherever you cloned it - trying to add the rose directory to your PATH really doesn’t work. These are pretty standard options. -t defines what you want to consider the promoter of a gene, in this case we’re using +/- 2500 bp. It excludes peaks in these regions from the stitching process.

# bash
base=/mnt/f/BCellRework/ChIP_Seq
for f in "$base"/BAMs/K27AC/TS*K27AC.sorted.BL_removed.bam; do
    samp=${f##*/};
    samp=${samp%.*};
    short=${samp%%_*};
    echo "processing $samp";
    python ROSE_main.py -g hg19 -r "$f" -t 2500 -c "$base"/BAMs/INPUT/"$short"_INPUT.sorted.bam \
    -o ../RESULTS/"$short" -i "$base"/ROSE/PEAKS_GFF/"$short"_peaks.clean.gff;
done

Making Tracks

Ultimately, we want to be able to visualize the results as well, so we can create at least semi-normalized tracks that account for read depth to make them more visually comparable. Always take track visualizations with a grain of salt - use them as representations of what you’ve found through more thorough statistical analyses, never try to use them to make claims or you’re gonna have a bad time.

I use deepTools to generate RPKM-normalized signal bigwig generation, and ignore chrM, X, and Y for these purposes. It can be installed with a simple pip install deeptools.

# bash
mkdir ./ChIP_Seq/TRACKS
for f in ./ChIP_Seq/BAMs/*BL_removed.bam; do 
  echo "$f"; 
  bamCoverage --bam "$f" -o ./ChIP_Seq/TRACKS/"$f".rpkm.bw --binSize 10 \
  --ignoreForNormalization chrX chrM chrY --extendReads 150 \
  --normalizeUsing RPKM -p max/2; 
done

For the peaks, look up the bigNarrowPeak format and convert to it with bedToBigBed. You will have to fetch the hg19 chromosome sizes first with UCSC’s fetchChromSizes utility. The kentUtils tools will give you everything you need for this. You might have to change the score in the 5th column for a few lines by running something like:

# bash
for f in *.narrowPeak; do
  awk '{if ($5 > 1000) $5 = 1000; print $0}' "$f" > "$f".fix;
  bedToBigBed -as=bigNarrowPeak.as -type=bed6+4 "$f".fix hg19.chrom.sizes "$f".bb;
done

Analysis with R

Install this package if you haven’t already with devtools::install_github("j-andrews7/OneLinerOmics"). If you don’t have devtools, it can be installed with install.packages("devtools").

# R
suppressPackageStartupMessages(library(OneLinerOmics))

QC

First, we’ll QC samples, as ChIP tends to be much more variable than RNA-seq in terms of quality. In particular, we’re interested in cross-correlation values and the fraction of reads in peaks, which will help to identify poor quality samples where our IP just didn’t work all that well.

These are difficult samples to work with, after all, and we had very low numbers of cells for many of them. Similar to the RNA-seq analysis, we will use a sample sheet, but this one requires a more strict structure. The columns shown are the only ones allowed. We will also use these sample sheets later on, so if you have more complicated comparisons you want to do, be sure to structure your sheet appropriately so that they are modeled fully within one column.

Pay careful attention to provide the correct file paths to your peak and read files as well as the input controls.

# R
# Basic sample sheet.
samples <- read.table("./ChIP_Seq/ChIPQC/H3AC/SampleSheet_H3AC_QC_narrow.NoCTRL.csv", header=TRUE, sep = ",")
head(samples, 10)

I use ChIPQC to run QC. This package can be a…challenge to use. It is capable of generating a report, but often errors, so this package just uses metrics without the report.

The authors are also slow to keep it up to date, so new R/Bioconductor versions break it all the time. I actually had to download it from another repo that included a fix to get it to run this time around. Installing this package should install that version, so you shouldn’t have to mess with it.

# R
h3ac.exp <- RunChIPQC(outpath = "./ChIP_Seq/ChIPQC/H3AC/", 
  samplesheet = "./ChIP_Seq/ChIPQC/H3AC/SampleSheet_H3AC_QC_narrow.NoCTRL.csv")

faire.exp <- RunChIPQC("./ChIP_Seq/ChIPQC/FAIRE/",
  "./ChIP_Seq/ChIPQC/FAIRE/SampleSheet_FAIRE_QC_narrow.NoCTRL.csv")

h3k4me1.exp <- RunChIPQC("./ChIP_Seq/ChIPQC/H3K4ME1/",
  "./ChIP_Seq/ChIPQC/H3K4ME1/SampleSheet_H3K4ME1_QC_narrow.NoCTRL.csv")

h3k27ac.exp <- RunChIPQC("./ChIP_Seq/ChIPQC/H3K27AC/",
  "./ChIP_Seq/ChIPQC/H3K27AC/SampleSheet_H3K27AC_QC_narrow.NoCTRL.csv")

In addition to the metrics, this function will generate a plot of the peak overlaps between samples, showing how overlap criteria will impact your consensus peak set. By default, any peaks that overlap in at least 2 samples will be retained and merged. Making this more stringent will help reduce noise at the cost of potentially leaving out real peaks that are unique to a few samples. This curve usually exhibits a near geometric drop-off, with the number of overlapping peaks halving for each increase in strictness. If the drop-off is very steep, it likely means many of your peaks are false positives (or your samples are highly variable). High peak agreement will lead to a less severe drop-off.

Remove poor quality samples

Kick ’em off the sample sheets. I removed samples with the following metrics: * FAIRE - <2.5% Reads in Peaks * H3AC - <8.5% Reads in Peaks * H3K4ME1 - <7.5% Reads in Peaks * H3K27AC - <4.5% Reads in Peaks

Picked rather arbitrarily, but generally want at least 2-3x the typical RiP of the input samples.

Perform differential binding analysis

Another one-liner. This analysis uses DiffBind along with ChIPseeker for peak annotation and enrichR (wrapped by EZscRNA) for enrichment analyses.

I have 4 different marks of interest here, so I’m going to run it on each of them as well as the super enhancers from the H3k27ac data. From our QC up above, I know that I want to bump up the number of samples containing a peak for it to be merged, so I set n.consensus = 3 for all but the super enhancers.

# R
# For annotation.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

ses <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/H3K27AC_SE/SampleSheet_H3K27AC_DiffBind_SEs.csv", 
                   txdb = txdb, 
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(0.5, 1, 2),
                   heatmap.preset = "BrTe",
                   scale.full = FALSE)

h3ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3AC/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/H3AC/SampleSheet_H3AC_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 3,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "OrPu",
                   scale.full = FALSE)
faire <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/FAIRE/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/FAIRE/SampleSheet_FAIRE_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 3,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "PuGr",
                   scale.full = FALSE)
h3k4me1 <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K4ME1/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/H3K4ME1/SampleSheet_H3K4ME1_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 3,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "BuOr",
                   scale.full = FALSE)
h3k27ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K27AC/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/H3K27AC/SampleSheet_H3K27AC_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 3,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "BrTe",
                   scale.full = FALSE)

Merging Marks

In this case, I want to generate a peakset to be used across all the marks for differential binding, allowing for easier identification of peaks with multiple marks altered between two groups of samples. Since I’ve already removed a fair amount of noise by setting n.consensus = 3, I’m just going to concatenate and merge the consensus peaks from each mark. Then I’ll edit my samplesheets above to use this new peakset for each sample and re-run the differential binding analysis for each mark.

Here I’m just concatenating and merging the position from a random comparison - it’ll be the same positions across all comparisons, as we’re not filtering at all, so it doesn’t matter which is used.

# bash
tail -n +2 -q ./ChIP_Seq/DiffBind/H3K27AC/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt \
  ./ChIP_Seq/DiffBind/FAIRE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt \
  ./ChIP_Seq/DiffBind/H3AC/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt \
  ./ChIP_Seq/DiffBind/H3K4ME1/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt | \
  cut -f 1-3 - | sort -k1,1V -k2,2n - >  ./ChIP_Seq/DiffBind/MergedPeaks/All.Peaks.bed

bedtools merge ./ChIP_Seq/DiffBind/MergedPeaks/All.Peaks.bed > ./ChIP_Seq/DiffBind/MergedPeaks/MergedPeaks.bed

Then I edited my sample sheets to use the merged peakset for all samples.

Running DiffBind on a Merged, Consensus Peakset

Now the analysis can be run again using the new samplesheets to generate results for the same peaks across all marks.

# R

# Using the consensus SEs for H3ac, FAIRE, and H3K4me1 as well, as I want to be able to show signal correlation.
ses_k27 <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons",
                    samplesheet = "./ChIP_Seq/DiffBind/H3K27AC_SE/SampleSheet_H3K27AC_DiffBind_SEs.csv", 
                    txdb = txdb, 
                    level = "Tissue", 
                    fdr.thresh = c(0.05, 0.01),
                    fc.thresh = c(0.5, 1, 2),
                    heatmap.preset = "BrTe",
                    scale.full = FALSE)
 
h3ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/H3AC/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/H3AC/SampleSheet_H3AC_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 1,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "OrPu",
                   scale.full = FALSE,
                   flank.anno = FALSE)

faire <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/FAIRE/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/FAIRE/SampleSheet_FAIRE_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 1,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "PuGr",
                   scale.full = FALSE,
                   flank.anno = FALSE)

h3k4me1 <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/H3K4ME1/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/H3K4ME1/SampleSheet_H3K4ME1_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 1,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "BuOr",
                   scale.full = FALSE,
                   flank.anno = FALSE)

h3k27ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/H3K27AC/AllComparisons",
                   samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/H3K27AC/SampleSheet_H3K27AC_DiffBind.csv", 
                   txdb = txdb,
                   n.consensus = 1,
                   se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
                   level = "Tissue", 
                   fdr.thresh = c(0.05, 0.01),
                   fc.thresh = c(1, 2),
                   heatmap.preset = "BrTe",
                   scale.full = FALSE,
                   flank.anno = FALSE)

I also want to run another comparison (BCL v Tonsils), so I will set that up as well.

Session Information

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.14.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2          rstudioapi_0.10     knitr_1.26         
##  [4] magrittr_1.5        MASS_7.3-51.4       R6_2.4.1           
##  [7] rlang_0.4.1         stringr_1.4.0       tools_3.6.1        
## [10] xfun_0.11           htmltools_0.4.0     yaml_2.2.0         
## [13] assertthat_0.2.1    digest_0.6.22       rprojroot_1.3-2    
## [16] pkgdown_1.4.1       crayon_1.3.4        bookdown_0.16      
## [19] BiocManager_1.30.10 fs_1.3.1            memoise_1.1.0      
## [22] evaluate_0.14       rmarkdown_2.0       stringi_1.4.3      
## [25] compiler_3.6.1      desc_1.2.0          backports_1.1.5    
## [28] jsonlite_1.6        reticulate_1.14

Analyzing ChIP-seq data

Jared Andrews

Revised: February 25th, 2020