vignettes/ChIPseqAnalysis.Rmd
ChIPseqAnalysis.Rmd
This vignette walks through ChIP-seq analysis of the B cell lymphoma ChIP-seq data included in the future Andrews et al. paper. It uses the convenience functions included in this package to streamline the process.
Several command line tools as well as python are also utilized - each code block should indicate whether it is bash
, R
, or python
code. If your system is set up properly, you can still run many of these code chunks in Rmarkdown files within RStudio, assuming you have all of the software installed.
All data used here is available in GSE145841.
Some of this code is best ran on a computing cluster, as it takes a significant amount of time locally. Such instances will have - cluster
appended to the language statement at the beginning of the code block.
Required Software:
That is, if you ask me one of the following questions and haven’t read/tried these first, I’m going to tell you to read/try them.
-m
and -q
parameters for MACS2 to adjust the sensitivity, but usually it just means your IP didn’t work all that well.PATH
does not seem to work.mcfork
errors about from ChIPQC/DiffBind?”
register(SerialParam())
will usually fix this.In order to generate tracks, perform peak calling, and compare signal, we need to align all of our files to the genome. This process will also sort and removed blacklisted reads from the resulting BAM
file, then index it. It requires your genome sequence in FASTA
format, which you can easily download. Using a different genome version (or organism) is easy - just swap out the version (hg19 -> hg38) or look for your organism.
# bash
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
# Grab chromosome sizes, need these for blacklisting.
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes
mkdir ./ChIP_Seq/ref
mv hg19.chrom.sizes ./ChIP_Seq/ref/hg19.chrom.sizes
bowtie2
also has to build an index for the genome prior to use.
Blacklisted regions are those that have high artificial signal in pretty much every ChIP experiment - usually due to being extremely repetitive or near centromeres. ENCODE identified these regions, so we can remove all such reads so they don’t bias our normalization/differential binding analyses later on. A (probably quicker) alternative whould be to remove peaks that overlap these regions.
# bash
wget https://www.encodeproject.org/files/ENCFF001TDO/@@download/ENCFF001TDO.bed.gz
gunzip ENCFF001TDO.bed.gz
mv ENCFF001TDO.bed ./ChIP_Seq/ref/ENCODE_blacklists.bed
# Due to how samtools view works, we create a file of regions we WANT to keep rather than exclude.
bedtools complement -i ./ChIP_Seq/ref/ENCODE_blacklists.bed \
-g ./ChIP_Seq/ref/hg19.chrom.sizes > ./ChIP_Seq/ref/hg19_blacklist_regions_removed.bed
Now we can perform the actual alignment.
# bash - cluster
# Directory containing bowtie indices, blacklist, etc.
genome="./ChIP_Seq/ref/hg19"
# Base directory containing reads and where additional directories will be created.
base_dir="./ChIP_Seq/READS/"
cd "$base_dir"
mkdir BAMS
# Alignment.
for f in ./ChIP_Seq/READS/*.fq.gz; do
echo Aligning "$f"
bowtie2 -p 6 -x ./ChIP_Seq/ref/hg19 -U "$f" | samtools view -@ 8 -b -S -u - \
| samtools sort -@ 8 -m 5G -o ./ChIP_Seq/BAMS/"${f%.*}".sorted.bam -;
samtools index ./ChIP_Seq/BAMS/"${f%.*}".sorted.bam
done
# Removal of blacklist reads.
for f in ./ChIP_Seq/BAMS/*.bam; do
echo Removing blacklisted reads from "$f"
base=${f##*/}
samtools view -@ 8 -b -t ./ChIP_Seq/Ref/hg19.fa \
-L ./ChIP_Seq/Ref/hg19_blacklist_regions_removed.bed \
-o ./ChIP_Seq/BAMS/"${base%.*}".BL_removed.bam "$f";
samtools index ./ChIP_Seq/BAMS/"${base%.*}".BL_removed.bam;
done
Another essential step is peak calling. Tons of peak callers exist, some are better at broad peak calling, some are made for TFs, etc etc. MACS is one of the most common and most well-regarded peak callers out there. It generally does pretty well even if you botch the options, and the author has recently began developing it again. The script below is meant to run on a batch of files that have a corresponding input/IgG sample within the same folder.
Here I use a q-value of 0.01, as this should help reduce peaks called in samples with high background. You may adjust them as you see fit.
# bash - cluster
# This example is for H3K27AC samples, hence the wildcard use to capture that.
for f in ./ChIP_Seq/BAMs/*C.sorted.BL_removed.bam; do
base=${f##*/}
macs2 callpeak -t "$f" -c ./ChIP_Seq/BAMs/*INPUT.sorted.BL_removed.bam -n "$base" \
--outdir ./ChIP_Seq/MACS/H3K27AC --tempdir ./ChIP_Seq/scratch/MACS2 -q 0.01 \
--nomodel --shiftsize=150;
done
ROSE was the original tool made for SE calling, and it’s still used today. Some other tools do the same thing, but they all take the same approach, so might as well stick with the OG, yeah?
First, we need our peaks in a gff
format.
# python
import glob
# Store file names
marks = ["FAIRE", "H3AC", "H3K27AC", "H3K4ME1"]
for m in marks:
in_files = glob.glob("./ChIP_Seq/MACS2/" + m + "/PEAKS_NARROW/*.narrowPeak")
for n in in_files:
base = ".".join(input_file.split(".")[0:-1])
output_file = "./ChIP_Seq/MACS2/" + m + "/PEAKS_NARROW/" + base + ".gff"
with open(n) as f:
# Open output file, "w" to make it writable
output_f = open(output_file, "w")
new_ID = 1
for line in f:
line = line.strip().split()
# Handle if bed file already has an ID that should be retained.
if len(line) > 3:
if line[3] is not None and line[3] is not ".":
ID = line[3]
else:
ID = new_ID
else:
ID = new_ID
print(line[0], ID, "PEAK", str(int(line[1]) + 1), line[2],
".", ".", ".", ID, sep="\t", file=output_f)
new_ID += 1
output_f.close()
Easy enough to use, just have to feed it the BAM
file for both your sample and the input along with the peaks. Requires samtools
and R
on your PATH
, so be sure both of those can be run. It also has to be run from the rose
folder wherever you cloned it - trying to add the rose
directory to your PATH
really doesn’t work. These are pretty standard options. -t
defines what you want to consider the promoter of a gene, in this case we’re using +/- 2500 bp. It excludes peaks in these regions from the stitching process.
# bash
base=/mnt/f/BCellRework/ChIP_Seq
for f in "$base"/BAMs/K27AC/TS*K27AC.sorted.BL_removed.bam; do
samp=${f##*/};
samp=${samp%.*};
short=${samp%%_*};
echo "processing $samp";
python ROSE_main.py -g hg19 -r "$f" -t 2500 -c "$base"/BAMs/INPUT/"$short"_INPUT.sorted.bam \
-o ../RESULTS/"$short" -i "$base"/ROSE/PEAKS_GFF/"$short"_peaks.clean.gff;
done
Ultimately, we want to be able to visualize the results as well, so we can create at least semi-normalized tracks that account for read depth to make them more visually comparable. Always take track visualizations with a grain of salt - use them as representations of what you’ve found through more thorough statistical analyses, never try to use them to make claims or you’re gonna have a bad time.
I use deepTools
to generate RPKM-normalized signal bigwig generation, and ignore chrM, X, and Y for these purposes. It can be installed with a simple pip install deeptools
.
# bash
mkdir ./ChIP_Seq/TRACKS
for f in ./ChIP_Seq/BAMs/*BL_removed.bam; do
echo "$f";
bamCoverage --bam "$f" -o ./ChIP_Seq/TRACKS/"$f".rpkm.bw --binSize 10 \
--ignoreForNormalization chrX chrM chrY --extendReads 150 \
--normalizeUsing RPKM -p max/2;
done
For the peaks, look up the bigNarrowPeak
format and convert to it with bedToBigBed
. You will have to fetch the hg19 chromosome sizes first with UCSC’s fetchChromSizes
utility. The kentUtils
tools will give you everything you need for this. You might have to change the score in the 5th column for a few lines by running something like:
Install this package if you haven’t already with devtools::install_github("j-andrews7/OneLinerOmics")
. If you don’t have devtools
, it can be installed with install.packages("devtools")
.
First, we’ll QC samples, as ChIP tends to be much more variable than RNA-seq in terms of quality. In particular, we’re interested in cross-correlation values and the fraction of reads in peaks, which will help to identify poor quality samples where our IP just didn’t work all that well.
These are difficult samples to work with, after all, and we had very low numbers of cells for many of them. Similar to the RNA-seq analysis, we will use a sample sheet, but this one requires a more strict structure. The columns shown are the only ones allowed. We will also use these sample sheets later on, so if you have more complicated comparisons you want to do, be sure to structure your sheet appropriately so that they are modeled fully within one column.
Pay careful attention to provide the correct file paths to your peak and read files as well as the input controls.
# R
# Basic sample sheet.
samples <- read.table("./ChIP_Seq/ChIPQC/H3AC/SampleSheet_H3AC_QC_narrow.NoCTRL.csv", header=TRUE, sep = ",")
head(samples, 10)
I use ChIPQC
to run QC. This package can be a…challenge to use. It is capable of generating a report, but often errors, so this package just uses metrics without the report.
The authors are also slow to keep it up to date, so new R/Bioconductor versions break it all the time. I actually had to download it from another repo that included a fix to get it to run this time around. Installing this package should install that version, so you shouldn’t have to mess with it.
# R
h3ac.exp <- RunChIPQC(outpath = "./ChIP_Seq/ChIPQC/H3AC/",
samplesheet = "./ChIP_Seq/ChIPQC/H3AC/SampleSheet_H3AC_QC_narrow.NoCTRL.csv")
faire.exp <- RunChIPQC("./ChIP_Seq/ChIPQC/FAIRE/",
"./ChIP_Seq/ChIPQC/FAIRE/SampleSheet_FAIRE_QC_narrow.NoCTRL.csv")
h3k4me1.exp <- RunChIPQC("./ChIP_Seq/ChIPQC/H3K4ME1/",
"./ChIP_Seq/ChIPQC/H3K4ME1/SampleSheet_H3K4ME1_QC_narrow.NoCTRL.csv")
h3k27ac.exp <- RunChIPQC("./ChIP_Seq/ChIPQC/H3K27AC/",
"./ChIP_Seq/ChIPQC/H3K27AC/SampleSheet_H3K27AC_QC_narrow.NoCTRL.csv")
In addition to the metrics, this function will generate a plot of the peak overlaps between samples, showing how overlap criteria will impact your consensus peak set. By default, any peaks that overlap in at least 2 samples will be retained and merged. Making this more stringent will help reduce noise at the cost of potentially leaving out real peaks that are unique to a few samples. This curve usually exhibits a near geometric drop-off, with the number of overlapping peaks halving for each increase in strictness. If the drop-off is very steep, it likely means many of your peaks are false positives (or your samples are highly variable). High peak agreement will lead to a less severe drop-off.
Kick ’em off the sample sheets. I removed samples with the following metrics: * FAIRE - <2.5% Reads in Peaks * H3AC - <8.5% Reads in Peaks * H3K4ME1 - <7.5% Reads in Peaks * H3K27AC - <4.5% Reads in Peaks
Picked rather arbitrarily, but generally want at least 2-3x the typical RiP of the input samples.
Another one-liner. This analysis uses DiffBind along with ChIPseeker for peak annotation and enrichR (wrapped by EZscRNA) for enrichment analyses.
I have 4 different marks of interest here, so I’m going to run it on each of them as well as the super enhancers from the H3k27ac data. From our QC up above, I know that I want to bump up the number of samples containing a peak for it to be merged, so I set n.consensus = 3
for all but the super enhancers.
# R
# For annotation.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ses <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/H3K27AC_SE/SampleSheet_H3K27AC_DiffBind_SEs.csv",
txdb = txdb,
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(0.5, 1, 2),
heatmap.preset = "BrTe",
scale.full = FALSE)
h3ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3AC/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/H3AC/SampleSheet_H3AC_DiffBind.csv",
txdb = txdb,
n.consensus = 3,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "OrPu",
scale.full = FALSE)
faire <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/FAIRE/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/FAIRE/SampleSheet_FAIRE_DiffBind.csv",
txdb = txdb,
n.consensus = 3,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "PuGr",
scale.full = FALSE)
h3k4me1 <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K4ME1/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/H3K4ME1/SampleSheet_H3K4ME1_DiffBind.csv",
txdb = txdb,
n.consensus = 3,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "BuOr",
scale.full = FALSE)
h3k27ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K27AC/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/H3K27AC/SampleSheet_H3K27AC_DiffBind.csv",
txdb = txdb,
n.consensus = 3,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "BrTe",
scale.full = FALSE)
In this case, I want to generate a peakset to be used across all the marks for differential binding, allowing for easier identification of peaks with multiple marks altered between two groups of samples. Since I’ve already removed a fair amount of noise by setting n.consensus = 3
, I’m just going to concatenate and merge the consensus peaks from each mark. Then I’ll edit my samplesheets above to use this new peakset for each sample and re-run the differential binding analysis for each mark.
Here I’m just concatenating and merging the position from a random comparison - it’ll be the same positions across all comparisons, as we’re not filtering at all, so it doesn’t matter which is used.
# bash
tail -n +2 -q ./ChIP_Seq/DiffBind/H3K27AC/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt \
./ChIP_Seq/DiffBind/FAIRE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt \
./ChIP_Seq/DiffBind/H3AC/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt \
./ChIP_Seq/DiffBind/H3K4ME1/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt | \
cut -f 1-3 - | sort -k1,1V -k2,2n - > ./ChIP_Seq/DiffBind/MergedPeaks/All.Peaks.bed
bedtools merge ./ChIP_Seq/DiffBind/MergedPeaks/All.Peaks.bed > ./ChIP_Seq/DiffBind/MergedPeaks/MergedPeaks.bed
Then I edited my sample sheets to use the merged peakset for all samples.
Now the analysis can be run again using the new samplesheets to generate results for the same peaks across all marks.
# R
# Using the consensus SEs for H3ac, FAIRE, and H3K4me1 as well, as I want to be able to show signal correlation.
ses_k27 <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/H3K27AC_SE/SampleSheet_H3K27AC_DiffBind_SEs.csv",
txdb = txdb,
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(0.5, 1, 2),
heatmap.preset = "BrTe",
scale.full = FALSE)
h3ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/H3AC/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/H3AC/SampleSheet_H3AC_DiffBind.csv",
txdb = txdb,
n.consensus = 1,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "OrPu",
scale.full = FALSE,
flank.anno = FALSE)
faire <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/FAIRE/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/FAIRE/SampleSheet_FAIRE_DiffBind.csv",
txdb = txdb,
n.consensus = 1,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "PuGr",
scale.full = FALSE,
flank.anno = FALSE)
h3k4me1 <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/H3K4ME1/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/H3K4ME1/SampleSheet_H3K4ME1_DiffBind.csv",
txdb = txdb,
n.consensus = 1,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "BuOr",
scale.full = FALSE,
flank.anno = FALSE)
h3k27ac <- RunDiffBind(outpath = "./ChIP_Seq/DiffBind/MergedPeaks/H3K27AC/AllComparisons",
samplesheet = "./ChIP_Seq/DiffBind/MergedPeaks/H3K27AC/SampleSheet_H3K27AC_DiffBind.csv",
txdb = txdb,
n.consensus = 1,
se = "./ChIP_Seq/DiffBind/H3K27AC_SE/AllComparisons/NoBlock/ResultsTables/CLL-v-DL.AllPeaks.txt",
level = "Tissue",
fdr.thresh = c(0.05, 0.01),
fc.thresh = c(1, 2),
heatmap.preset = "BrTe",
scale.full = FALSE,
flank.anno = FALSE)
I also want to run another comparison (BCL v Tonsils), so I will set that up as well.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.14.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 rstudioapi_0.10 knitr_1.26
## [4] magrittr_1.5 MASS_7.3-51.4 R6_2.4.1
## [7] rlang_0.4.1 stringr_1.4.0 tools_3.6.1
## [10] xfun_0.11 htmltools_0.4.0 yaml_2.2.0
## [13] assertthat_0.2.1 digest_0.6.22 rprojroot_1.3-2
## [16] pkgdown_1.4.1 crayon_1.3.4 bookdown_0.16
## [19] BiocManager_1.30.10 fs_1.3.1 memoise_1.1.0
## [22] evaluate_0.14 rmarkdown_2.0 stringi_1.4.3
## [25] compiler_3.6.1 desc_1.2.0 backports_1.1.5
## [28] jsonlite_1.6 reticulate_1.14