Lab 5: khmer¶
During this lab, we will acquaint ourselves with digital normalization. You will:
- Install software and download data
- Quality and adapter trim data sets.
- Apply digital normalization to the dataset.
- Count and compare kmers and kmer distributions in the normalized and un-normalized dataset.
- Plot in RStudio.
The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf The Khmer manual: http://khmer.readthedocs.org/en/v1.1 Skewer: http://www.biomedcentral.com/1471-2105/15/182 Seqtk: https://github.com/lh3/seqtk
> Step 1: Launch and AMI. For this exercise, we will use a c4.2xlarge instance. ADD A 100GB HARD DRIVE TO YOUR INSTANCE
ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
> Update Software
sudo apt-get update && sudo apt-get -y upgrade
> Install other software
sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre libboost1.55-all python-pip gfortran libreadline-dev
> Install Skewer
cd $HOME
git clone https://github.com/relipmoc/skewer.git
cd skewer
make
PATH=$PATH:$(pwd)
> Install Jellyfish
cd $HOME
wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
tar -zxf jellyfish-2.1.3.tar.gz
cd jellyfish-2.1.3/
./configure
make -j4
PATH=$PATH:$(pwd)/bin
> Install seqtk. Make sure you look at the seqtk website. This is a really powerful package.
cd $HOME
git clone https://github.com/lh3/seqtk.git
cd seqtk
make -j4
PATH=$PATH:$(pwd)
> Install Khmer
cd $HOME
sudo easy_install -U setuptools
sudo pip install screed pysam khmer
> Download data and a file with the Illumina adapters, TruSeq3-PE.fa
cd $HOME
curl -LO https://s3.amazonaws.com/gen711/TruSeq3-PE.fa
mkdir -p $HOME/reads && cd $HOME/reads
curl -L https://s3.amazonaws.com/Mc_Transcriptome/Thomas_McBr1_R1.PF.fastq.gz > kidney.1.fq.gz &
curl -L https://s3.amazonaws.com/Mc_Transcriptome/Thomas_McBr1_R2.PF.fastq.gz > kidney.2.fq.gz
> Merge –> Trim low quality bases and adapters from dataset –> count kmers –> make a histogram. Normalize in the 1nd command. Make sure you know what is going on here!
mkdir $HOME/trimming && cd $HOME/trimming
trim=2
norm=30
#paste the below lines together as 1 command
seqtk mergepe $HOME/reads/kidney.1.fq.gz $HOME/reads/kidney.2.fq.gz \
| skewer -l 25 -m pe --mean-quality $trim --end-quality $trim -t 8 -x $HOME/TruSeq3-PE.fa - -1 \
| jellyfish count -m 25 -s 700M -t 8 -C -o /dev/stdout /dev/stdin \
| jellyfish histo /dev/stdin -o trimmed.no.normalize.histo
#and
#paste the below lines together as 1 command
seqtk mergepe $HOME/reads/kidney.1.fq.gz $HOME/reads/kidney.2.fq.gz \
| skewer -l 25 -m pe --mean-quality $trim --end-quality $trim -t 8 -x $HOME/TruSeq3-PE.fa - -1 \
| normalize-by-median.py --max-memory-usage 4e9 -C 30 -o - - \
| jellyfish count -m 25 -s 700M -t 8 -C -o /dev/stdout /dev/stdin \
| jellyfish histo /dev/stdin -o trimmed.yes.normalize.histo
> Open up a new terminal window using the buttons command-t. You should be entering the scp
command on your local computer (not on AWS).
scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/home/ubuntu/trimming/*histo ~/Downloads/
> Now, on your MAC, open up RStudio and plot.
> OPEN RSTUDIO
#Import all 2 histogram datasets: this is the code for importing 1 of them..
y_khmer <- read.table("~/Downloads/trimmed.yes.normalize.histo", quote="\"")
n_khmer <- read.table("~/Downloads/trimmed.no.normalize.histo", quote="\"")
#What does this plot show you??
barplot(c(n_khmer$V2[1],y_khmer$V2[1]),
names=c('Non-normalized', 'C50 Normalized'),
main='Number of unique kmers')
# plot differences between non-unique kmers
plot(y_khmer$V2[0:300] - n_khmer$V2[0:300], type='l',
xlim=c(0,300), xaxs="i", yaxs="i", frame.plot=F,
ylim=c(-20000,20000), col='red', xlab='kmer frequency',
lwd=4, ylab='count',
main='Diff in 25mer counts of \n normalized vs. un-normalized datasets')
abline(h=0)
> What do the analyses of kmer counts tell you?