Lab 5: khmer¶

During this lab, we will acquaint ourselves with digital normalization. You will:

Install software and download data
Quality and adapter trim data sets.
Apply digital normalization to the dataset.
Count and compare kmers and kmer distributions in the normalized and un-normalized dataset.
Plot in RStudio.

The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf The Khmer manual: http://khmer.readthedocs.org/en/v1.1 Skewer: http://www.biomedcentral.com/1471-2105/15/182 Seqtk: https://github.com/lh3/seqtk

> Step 1: Launch and AMI. For this exercise, we will use a c4.2xlarge instance. ADD A 100GB HARD DRIVE TO YOUR INSTANCE

ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

> Update Software

sudo apt-get update && sudo apt-get -y upgrade

> Install other software

sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre libboost1.55-all python-pip gfortran libreadline-dev

> Install Skewer

cd $HOME
git clone https://github.com/relipmoc/skewer.git
cd skewer
make
PATH=$PATH:$(pwd)

> Install Jellyfish

cd $HOME
wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
tar -zxf jellyfish-2.1.3.tar.gz
cd jellyfish-2.1.3/
./configure
make -j4
PATH=$PATH:$(pwd)/bin

> Install seqtk. Make sure you look at the seqtk website. This is a really powerful package.

cd $HOME
git clone https://github.com/lh3/seqtk.git
cd seqtk
make -j4
PATH=$PATH:$(pwd)

> Install Khmer

cd $HOME
sudo easy_install -U setuptools
sudo pip install screed pysam khmer

> Download data and a file with the Illumina adapters, TruSeq3-PE.fa

cd $HOME
curl -LO https://s3.amazonaws.com/gen711/TruSeq3-PE.fa
mkdir -p $HOME/reads && cd $HOME/reads
curl -L https://s3.amazonaws.com/Mc_Transcriptome/Thomas_McBr1_R1.PF.fastq.gz > kidney.1.fq.gz &
curl -L https://s3.amazonaws.com/Mc_Transcriptome/Thomas_McBr1_R2.PF.fastq.gz > kidney.2.fq.gz

> Merge –> Trim low quality bases and adapters from dataset –> count kmers –> make a histogram. Normalize in the 1nd command. Make sure you know what is going on here!

mkdir $HOME/trimming && cd $HOME/trimming


trim=2
norm=30

#paste the below lines together as 1 command

seqtk mergepe $HOME/reads/kidney.1.fq.gz $HOME/reads/kidney.2.fq.gz \
  | skewer -l 25 -m pe --mean-quality $trim --end-quality $trim -t 8 -x $HOME/TruSeq3-PE.fa - -1 \
  | jellyfish count -m 25 -s 700M -t 8 -C -o /dev/stdout /dev/stdin \
  | jellyfish histo /dev/stdin -o trimmed.no.normalize.histo

#and

#paste the below lines together as 1 command

seqtk mergepe $HOME/reads/kidney.1.fq.gz $HOME/reads/kidney.2.fq.gz \
  | skewer -l 25 -m pe --mean-quality $trim --end-quality $trim -t 8 -x $HOME/TruSeq3-PE.fa - -1 \
  | normalize-by-median.py --max-memory-usage 4e9 -C 30 -o - - \
  | jellyfish count -m 25 -s 700M -t 8 -C -o /dev/stdout /dev/stdin \
  | jellyfish histo /dev/stdin -o trimmed.yes.normalize.histo

> Open up a new terminal window using the buttons command-t. You should be entering the scp command on your local computer (not on AWS).

scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/home/ubuntu/trimming/*histo ~/Downloads/

> Now, on your MAC, open up RStudio and plot.

> OPEN RSTUDIO

#Import all 2 histogram datasets: this is the code for importing 1 of them..

y_khmer <- read.table("~/Downloads/trimmed.yes.normalize.histo", quote="\"")
n_khmer <- read.table("~/Downloads/trimmed.no.normalize.histo", quote="\"")

#What does this plot show you??

barplot(c(n_khmer$V2[1],y_khmer$V2[1]),
    names=c('Non-normalized', 'C50 Normalized'),
    main='Number of unique kmers')

# plot differences between non-unique kmers

plot(y_khmer$V2[0:300] - n_khmer$V2[0:300], type='l',
    xlim=c(0,300), xaxs="i", yaxs="i", frame.plot=F,
    ylim=c(-20000,20000), col='red', xlab='kmer frequency',
    lwd=4, ylab='count',
    main='Diff in 25mer counts of \n normalized vs. un-normalized datasets')
abline(h=0)

> What do the analyses of kmer counts tell you?

Lab 5: khmer¶

TERMINATE YOUR INSTANCE¶