Lab 1: Alignment¶

During this lab, we will acquaint ourselves with alignment. Your objectives are:

Familiarize yourself with the software, how to execute it, how to visualize results.
Regarding your dataset, tell me how some of these genes are related to their homologous copies.

Step 1: Launch and AMI. For this exercise, we will use a c4.2xlarge instance.

ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com

The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software, easily. The command sudo apt-get update checks a server for updates to existing software.

sudo apt-get update && sudo apt-get -y upgrade

OK, what are these commands? sudo is the command that tells the computer that we have admin privileges. Try running the commands without the sudo – it will complain that you don’t have admin privileges or something like that. Careful here, using sudo means that you can do something really bad to your own computer – like delete everything, so use with caution. It’s not a big worry when using AWS, as this is a virtual machine- fixing your worst mistake is as easy as just terminating the instance and restarting.

> So now that we have updates the software, lets see how to add new software. Same basic command, but instead of the update or upgrade command, we’re using install. EASY!!

sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre ncbi-blast+

Finish up with GitHub¶

Go to the github repo you just created, and find the box on the middle right side of the screen labelled HTTPS clone URL. Copy the text that is in that box, then go back to your terminal.

git clone https://github.com/macmanes-lab/fri_lecture.git
cd fri_lecture  #you're going to ``cd`` into some other directory, whatever you named your repo.

##lets make a file, for fun

nano test_file.txt


git add test_file.txt
git commit -m 'i am adding my first new file, YAY'
git push  #You're going to have to type in your Github user name and password

##Go to github, you might have to reload teh page, and see that your file has been added.

##now lets edit that file ``test_file.txt``

nano test_file.txt

git commit -m 'whoa, version control FTW!!'
git push

##Go to github, you might have to reload the page, click on the filename, then history.

Install mafft and RAxML¶

Let’s install mafft so that we can do an alignment (http://mafft.cbrc.jp/alignment/software/)

cd $HOME
curl -LO http://mafft.cbrc.jp/alignment/software/mafft-7.245-without-extensions-src.tgz
tar -zxf mafft-7.245-without-extensions-src.tgz
cd mafft-7.245-without-extensions/core
sudo make && sudo make install
cd

Now lets install RAxML so that we can make a phylogeny. (http://sco.h-its.org/exelixis/web/software/raxml/index.html)

cd $HOME
git clone https://github.com/stamatak/standard-RAxML.git
cd standard-RAxML/
make -f Makefile.PTHREADS.gcc
PATH=$PATH:$(pwd)
cd

remember, for blasting we need both some data (a query) and a database. Lets start with the data 1st. You will have one of the 5 different datasets. Do you remember how to use the curl command from last week?

sudo chown ubuntu /mnt
cd /mnt
curl -LO https://s3.amazonaws.com/gen711/dataset1.pep

Download the database. For this exercise we will use Swissprot, which is the same curated database we used last week

curl -LO https://s3.amazonaws.com/gen711/uniprot.pep

Make blast database and blast¶

make a blast database

makeblastdb -in uniprot.pep -out uniprot -dbtype prot

**For teh query sequnces, I am going to pull out 2 specific sequences from the larger dataset dataset1.pep. These are both HOX genes that I have identified ahead of time. The command grep finds words/numbers/symbols in files. Remember that the > sends the results of the command to a file, in this case names query.pep.

grep -A1 'ENSPTRP00000032491\|ENSPTRP00000032494' dataset1.pep > query.pep

Now that I have a couple of query sequences, we are ready to blast.

blastp -evalue 8e-8 -num_threads 8 -db uniprot -query query.pep -max_target_seqs 3 -outfmt "6 qseqid pident evalue stitle"

You will see the results in a table with 4 columns. Use blastp -help to see what the results mean. Test out some of the blast options. Try changing the word size -word_size, scoring matrix, evalue, cost to open or extend a gap. See how these changes affect the results.

Make a file that contains the sequences for the blast hits you just discovered

grep --no-group-separator -A1 'HXA2_HUMAN\|HXA2_BOVIN\|HXA2_PAPAN\|HXA3_HUMAN\|HXA3_MOUSE\|HXA3_BOVIN\|HXA9_HUMAN' uniprot.pep > results.pep

**Now, make a file that containsBOTH the query sequences AND the sequences we found by blasting.

cat query.pep results.pep > for_alignment.pep

mafft¶

Align the proteins using mafft

mafft --reorder --bl 80 --localpair --thread 8 for_alignment.pep > for.tree

RAxML¶

Make a phylogeny

raxmlHPC-PTHREADS -f a -m PROTCATBLOSUM62 -T 8 -x 34 -N 100 -n tree -s for.tree -p 35

Copy phylogeny and view online.

more RAxML_bipartitionsBranchLabels.tree

#copy this info.

Visualize tree on website: http://www.evolgenius.info/evolview/

Click on the folder icon in teh top-left part of the page.
Paste in the code from your terminal. FYI, this is the NEWICK tree format, yes, named after Newick’s just down the road from us!!
Find the HOX9 gene - this is the outgroup sequence we will use to rood the tree. Hower over teh branch and it will turn red - a box will open, click “reroot here”