Hand Align Tutorial

From Biowiki
Jump to: navigation, search

HandAlign tutorial


In this tutorial we'll use the Hand Align program, part of the DART package, to examine a set of SIV/HIV gp120 sequences, located at DartData:handalign/hiv_gp120/gp120.fa . These are a subset of the seed alignment for PFAM family PF00516 (Pfam:PF00516). We'll use HandAlign to generate an MCMC chain of sampled alignments, trees, and indel parameters (H,T,\theta). Our goal is both to estimate a MAP value for each of these as well as characterize the uncertainty associated with each.

Note that these are inherently interdependent - a particular alignment might change the most likely tree, just as a particular indel parameter might influence which alignment is considered most likely by the model. By co-sampling all three of these using MCMC, we escape this "chicken and egg" loop (which is typically pushed under the rug by assuming the true alignment, parameters, or tree is known).

Along the way we'll visualize the progression of the MCMC trace (using stockfilm.pl), identify unreliable regions of the final alignment (using constock.pl), extract trees using stocktree.pl, and find the MAP tree/alignment/parameter set.

Running MCMC with handalign

First we generate the MCMC samples. To run HandAlign on the gp120 alignment with 2000 total samples, logging the MCMC trace in gp120.trace.stk, and outputting the best (H,T,\theta) triplet in gp120.MAP.stk at the end, we use the following command:

handalign data/hiv_gp120/gp120.fa -ts 2000 -af gp120.trace.stk -ub > gp120.MAP.stk

Viewing the trace with stockfilm.pl

The MCMC chain over alignments can be visualized using the DART utility stockfilm.pl (see Stockholm Tools). This script renders the alignments visited by the chain one after another in a terminal, much like the frames of a movie. The -delay option lets the user set the delay time between each alignment.

Use the following command to visualize the above chain (you may need to adjust your terminal window size so as to accommodate the largest alignment visited):

cat gp120.trace.stk | dart/perl/stockfilm.pl

Summarizing the trace with constock.pl

To summarize the MCMC trace in the form of a single alignment, we can use constock.pl (another of the Stockholm Tools in DART). Two accuracy measures - TCS (total column score) and SPS (sum of pairs score) - can be used to guide the selection of a representative "consensus alignment." Columns of this consensus alignment can either be colored according to their posterior probability (e.g. the frequency with which the column appears in the trace), or by amino acid/DNA character.

To summarize our MCMC run that we created above, we'll use the SPS score and posterior coloring:

dart/perl/constock.pl gp120.trace.stk -sps -pp

This results in something like the following (only the last group of rows of the alignment is shown): columns are colored according to their posterior probability, and the "#=GC PP 999..." line above the alignment shows the posterior probability of each column (multiplied by ten and rounded to a single digit, so "0" means probabilities between 0 and 0.1, "1" means probabilities between 0.1 and 0.2, etc., all the way up to the maximum of 9). This makes it easy to eyeball unreliable alignment regions. In this case, we see that the hypervariable regions are nearly unalignable - the most frequent columns have posterior probability near zero. Caution should be exercised in performing downstream analysis (e.g. detecting selection) on these alignment regions!


_(click to view full size)_

The co-ordinates of the hypervariable regions hV4 and hV5 annotated in this figure were taken from the following paper:

Picking out the highest-scoring Stockholm alignment in the trace

The consensus alignment, which we just calculated using constock.pl, is slightly different from the MAP alignment. The consensus alignment represents a summary of the entire trace, pieced together from representative columns, whereas the MAP alignment is the (H,T,\theta) which has the greatest likelihood under the model. To get the MAP alignment, one needs simply to invoke the -ub option, and the MAP estimates will be printed to STDOUT once the run is finished.

By directing the output to gp120.MAP.stk above, we have the best triplet saved in this file. We can view it using an alignment browser/editor, use it for other phylogenetic analyses, or extract the MAP parameter estimates.

We can also extract the tree included in this file using stocktree.pl (see Stockholm Tools), like this:

dart/perl/stocktree.pl gp120.MAP.stk > gp120.MAP.newick

This can then be easily visualized by tree viewers such as archaeopteryx:

java -jar forester.jar gp120.MAP.newick

Summarizing the parameter trace

The parameter trace can be summarized by standard MCMC analysis methods, since it is a standard continuously-valued parameter. Each sampled alignment stored in gp120.trace.stk is annotated with a #=GF INDELPARAMS tag which stores the Equilibrium-extend, delete rate, and delete extend parameters. Basic scripting and R commands were used to generate the plot below, wherein the MCMC trace is shown next to the posterior distribution estimated from the trace (after discarding the first 500 samples as burn-in).

In this case, we see that the posterior distribution of the indel rate is spread over the range 0.04-0.06 indels per substitution.


-- Oscar Westesson - 30 Dec 2011