ProtPal is a software tool for multiple sequence alignment, ancestral reconstruction, and measurement of indel rates on a phylogenetic tree. The mathematical details of ProtPal's algorithm are described here.
ProtPal is distributed as part of the dart package.
As an alternative to downloading a tarball, developers who have the 'git' tool installed on their systems can clone the git repository like so:
git clone git://github.com/ihh/dart.git
ProtPal is currently distributed as part of the dart package. To install it, do the following:
- Download and unpack the dart package from the links above.
- Type cd dart; ./configure; make protpal (see building dart for more info).
- The protpal executable is created in the dart/bin subdirectory.
Basic usage is like this:
protpal -fa SEQUENCE_FILE
SEQUENCE_FILE is assumed to contain unaligned sequences in FASTA format.
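As a minimal sketch of preparing such an input, the snippet below writes a toy unaligned FASTA file and counts its records before handing it to protpal. The sequences, record names, and file name are made up for illustration, and protpal is only invoked if it happens to be on your PATH.

```shell
# Hypothetical toy input: three short, unaligned protein sequences (made up).
workdir=$(mktemp -d)
fasta="$workdir/toy.fa"
cat > "$fasta" <<'EOF'
>seqA
MKTAYIAKQR
>seqB
MKTAYAKQR
>seqC
MKSAYIAKQRQ
EOF

# Each FASTA record starts with '>', so counting header lines counts sequences.
num_seqs=$(grep -c '^>' "$fasta")
echo "sequences: $num_seqs"

# Run protpal only if it is installed; -fa is the basic usage shown above.
if command -v protpal >/dev/null 2>&1; then
    protpal -fa "$fasta"
else
    echo "protpal not found on PATH; skipping alignment"
fi

# Remove the temporary directory.
rm -rf "$workdir"
```

Counting the header lines first is a cheap sanity check that the file holds the number of sequences you expect.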
Some of the most commonly-used command-line options are:
|-h||Print long help message including all command-line options|
|-stk FILE||Load Stockholm-format sequences from the specified file. If the file has a #=GF NH line, that tree is used for alignment|
|-t STRING||Newick tree string, in double quotes|
|-tf FILE||Load Newick tree string from file|
|-b FILE||Load handalign point substitution model (e.g. rate matrix) from specified file|
|-a||Only display leaf alignment (no ancestral sequences); "alignment mode"|
|-e N||Maximum allowed distance between aligned leaf characters (default 300)|
|-d FLOAT||Deletion rate (default .0025)|
|-i FLOAT||Insertion rate (default .0025)|
|-x FLOAT||Gap extend probability (default .9)|
|-g FILE||Load xrate-format chain from specified file, for use in final character reconstruction|
|-m N||Maximum number of delete states in sampled DAG (default 1000)|
|-n N||Number of alignment paths to sample at each node (default 10)|
|-sa||Show alignment sampled at each node (default False)|
|-s||Instead of aligning, simulate a set of unaligned sequences according to the specified models (default False)|
|-rl INT||Instead of sampling from the root transducer, force the root sequence to have this length|
|-arpp||Report posterior probabilities of alternate reconstructions, conditional on ML indel reconstruction|
|-marp FLOAT||Minimum probability to report for -arpp option (default 0.01)|
|-ep||Estimate parameters for branch transducer (not yet implemented...coming soon)|
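The -stk entry above says that a tree is taken from the input file when it has a #=GF NH line. The sketch below shows one way to check for such a line before running protpal. The Stockholm file contents and the helper name has_tree are hypothetical, chosen only for illustration.

```shell
# Hypothetical toy Stockholm file with an embedded Newick tree (made-up data).
workdir=$(mktemp -d)
stk="$workdir/toy.stk"
cat > "$stk" <<'EOF'
# STOCKHOLM 1.0
#=GF NH (seqA:0.1,seqB:0.2);
seqA MKTAYIAKQR
seqB MKTAY-AKQR
//
EOF

# has_tree FILE: succeed if FILE contains a "#=GF NH" annotation line.
has_tree() {
    grep -q '^#=GF[[:space:]][[:space:]]*NH[[:space:]]' "$1"
}

if has_tree "$stk"; then
    tree_status="embedded tree found"
else
    tree_status="no embedded tree; one could be given with -t or -tf"
fi
echo "$tree_status"

# Remove the temporary directory.
rm -rf "$workdir"
```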
It is a good idea, but not essential, for your chain file (handalign-format Markov chain substitution model) and grammar file (xrate-format grammar file containing a single-character chain) to be consistent with each other.
For a quick look at ProtPal in action, try the following command from the DART directory:
./bin/protpal -stk src/protpal/testing/testSeqs.stk -sa true -n 5 -g data/handalign/prot1.hsm
This runs ProtPal on a small alignment, printing 5 sampled alignments at each internal node, using the specified chain.
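Other options from the table can be combined in the same way. The sketch below assembles, but does not run, a command that supplies a tree file and non-default indel rates. The file names are placeholders, the rate values are arbitrary, and it is an assumption that -fa and -tf can be used together like this.

```shell
# Placeholder inputs (hypothetical file names); rates chosen only for illustration.
seqs="my_seqs.fa"
tree="my_tree.nh"
ins_rate=0.005
del_rate=0.005

# Build the command as a string so it can be inspected before running it.
# Assumption: -fa (FASTA input) and -tf (Newick tree file) may be combined.
cmd="protpal -fa $seqs -tf $tree -i $ins_rate -d $del_rate -n 5"
echo "$cmd"
```

Printing the command first makes it easy to confirm the option values before starting a long run.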
Notes for beta testers, power users, developers
Downloading via git is highly preferred. This way, you'll have access to updates, bug fixes, etc., as soon as we make them.
You can also browse the source code at GitHub.
New feature requests
Practical matters, simulation results
We have applied ProtPal to sequences up to 10kb in length (e.g. small viral genomes) and trees of up to ~600 nodes with reasonable success. One important caveat is that ProtPal assumes a known tree and strictly adheres to the tree structure when creating an alignment. If the tree is wrong, the resulting alignment may appear "unreasonable". For this reason, caution should be used when attempting to align many sequences (especially if they are distantly related).
We have written a paper describing results on simulated data (submitted). Essentially, ancestral alignments (which assign sequences to all tree nodes, not just leaves) were simulated, and the unaligned leaf sequences were fed to various alignment programs (ProtPal, PRANK, MUSCLE, ProbCons, FSA, CLUSTALW). ProtPal and PRANK are capable of ancestral reconstruction, whereas for the remaining programs parsimony (partially randomized in the case of ties) was used to extend their alignments to ancestral alignments. Indel counts were tabulated for each reconstruction and used to estimate insertion and deletion rates for each program. The lambda.pl script packaged with the simulation program DAWG was also used to estimate rates from MUSCLE and FSA alignments. Although lambda.pl is the only other program (besides ProtPal) that claims to estimate indel rates, it was the least accurate method.
This was repeated for 5 indel rates (0.005, 0.01, 0.02, 0.04, and 0.08) and 3 substitution rates (0.5, 1.0, 2.0), with 100 replicates for each pair of rates. The true and inferred rates were then compared; below we show the ratio of inferred to true rates, aggregated across all rate categories. The "True simulated history" shows the ideal distribution, tightly clustered around inferred/true = 1 (red dashed line). The further the distribution stretches from 1, the worse the rate estimation. The root-mean-squared error (RMSE) is a convenient measure of how far from 1 the distribution tends to stray; lower numbers indicate a more accurate set of rate estimates. The true simulated history, ProtPal, and ProbCons are the top three by this statistic.
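To make the RMSE statistic concrete, here is a small sketch that computes it for a list of inferred/true ratios. It assumes the error is measured about the ideal ratio of 1, that is, the square root of the mean of (ratio - 1)^2. The three ratios are made-up numbers, not results from the benchmark.

```shell
# Hypothetical inferred/true rate ratios (made-up numbers, not benchmark results).
ratios="0.5 1.0 1.5"

# RMSE about the ideal ratio of 1: sqrt(mean((ratio - 1)^2)).
# LC_ALL=C keeps the decimal point as "." regardless of the user's locale.
rmse=$(echo "$ratios" | LC_ALL=C awk '{
    for (i = 1; i <= NF; i++) { d = $i - 1; ss += d * d; n++ }
} END { printf "%.4f", sqrt(ss / n) }')

echo "RMSE about 1: $rmse"
```

An underestimate of 0.5 and an overestimate of 1.5 contribute equally here, so this statistic measures spread around 1 rather than the direction of the bias.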
Values greater than one indicate overestimates of the rate, and values less than one indicate underestimates. ProtPal and PRANK (and, surprisingly, ProbCons) are the most accurate aligners. Many traditional progressive aligners (e.g. MUSCLE, CLUSTALW) overestimate deletions and underestimate insertions, a predictable consequence of not modeling indels as phylogenetic events.
Links to the papers describing the algorithm behind ProtPal and the simulation benchmark results will appear here as soon as they become available.