From Biowiki
Jump to: navigation, search

Rec HMM is an HMM for detecting tree-topology changes in multiple sequence alignments (which may be due to recombination, or other reasons), developed by Oscar Westesson and Ian Holmes.

For more info, contact either of the above.

The paper describing the method and some results is available here:

The code is available through biowiki anonymous cvs (see Cvs Manual , project name: recHMM_dist ); you can browse the source code repository here. If you have trouble using CVS, you can also download a frozen version from here: . Be warned: there is some trickiness involving the C++ extension to properly interact with python, using SWIG. Do contact Oscar Westesson for help with this (some instructions are given in the [BiowikiCvsCheckout:recHMM_dist/INSTALL INSTALL] file distributed with the source code; more instructions are below).

Command-line usage

 ./ ALIGNMENTFILE  [ options ]


The command-line options (so far, I will add more functionality soon) are:

Option Meaning
-k N Set HMM states to N (default 2)
-s N Number of samples when initializing trees in HMM states (default 20)
-t N The number of independent EM trials to run (default 1)
-log Print helpful logging messages
-p Output a pdf in the form of posteriors_trial_${trial}.pdf of the final posterior distribution. This gives a visual display of where the breakpoints came from. (See example below)
-lt Output a file trees.txt which has the final trees from each state in newick format
==-lb== Output a file breakpoints.txt which has the final predicted breakpoints for each trial
-c N Set the minimum cutoff length for a recombinant region (default 50)
-maxiters N Set the maximum number of EM iterations


There are a few options for output, depending on what one is looking for, and how close one is willing to get to the data:

  • Output predicted breakpoints, clean and simple. This is the default which is printed to stdout after convergence. Or, you can specify -lb (ie. log breakpoints) and a file called breakpoints.txt will store the breakpoints, along with the sequence likelihood for each trial.
  • Output posterior state probabilities. The posterior distribution is essentially where the breakpoints were predicted from, and so looking at this gives a visual representation of the predictions with option -p . Colorized plots (see below) are produced for each trial. The interpretation is that each state is assigned a color, and the relative height of the color bars corresponds to its posterior probability of emitting a given column.
  • Output trees along with their predicted regions. If you're more interested in what phylogenies were predicted on each region, specifying -lt will create a file called trees.txt which will have each tree (in Newick Format ) along with the region on which it was predicted.

The following plot is from the analysis of a recently-sequenced Malaysian HIV-1 strain, suspected of being recombinant. 2 trees, 'grey' and 'black' partition the alignment into regions. Previous predicted breakpoints and their approximate location in the original plot are shown in red. Note that the recHMM has sharper cutoffs for its regions, and detects two regions (green dotted lines) which the original window-based method didn't see.




The algorithm is implemented mostly in Python, and it depends on the packages Num Py (for numerical calculations, matrix multiplication, etc) and Rp Y (plotting).

One step (the E-Step of Structural EM) implemented as a C++ extension module. This extension package helps the E-Step of structural EM happen about 20 times faster than the original Python implementation. As for compiling this, the only tricky bit is knowing how to tell the compiler to make the executable a 'shared library object'. This website has some good information:

First, make the relevant swig in-between file (do not actually look at the file, however!):

$swig -c++ -python -o memory_tree_wrap.cpp memory_tree.i

On linux, the compilation step is very simple. Just cd to the directory where the extension files are and type:

g++ -fpic -c memory_tree.cpp memory_tree_wrap.cpp -I/usr/include/python2.3 
g++ -shared memory_tree.o memory_tree_wrap.o -o

On OS X, I recall it was a bit trickier (the following should all be on one line, but has been broken up here for readability):

g++ -fPIC -I/Library/Frameworks/Python.framework/Versions/2.4/Headers
 -framework Python -bundle
 -bundle_loader /Library/Frameworks/Python.framework/Versions/2.4/bin/python
 memory_tree_wrap.cpp memory_tree.cpp

The following code works on OS X 10.7.3 with python v2.7 (YMMV):

swig -c++ -python -o memory_tree_wrap.cpp memory_tree.i
g++ -fPIC -I/usr/include/python2.7/ -I/usr/lib/python2.7 -framework Python -bundle -bundle_loader /usr/bin/python2.7 memory_tree_wrap.cpp memory_tree.cpp  -o

To test the installation, cd to the directory where the extension files are and type


Which will result in either "C++ extension was not successfully imported. Try re-doing SWIG compilation steps" or "C++ extension appears to be working properly"