Clade Structure Grammar

From Biowiki
Jump to: navigation, search

Clade-specific structure grammar

Motivation

  • Detecting species-specific structured elements, particularly in viruses
  • Estimating rate/frequency of these events, as well as their 'location' on species tree

Relevant Papers

Examples

S_* are structural states, whereas N_* are non-structural. This example leads me to believe that the hybrid-chains are not working correctly, since the mutations in A,B ideally shouldn't affect the S_2 state. However, I've only just set up the grammar and not trained it or finely-set the parameters...

Example 1: "structure" conserved in C and D

# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_cladeStructure -121.775
A								  AAAAAACGAACTCTTTTTT
B								  AGAAAACGAAGCATATTTT
C								  AAAAAACGATCCTTTTTTT
D								  AAAAAACGATCCTTTTTTT
#=GC N_AB						...................
#=GC N_CD						...................
#=GC N_root					 ...................
#=GC S_AB						...................
#=GC S_CD						--<<<<--<---->->>>>
#=GC S_root					 ...................
#=GC S_CD: ideal			  <<<<<<------>>>>>>  
//

Using pfold grammar:

# STOCKHOLM 1.0
#=GF NH			  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_pfold -90.3301
A					  AAAAAACGAACTCTTTTTT
B					  AGAAAACGAAGCATATTTT
C					  AAAAAACGATCCTTTTTTT
D					  AAAAAACGATCCTTTTTTT
#=GC PFOLD		  ...................
//

After macro-ifying, now works better than before (?). There was some small parameter fiddling I suppose. Also a different alignment.

# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -122.65
#=GC CS		.................
#=GC BP		.................
A				ATTAAACGAACTCTGGT
B				AGAACACGAAGCATATT
C				AAAAAACGATCCTTTTT
D				AAAAAACGATCCTTTTT
#=GC N_AB	 .................
#=GC N_CD	 .................
#=GC N_root  .................
#=GC S_AB	 .................
#=GC S_CD	 <<<<<<<--->>>>>>>
#=GC S_root  .................

Example 2: structure conserved in all leaves. Should annotate to S_root, but doesn't:

# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_cladeStructure -60.1976
A								  AAAAAACGAACCTTTTTTT
B								  AAAAAACGAACCTTTTTTT
C								  AAAAAACGATCCTTTTTTT
D								  AAAAAACGATCCTTTTTTT
#=GC N_AB						...................
#=GC N_CD						...................
#=GC N_root					 ...................
#=GC S_AB						-------------------
#=GC S_CD						...................
#=GC S_root					 ...................
//
[oscar@sheridan testXrateGrammars]$ xrate -g pfold.eg allStem.stk
# STOCKHOLM 1.0
#=GF NH			  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_pfold -62.0713
A					  AAAAAACGAACCTTTTTTT
B					  AAAAAACGAACCTTTTTTT
C					  AAAAAACGATCCTTTTTTT
D					  AAAAAACGATCCTTTTTTT
#=GC PFOLD		  <<<<<<......>>>>>>.
//

Example 3: partially-forced state path. Alignments are annotated with binary variables indicating whether or not a structure is emitted from given state.

# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -119.883
A								  ATTAAACGAACTCTGGT
B								  AGAACACGAAGCATATT
C								  AAAAAACGATCCTTTTT
D								  AAAAAACGATCCTTTTT
#=GC N_AB						.................
#=GC N_CD						.................
#=GC N_root					 .................
#=GC S_AB						.................
#=GC S_AB_structBin		  00000000000000000
#=GC S_CD						<<<<<<<--->>>>>>>
#=GC S_CD_structBin		  11111111111111111
#=GC S_root					 .................
#=GC S_root_structBin		00000000000000000
//

For example, if we annotate the input alignment with having a structure in the AB clade (which is purposely unreasonable), we get the following:


[oscar@sheridan testXrateGrammars]$ cat practiceTrain.stk
# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -119.883
A								  ATTAAACGAACTCTGGT
B								  AGAACACGAAGCATATT
C								  AAAAAACGATCCTTTTT
D								  AAAAAACGATCCTTTTT
#=GC S_AB_structBin		  11111111111111111
//
[oscar@sheridan testXrateGrammars]$ xrate -g CS_ian.eg practiceTrain.stk
# STOCKHOLM 1.0
#=GF SC_max_CladeStructure -119.883
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -138.287
A								  ATTAAACGAACTCTGGT
B								  AGAACACGAAGCATATT
C								  AAAAAACGATCCTTTTT
D								  AAAAAACGATCCTTTTT
#=GC N_AB						.................
#=GC N_CD						.................
#=GC N_root					 .................
#=GC S_AB						<<<<<<<--->>>>>>>
#=GC S_AB_structBin		  11111111111111111
#=GC S_CD						.................
#=GC S_CD_structBin		  00000000000000000
#=GC S_root					 .................
#=GC S_root_structBin		00000000000000000
//

Automatic naming of internal nodes. By specifying a list of groups, each group's MRCA is named and later given it's own rate-shifted state. Here this is done for rhinoviruses A,B,C and poliovirus outgroup:

CladeStructureGrammar.Picture7.png

Problems with partially-annotated alignments.

  • Can't handle transitions between S and N states. This alignment has likelihood -inf:
# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A								  ATTAAACGAACTCTGGT
B								  AGAACACGAAGCATATT
C								  AAAAAACGATCCTTTTT
D								  AAAAAACGATCCTTTTT
#=GC S_AB_structBin		  00000000111111111
#=GC N_root					11111111000000000
//

  • HOwever, transitions between N states are ok:
# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A								  ATTAAACGAACTCTGGT
B								  AGAACACGAAGCATATT
C								  AAAAAACGATCCTTTTT
D								  AAAAAACGATCCTTTTT
#=GC N_AB				  00000000000111111
#=GC N_root					11111111111000000
#=GC N_CD						00000000000000000
#=GC S_AB_structBin		  00000000000000000
#=GC S_CD_structBin		  00000000000000000
#=GC S_root_structBin		00000000000000000

//

Solution to previous example

  • The rule i_N* -> j_S k_N forces each structural region to be flanked by a nonstructural region(see examples below). This really ought to be changed in the production rules...

This alignment has nonzero likelihood:

[oscar@garibaldi cladeStructure]$ xrate -g grammars/cladeStructure.eg alignments/practice2.stk
# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
#=GF SC_max_CladeStructure -149
A								  ATTAAACGAACTCTGGTG
B								  AGAACACGAAGCATATTG
C								  AAAAAACGATCCTTTTTG
D								  AAAAAACGATCCTTTTTG
#=GC N_AB						111111111110000001
#=GC N_CD						000000000000000000
#=GC N_root					 000000000000000000
#=GC S_AB						..................
#=GC S_AB_structBin		  000000000000000000
#=GC S_CD						...........<<-->>.
#=GC S_CD_structBin		  000000000001111110
#=GC S_root					 ..................
#=GC S_root_structBin		000000000000000000
//

Whereas this one evaluates to -inf:


[oscar@garibaldi cladeStructure]$ cat alignments/practice3.stk
# STOCKHOLM 1.0
#=GF NH						  ((A:0.1000000000,B:0.1000000000)AB:0.1000000000,(C:0.1000000000,D:0.1000000000)CD:0.1000000000)root;
A								  ATTAAACGAACTCTGGTG
B								  AGAACACGAAGCATATTG
C								  AAAAAACGATCCTTTTTG
D								  AAAAAACGATCCTTTTTG
#=GC N_AB		  		111111111110000000
#=GC N_root 				  000000000000000000
#=GC N_CD						000000000000000000
#=GC S_AB_structBin		  000000000000000000
#=GC S_CD_structBin		  000000000001111111
#=GC S_root_structBin		000000000000000000

  • Possible workaround:
    • "top level" nonterminal which spits out structured and nonstructured nonterminals. Do away with transformations between S and N states
		START -> Top
		Top -> i_N_Top_bif -> i_N Top
		Top -> i_S_Top_bif -> i_S Top
		Top -> END

-- Oscar Westesson - 01 Oct 2009