This program implements the techniques described in Z. Weinberg and W.L. Ruzzo (2004) "Faster genome annotation of non-coding RNA families without loss of accuracy", in Proc. Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB), ACM Press, 243-251.
The files in the main directory (source code and this README file) are copyright 2004 by Zasha Weinberg and distributed under the BSD license.
The 'NotByZasha' directory contains 3rd-party libraries. CFSQP is distributed by www.aemdesign.com. It is not freely available, but is provided free of charge on request to academic institutions. lp_solve and Opt++ are distributed under the GNU Lesser General Public License (LGPL). Infernal is distributed under the GNU General Public License (GPL).
First, build the dependencies (the 3rd-party libraries in the 'NotByZasha' directory).
Then go to the main directory and type 'make'. In theory, that is all that is needed. There should be two executables in the release directory, cm2hmm and cm2hmmsearch.
release/cm2hmm creates a compact- or expanded-type HMM from a given CM. release/cm2hmmsearch searches a FASTA sequence file using a CM and the profile HMM rigorous filters created by cm2hmm. Both programs display simple usage instructions when run without any parameters.
Here's an example of creating both compact- and expanded-type HMMs for RF00095, and scanning the Pyrococcus abyssi genome.
Enter the following commands (which each take a minute or so to complete):
release/cm2hmm data/RF00095.cm data/RF00095_compact.hmm file data/Ecoli_0mm.mm compact cfsqp 0 1
release/cm2hmm data/RF00095.cm data/RF00095_expanded.hmm file data/Ecoli_0mm.mm expanded cfsqp 0 1
release/cm2hmmsearch 150 23.5 data/RF00095.cm data/RF00095_compact.hmm data/RF00095_expanded.hmm data/AL096836.fna 1
The first two commands create the HMMs given the CM in data/RF00095.cm. Both are optimized based on a 0th-order Markov model of the E. coli K-12 genome. The last command uses these HMMs to accelerate a search of the Pyrococcus abyssi genome (data/AL096836.fna). The search outputs the family members found in essentially the same format as Infernal. An important new piece of information is the 'frac let thru so far', which gives the filtering fraction measured on this genome. The reported filtering fraction is for the 2nd HMM, i.e. the expanded-type one. (The output also reports '2d-fracLetsThru', a variant of the filtering fraction that attempts to account for the extra dimension in the dynamic programming algorithm for CMs; because of that extra dimension, the plain filtering fraction is a somewhat pessimistic estimate of the actual speed-up.)
Note: the code for the OptNIPS solver (using the Opt++ package) isn't working.
The choice of Markov model in the infinite-length forward algorithm does not usually affect the filtering fraction that much, but a good choice can yield a modest improvement in filtering fraction (typically around 10%). In general, it's best to use the 0th-order model of the genome that has the highest (worst) filtering fraction. To estimate this, create a compact-type HMM from any model, and run it on the Bordetella, E. coli and S. aureus genomes.
Once you've picked a 0th-order Markov model, the easiest thing to do is to create both compact- and expanded-type HMMs, and run them on the three genomes. This yields an estimate of the filtering fraction for each of the two HMMs. If the filtering fraction of the compact-type HMM is above about 0.25, it's probably not worth using it as a pre-filter: the rule of thumb is that the expanded-type HMM runs about 30% slower than the compact-type HMM, so the compact-type HMM only pays for itself if it eliminates enough of the sequence (roughly 75% or more) to save the expanded-type HMM at least as much work as the compact-type scan costs. Conversely, if the compact-type HMM's filtering fraction is already low, there's no real need to also use the expanded-type HMM, but it can't hurt.
The difference in speed between the CM and the HMMs is mainly determined by the window length W. The HMM is usually faster than the CM by a factor of a bit over W. So, if the filtering fraction is significantly below 1/W, the search time is dominated by the HMM's scan time, and there's little point in pursuing a better filtering fraction.
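To make the arithmetic concrete, here is a minimal sketch (not from the source code; the cost model and function name are hypothetical) that charges 1 unit per nucleotide for the HMM scan and about W units per nucleotide for the CM scan:

  #include <cstdio>

  // Hypothetical cost model, for illustration only: the HMM scan costs 1 unit per
  // nucleotide, the CM scan costs about W units per nucleotide, and the CM only
  // scans the fraction of nucleotides that the filter lets through.
  static double EstimatedSpeedup(double W, double fracLetThru)
  {
      double filteredCost = 1.0 + fracLetThru * W; // HMM on everything + CM on the surviving fraction
      return W / filteredCost;                     // relative to running the CM on everything
  }

  int main(void)
  {
      // With W=150, a filtering fraction well below 1/W (~0.0067) already gives a
      // speed-up close to the maximum of ~W; improving it further gains little.
      printf("%g\n", EstimatedSpeedup(150, 0.001));  // ~130x
      printf("%g\n", EstimatedSpeedup(150, 0.0001)); // ~148x, barely better
      return 0;
  }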
This section assumes you've read the above RECOMB paper, including the Appendix. (The terminology used in the source code is more closely related to the terminology used in the CM literature.)
This section is intended to point out the key parts of the source code relating to searching with profile HMM rigorous filters, as an aid to understanding the code.
The top-level main function that handles the cm2hmmsearch command is in Cm2HmmSearchMain.cpp. The key function in this file is the first 'Cm2Hmm_Search' function. Basically, this function just iterates over sequences ("while (sequenceSet.Next())"), scanning each sequence, like the analogous function in cmsearch.c in Infernal.
The profile HMM creation code uses a representation of profile HMMs (in Infernal.h) that is heavily based on Infernal's representation of CMs. In this representation the start state is 0, the end state is the highest-numbered state, and the children of a state have higher numbers than their parent. The direction that child edges point is a hassle for scanning, since we'd like to start with the start states and work towards the end states. So, I've created another representation of HMMs (in HmmType1.h) in which the children are lower-numbered states. This is just a different representation of an equivalent HMM, but makes the scanning code easier to write. The HmmType1 type also structures the data somewhat differently in memory in order to get a slight performance boost in scanning. The main difference is that the emission score array's first dimension is the nucleotide, which promotes better cache usage.
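As a rough illustration of the layout idea (the member names below are hypothetical; the real HmmType1 is richer), storing the nucleotide as the first dimension means that updating all states at one sequence position reads a contiguous run of scores:

  #include <vector>

  // Hypothetical sketch of the memory layout idea behind HmmType1's emission scores:
  // the nucleotide is the first dimension, so all states' scores for the nucleotide
  // at the current position lie next to each other in memory.
  struct HmmScoresSketch {
      int numStates;
      std::vector<float> emitScore; // emitScore[nuc][state], flattened

      float GetEmitScore(int nuc, int state) const {
          return emitScore[nuc * numStates + state];
      }
  };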
A key data structure in the Cm2Hmm_Search function is the 'HitList' data type (declared in cmzasha.h, and defined in cmzashaUtils.cpp). This data structure represents the list of ranges of nucleotides that must be searched by the CM (i.e. that the rigorous filter was not able to eliminate). HitList derives from the STL list type. Each element in the list is an interval defined by the members 'first' and 'second'. The interval is half-open, i.e. it includes the nucleotides first, first+1, first+2, ..., second-2, second-1, but does not include the nucleotide second. With respect to 'first' and 'second', the first nucleotide position is numbered 0.
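In spirit (the real class lives in cmzasha.h and cmzashaUtils.cpp; the typedef and helper below are just a hypothetical sketch), a HitList behaves like this:

  #include <list>
  #include <utility>

  // Sketch of the HitList idea: a list of half-open, 0-based intervals
  // [first, second) of nucleotide positions that the CM still has to scan.
  typedef std::list<std::pair<int,int> > HitListSketch;

  // Total number of nucleotides covered by the intervals in a HitList.
  static int TotalNucs(const HitListSketch& hits)
  {
      int total = 0;
      for (HitListSketch::const_iterator i = hits.begin(); i != hits.end(); ++i) {
          total += i->second - i->first; // half-open: 'second' itself is excluded
      }
      return total;
  }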
The function 'ApplyFilters' scans a range of nucleotides with profile HMMs and returns the HitList of intervals that the CM must scan.
The code contains functions that run the HMM Viterbi algorithm in order to implement the rigorous filter. These functions are in ScanHMM_NonTemplated.cpp. The main function is 'ScanHmm_HmmType1Float_NonTemplated'. This function scans a sequence and returns a HitList of the intervals that the rigorous filter cannot eliminate. The input sequence is itself represented by a HitList, which typically just contains a single interval spanning the entire sequence. The key loop in the function iterates over the intervals in the input HitList and scans each one.
The function 'ScanHmm_HmmType1Float_NonTemplated_Window' scans one interval ("interval" is a synonym of "window"...). This is basically the standard HMM Viterbi algorithm, which starts by initializing the dynamic programming table, then updates the table for each position. The algorithm only stores the partial table at 2 nucleotide positions, which saves memory (a standard trick); these two positions are represented by prevTable and currTable.
The main difference between this function and the standard HMM Viterbi algorithm is that the HMM scores are compared to the CM's score threshold; if they exceed the threshold, an interval is merged into the output interval list (section 4.1 of the paper).
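Here is a much-simplified sketch of that idea (all types and names are hypothetical; the real function works on HmmType1 and handles many details omitted here): a two-row Viterbi table is rolled along the sequence, and whenever the end-state score reaches the CM's threshold, the window of length W ending at that position is merged into the HitList:

  #include <algorithm>
  #include <list>
  #include <utility>
  #include <vector>

  typedef std::list<std::pair<int,int> > HitListSketch; // as in the earlier sketch

  // Hypothetical HMM: tr[to][from] is a transition score (very negative if there is
  // no edge), em[nuc][state] is an emission score, state 0 is the start state and
  // the highest-numbered state is the end state.
  struct TinyHmmSketch {
      int numStates;
      std::vector<std::vector<float> > tr;
      std::vector<std::vector<float> > em;
  };

  static void ScanWindowSketch(const TinyHmmSketch& hmm, const std::vector<int>& seq,
                               int windowLen, float cmScoreThreshold, HitListSketch& hits)
  {
      const float NEG_INF = -1e30f;
      // Only two rows of the Viterbi table are stored (the prevTable/currTable trick).
      std::vector<float> prevTable(hmm.numStates, NEG_INF), currTable(hmm.numStates);

      for (int pos = 0; pos < (int)seq.size(); pos++) {
          prevTable[0] = std::max(prevTable[0], 0.0f); // an alignment may begin at any position
          int nuc = seq[pos];
          for (int state = 0; state < hmm.numStates; state++) {
              float best = NEG_INF;
              for (int from = 0; from < hmm.numStates; from++) {
                  best = std::max(best, prevTable[from] + hmm.tr[state][from]);
              }
              currTable[state] = best + hmm.em[nuc][state];
          }
          // Rigorous-filter twist (section 4.1): if the end state's score reaches the
          // CM's threshold, the window ending here cannot be eliminated, so the
          // corresponding interval is merged into the HitList.
          if (currTable[hmm.numStates - 1] >= cmScoreThreshold) {
              int first = std::max(0, pos + 1 - windowLen), second = pos + 1;
              if (!hits.empty() && hits.back().second >= first) {
                  hits.back().second = second;              // overlaps or abuts: merge
              } else {
                  hits.push_back(std::make_pair(first, second));
              }
          }
          prevTable.swap(currTable); // roll the two rows forward
      }
  }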
The search code computes the filtering fraction, which is reported to the user. This is implemented by the 'FracLetsThruCounter' object (declared in cmzasha.h, defined in cmzashaUtils.cpp). This object counts the number of nucleotides that the CM must be run on, and the total number of nucleotides; the ratio of these two numbers is the filtering fraction. (The object is slightly more complicated in that it (1) also computes the 2-d filtering fraction, which better reflects the expected speed-up of the CM Viterbi algorithm, as described above, and (2) can operate on blocks of sequence, reporting statistics for each block; the latter functionality is not demonstrated in this code.)
To update the counts after a sequence is scanned, the function 'ProcessPruning' must be called. The number of nucleotides scanned is computed by inspecting 'inputHmmList', and the number of nucleotides that the CM will be run on comes from 'outputHmmList'.
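In outline (the member and function names here are hypothetical), the bookkeeping amounts to:

  // Sketch of the bookkeeping behind FracLetsThruCounter; the real object also
  // tracks the 2-d filtering fraction and per-block statistics.
  struct FracLetsThruSketch {
      long long nucsScannedByHmm; // total nucleotides given to the filter (from inputHmmList)
      long long nucsPassedToCm;   // nucleotides the filter could not eliminate (from outputHmmList)

      FracLetsThruSketch() : nucsScannedByHmm(0), nucsPassedToCm(0) {}

      // Analogous to ProcessPruning: called once per scanned sequence.
      void Process(long long inputNucs, long long outputNucs) {
          nucsScannedByHmm += inputNucs;
          nucsPassedToCm   += outputNucs;
      }

      double FracLetThru() const {
          return (double)nucsPassedToCm / (double)nucsScannedByHmm;
      }
  };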
The 'main' function for the cm2hmm command is in Cm2HmmMain.cpp. The functions in this file basically just parse the command line and load data structures (e.g. the CM), before dispatching to the 'HmmOptimizer_NodeCombiner' function (in Cm2HmmOptimize.cpp).
The 'HmmOptimizer_NodeCombiner' function is the top-level function that implements the logic to create profile HMMs. It dispatches to other code to create the structure of the profile HMM (i.e. the grammar, section 4.2 of the paper) and to determine the linear inequalities (section 4.3 of the paper). The function itself is focused on implementing the iterative procedure of section 4.4.2 of the paper. It cycles through CM nodes, and optimizes the corresponding HMM nodes under the infinite-length forward algorithm objective function, holding the scores in all other nodes fixed. At each CM node it largely uses other code to implement the actual optimization procedure. Once the maximum number of iterations is exceeded, or all nodes have been optimized without much improvement, the loop ends; the profile HMM is saved to a file, and the program exits.
For simplicity, the reader should assume that numAdjacentNodesToMerge=1 and maxNodesAtATime=1. (These parameters were designed to see if the profile HMMs were any better if more than one node is optimized at a time, instead of considering each node in isolation; this idea wasn't discussed in the paper, mainly because it doesn't appear to improve the results.)
The variable 'infernalHmm' holds the profile HMM (including scores) as it's being optimized. To facilitate the optimization problems based on the infinite-length forward algorithm, the scores in the HMM are mapped onto variables, numbered starting at 0. These are called 'globalVars' and the function 'GetGlobalVarsFromInfernalHmm' creates a vector containing the globalVars, which are set from the scores in 'infernalHmm'.
'NodeCombinerForwardInfSymbolicObjectiveFunc' is a class that implements the infinite-length forward algorithm objective function, with derivatives. Its implementation is discussed later. In terms of its interface, it has functions to evaluate the objective function and its derivatives, and to retrieve the list of linear inequalities. This objective function object is passed to a solver (e.g. CFSQP) to optimize the node.
'problemVars' is the subset of HMM score variables that relate to the current CM node. The objective function knows which variables are in the current node-specific problem.
Once we find the problemVars vector, we can evaluate the infinite-length forward algorithm score (i.e. the objective function), and report this to the user. Then we see if we should stop, i.e. if we've cycled through all CM nodes and haven't made much improvement in the objective function.
If we're supposed to continue, we invoke the solver (via the 'solverWrapper' object) to solve the optimization problem for this node. Then we map the optimal problemVars back to globalVars, and set these scores into the 'infernalHmm'.
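Putting the last few paragraphs together, the overall loop looks roughly like the following (a pseudocode-style sketch, assuming numAdjacentNodesToMerge=1 and maxNodesAtATime=1; all names are hypothetical stand-ins for the real classes and functions):

  #include <vector>

  // One optimization sub-problem per CM node.
  struct NodeProblemSketch {
      std::vector<int> problemVarToGlobalVar;                       // which globalVars this node touches
      double Evaluate(const std::vector<double>& problemVars);      // inf-length forward score
      std::vector<double> Solve(const std::vector<double>& start);  // run the solver (e.g. CFSQP)
  };

  void OptimizeHmmSketch(std::vector<double>& globalVars,       // scores pulled from infernalHmm
                         std::vector<NodeProblemSketch>& nodes, // one per CM node
                         int maxIters, double minImprovement)
  {
      for (int iter = 0; iter < maxIters; iter++) {
          double improvementThisPass = 0;
          for (size_t n = 0; n < nodes.size(); n++) {
              // Gather this node's problemVars out of globalVars.
              std::vector<double> problemVars(nodes[n].problemVarToGlobalVar.size());
              for (size_t i = 0; i < problemVars.size(); i++)
                  problemVars[i] = globalVars[nodes[n].problemVarToGlobalVar[i]];

              double before = nodes[n].Evaluate(problemVars);
              std::vector<double> solved = nodes[n].Solve(problemVars); // other nodes held fixed
              double after = nodes[n].Evaluate(solved);
              improvementThisPass += before - after; // we are minimizing the objective

              // Map the optimal problemVars back into globalVars (and hence into infernalHmm).
              for (size_t i = 0; i < solved.size(); i++)
                  globalVars[nodes[n].problemVarToGlobalVar[i]] = solved[i];
          }
          if (improvementThisPass < minImprovement)
              break; // a full pass over the nodes didn't help much, so stop
      }
  }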
The higher-level function that creates the HMM structure and the linear inequalities is 'Cm2Hmm_WithWeighting_NoCaching' (in Cm2HMM.cpp). This code does three things:
The structure of the HMM is recursively created by the 'Cm2Hmm_Structurally' function (the first function with this name in Cm2HMM.cpp). This basically handles CM bifurcation states, where the profile HMMs for the two children must be spliced together.
The creation of a profile HMM for a CM that is free of bifurcation states is done by 'Cm2Hmm_Structurally_Block'. This is where the distinction between compact-type and expanded-type is made.
The function 'SetupTransitionAndEmissionVariables' creates a mapping between transitions and emissions in the HMM and a set of variables.
The process of creating the linear inequalities and solving the simple linear program is done by the function 'Cm2Hmm_SolveScoresForPath', which treats each CM node independently. The linear inequalities are created by the function 'Cm2Hmm_MakeInequalitiesForPath', which works by exploring all paths from each CM state to a CM state in the next node (in the paper, this relates to Appendix section B, the 2nd-last paragraph beginning "In making constraints..."). Every time such a path is found, we have a new inequality. (As it explores sub-paths, the function keeps track, in the variable 'inequalitySoFar', of the score of the CM path and of which HMM transition/emission score variables are used in the corresponding HMM sub-path.)
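The shape of each inequality is worth spelling out (hypothetical types below; see Appendix B of the paper): the HMM score along the sub-path, a sum of transition/emission score variables, must be at least the score of the corresponding CM path, so that the HMM never scores a hit lower than the CM would.

  #include <vector>

  // Hypothetical sketch of what one inequality from Cm2Hmm_MakeInequalitiesForPath
  // expresses.  'hmmVarsUsed' lists the HMM transition/emission score variables on
  // the HMM sub-path, and 'cmPathScore' is the accumulated CM score ('inequalitySoFar').
  struct InequalitySketch {
      std::vector<int> hmmVarsUsed;
      double cmPathScore;
  };

  // The inequality asserts:  sum of score[v] over v in hmmVarsUsed  >=  cmPathScore,
  // i.e. the HMM must score this path at least as high as the CM does.
  static bool IsSatisfied(const InequalitySketch& ineq, const std::vector<double>& score)
  {
      double lhs = 0;
      for (size_t i = 0; i < ineq.hmmVarsUsed.size(); i++)
          lhs += score[ineq.hmmVarsUsed[i]];
      return lhs >= ineq.cmPathScore;
  }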
The function 'SolveInequalities' creates the linear objective function and solves the resulting linear program, using the lp_solve package.
At a high level, the infinite-length forward algorithm is implemented by the 'NodeCombinerForwardInfSymbolicObjectiveFunc' object (declared in Cm2HmmOptimize.h and defined in Cm2HmmOptimize.cpp). This object implements the 'ObjectiveFunc' interface (declared at the top of Cm2HmmOptimize.h), which also requires it to know about the linear inequalities. This object is basically a wrapper around other functions. Its constructor ('NodeCombinerForwardInfSymbolicObjectiveFunc::NodeCombinerForwardInfSymbolicObjectiveFunc') is complicated by the fact that the code allows for optimizing more than one node at the same time, a feature not discussed in the paper; for simplicity, assume that numAdjacentNodesToMerge=1 and maxNodesAtATime=1.
The handling of the linear inequalities is simpler than it might appear; really, all the code is doing is looking up the linear inequalities made in Cm2HMM.cpp and changing the variable numbers from globalVars to problemVars, i.e. it takes the subset of variables that actually relate to the given CM node and renumbers them consecutively starting at 0. This is for the convenience of the solver programs, which assume a set of consecutively numbered variables.
The handling of the objective function is somewhat trickier. Recall that we wish to not only evaluate the infinite-length forward algorithm, but also take partial derivatives. To do this, we'll create a symbolic expression of the function, which will make it easier to take derivatives.
To create the symbolic expression, the program uses two things: (1) some function that actually implements the infinite-length forward algorithm and (2) a mechanism to trick this function into creating a symbolic expression instead of just returning a number. The function that evaluates the infinite-length forward algorithm is 'InfiniteLengthForwardAlg_LastStateDoesntEmit_StartWithInfernalHmm' (in ForwardHmm.cpp). Note that this function is templated on a numeric type 'Real'; we'll use a symbolic expression object in this template parameter (instead of a numeric type like 'double') in order to build up the symbolic expression.
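The trick is easiest to see in miniature (the function below is hypothetical; the real one is far more elaborate): write the arithmetic once against a template parameter, then instantiate it either with 'double' to get a number or with the symbolic Expression type to get an expression DAG.

  #include <vector>

  // Any type supporting +, * and copying can be plugged in for 'Real': a plain
  // double gives a numeric answer, while a symbolic expression type records the
  // same arithmetic as an expression instead.
  template <class Real>
  Real SumOfProductsSketch(const std::vector<Real>& p, const std::vector<Real>& q)
  {
      Real sum = p[0] * q[0];
      for (size_t i = 1; i < p.size(); i++)
          sum = sum + p[i] * q[i];
      return sum;
  }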
The real function is a lot like the HMM Forward Algorithm. To calculate emission "probabilities", the nucleotide probabilities are marginalized using the 0th-order Markov model, in the 'CalcExpectedEmitProb_0order' function. The other difference versus the Forward Algorithm is the handling of self-loops; these must be detected, and an infinite geometric series must be summed (for a self-loop taken with probability p, the series 1 + p + p^2 + ... sums to 1/(1-p)).
Now, on to the symbolic expression. The code in SymbolicMath.h and SymbolicMath.cpp implements symbolic expressions. Each operator we might use (e.g. multiplication, addition, log to the base 2) is implemented by a different object derived from 'ExpressionNode' (a nested class of SymbolicMath). Thus, we can build up a directed acyclic graph (DAG) of these expression nodes to symbolically represent the infinite-length forward algorithm score.
We also need a fake numeric type (to replace 'Real' in the 'InfiniteLengthForwardAlg_LastStateDoesntEmit_StartWithInfernalHmm' function) that creates these expressions. This is implemented by the 'Expression' class (a nested class of SymbolicMath). If X and Y are both variables of type Expression, then the C++ expression X+Y will generate a symbolic expression with an addition object ('ExpressionNode_Add') whose two children are X and Y. By feeding this Expression type into the 'InfiniteLengthForwardAlg_LastStateDoesntEmit_StartWithInfernalHmm' function, the function will generate a symbolic expression. (We also have to define a fake HMM class that returns Expression objects when asked for transition/emission scores; this is implemented by the 'SymbolicProbVariableMath' object in Cm2HmmOptimize.h and Cm2HmmOptimize.cpp.)
Based on the symbolic expression stored as a DAG, we can evaluate the infinite-length forward algorithm as needed. One slight complication is that since it's a DAG (not a tree), we have to be careful about not exploring nodes multiple times, since that'd require exponential time. To solve this problem, the 'ExpressionNode' base class implements a kind of dynamic programming, by caching its value when the overall expression is being evaluated; when we visit a node for the 2nd time, it simply retrieves its value.
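A heavily simplified sketch of the whole arrangement (hypothetical names; the real classes are SymbolicMath::ExpressionNode and SymbolicMath::Expression, and they support many more operators) might look like this:

  #include <vector>

  // Expression nodes form a DAG; Eval caches each node's value so shared
  // sub-expressions are computed only once (the cache must be cleared before
  // each new evaluation, a detail omitted here).  Memory is leaked for brevity.
  struct NodeSketch {
      enum Kind { CONST, VAR, ADD, MULT } kind;
      double constValue;          // for CONST
      int varNum;                 // for VAR
      NodeSketch *left, *right;   // for ADD / MULT
      bool cached;
      double cachedValue;

      double Eval(const std::vector<double>& varValues) {
          if (cached) return cachedValue; // second visit: just re-use the value
          switch (kind) {
          case CONST: cachedValue = constValue; break;
          case VAR:   cachedValue = varValues[varNum]; break;
          case ADD:   cachedValue = left->Eval(varValues) + right->Eval(varValues); break;
          case MULT:  cachedValue = left->Eval(varValues) * right->Eval(varValues); break;
          }
          cached = true;
          return cachedValue;
      }
  };

  // Thin wrapper whose overloaded operators build the DAG, standing in for
  // SymbolicMath::Expression.
  struct ExprSketch {
      NodeSketch *node;
  };

  static ExprSketch MakeBinary(NodeSketch::Kind kind, ExprSketch x, ExprSketch y)
  {
      NodeSketch *n = new NodeSketch();
      n->kind = kind; n->left = x.node; n->right = y.node; n->cached = false;
      ExprSketch e; e.node = n; return e;
  }
  ExprSketch operator+(ExprSketch x, ExprSketch y) { return MakeBinary(NodeSketch::ADD,  x, y); }
  ExprSketch operator*(ExprSketch x, ExprSketch y) { return MakeBinary(NodeSketch::MULT, x, y); }

Instantiating a templated routine like the earlier SumOfProductsSketch with ExprSketch instead of double would then produce a DAG describing the computation rather than a number.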
Derivatives are handled using rules from introductory calculus. For example, consider an ExpressionNode_Mult object, which represents multiplication of two sub-expressions X and Y. Its partial derivative is given by the product rule: X*(dY) + (dX)*Y. For each ExpressionNode object, we can find similar formulas for the partial derivative.
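Continuing the hypothetical NodeSketch above, the partial derivative with respect to a variable v can be computed node by node with the usual rules (the real ExpressionNode classes cache these values too):

  // d/dv of a CONST is 0; of a VAR it is 1 only for that variable; sums differentiate
  // term by term; products use the product rule X*(dY) + (dX)*Y.
  static double EvalDerivativeSketch(NodeSketch *n, const std::vector<double>& varValues, int v)
  {
      switch (n->kind) {
      case NodeSketch::CONST: return 0;
      case NodeSketch::VAR:   return n->varNum == v ? 1 : 0;
      case NodeSketch::ADD:   return EvalDerivativeSketch(n->left,  varValues, v)
                                   + EvalDerivativeSketch(n->right, varValues, v);
      case NodeSketch::MULT:  return n->left->Eval(varValues) * EvalDerivativeSketch(n->right, varValues, v)
                                   + EvalDerivativeSketch(n->left, varValues, v) * n->right->Eval(varValues);
      }
      return 0;
  }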