User:ENurse/sandbox

From Wikipedia, the free encyclopedia

History[edit]

Hidden Markov Models (HMMs) and the Baum-Welch algorithm were first descirbed in a series of articles by Leonard E. Baum and his peers at the Institute for Defense Analysis in the late 1960's[1] . One of the first major applications of HMMs was to the field of speech processing[2]. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information[3]. They have since become an important tool in the probabilistic modeling of genomic sequences[4] .

Description[edit]

A Hidden Markov Model describes the joint probability of a collection of 'hidden' and observed discrete random variables. It relies on the assumption that the hidden variable given the hidden variable is independent of previous hidden variables and the current observation variables depend only on the current hidden state.
The Baum-Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors.
Let be a discrete hidden random variable with possible values. We assume the is independent of time . We can present this information as a time independent stochastic transition matrix

The initial state distribution (i.e. when ) is given by
The observation variables can take one of possible values. The probability of a certain observation vector at time for state is given by: is a by matrix.

An observation sequence is given by

Thus we can describe a hidden Markov chain by . The Baum-Welch algorithm finds . (i.e. the HMM parameters that maximise the probability of the observation.)[5]

Algorithm[edit]

Set with random initial conditions. They can also be set using prior information about the parameters if it is available.

Forward Procedure[edit]

Let , the probability of seeing the and being in state at time . This is found recursively:

Backward Procedure[edit]

Let that is the probability of the ending partial sequence given starting state and time . We calculate as,

Update[edit]

We can now calculate the temporary variables:

which is the probability of being in state at time given the observed sequence and the parameters

which is the probability of being in state and at times and respectively given the observed sequence and parameters .

can now be updated:

which is the expected frequency spent in state at time .

which is the expected number of transitions from state i to state j compared to the expected total number of transitions away from state i.

where is an indicator function and is the expected number of times the output observations have been equal to while in state over the expected total number of times in state .
These steps are now repeated iteratively until a desired level of convergence.
Note: It is possible to over-fit a particular data set. That is . The algorithm also does not guarantee a global maximum

Example[edit]

Suppose we have a chicken from which we collect eggs at noon everyday. Now whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can however (for simplicity) assume that there are only two states that determine whether the chicken lays eggs. Now we don't know the state at the initial starting point, we don't know the transition probabilities between the two states and we don't know the probability that the chicken lays an egg given a particular state[6] [7] . To start we first guess the transition and emission matrices.

Transition
State 1 State 2
State 1 0.5 0.5
State 2 0.3 0.7
Emission
No Eggs Eggs
State 1 0.3 0.7
State 2 0.8 0.2
Initial
State 1 0.2
State 2 0.8


We then take set of observations (E = eggs, N = no eggs): NN, NN, NN, NN, NE, EE, EN, NN, NN
The next step is to estimate a new transition matrix.

Observed sequence Probability of sequence and state is S1 then S2 Highest Probability of observing that sequence
NN 0.024 0.3584 S2,S2
NN 0.024 0.3584 S2,S2
NN 0.024 0.3584 S2,S2
NN 0.024 0.3584 S2,S2
NE 0.006 0.1344 S2,S1
EE 0.014 0.049 S1,S1
EN 0.056 0.056 S1,S2
NN 0.024 0.3584 S2,S2
NN 0.024 0.3584 S2,S2
Total 0.22 2.3898


Thus the new estimate for the S1 to S2 transition is now . We then calculate the S2 to S1, S2 to S2 and S1 to S1 transition probabilities and normalize so they add to 1. This gives us the updated transition matrix.
Next, we want to estimate a new emission matrix,

Observed Sequence Highest probability of observing that sequence if E is assumed to come from S1 Highest Probability of observing that sequence
NE 0.006 S2,S1 0.1344 S2,S1
EE 0.014 S1,S1 0.049 S1,S1
EN 0.056 S1,S2 0.056 S1,S2


This allows us to calculate the emission matrix as described above in the algorithm, by adding up the probabilities for the respective observed sequences. We then repeat for if N came from S1 and for if N and E came from S2 and normalize.
To estimate the initial probabilities we assume all sequences start with the hidden state S1 and calculate the highest probability and then repeat for S2. Again we then normalize to give an updated initial vector.
Finally we repeat these steps until the resulting probabilities converge satisfactorily.

Applications in Bioinformatics[edit]

Finding Genes[edit]

Prokaryotic[edit]

The GLIMMER (Gene Locator and Interpolated Markov ModelER) software was an early gene-finding program used for the identification of coding regions in prokaryotic DNA[8][9]. GLIMMER uses Interpolated Markov Models (IMMs) to identify the coding regions and distinguish them from the noncoding DNA. The latest release (GLIMMER3) has been shown to have increased specificity and accuracy compared with its predecessors with regard to predicting translation initiation sites, demonstrating an average 99% accuracy in locating 3' locations compared to confirmed genes in prokaryotes [10].

Eukaryotic[edit]

The GENSCAN webserver is a gene locator capable of analyzing eukaryotic sequences up to one million base-pairs (1 Mbp) long [11]. GENSCAN utilizes a general inhomogeneous, three periodic, fifth order Markov model of DNA coding regions. Additionally, this model accounts for differences in gene density and structure (such as intron lengths) that occur in different isochores. While most integrated gene-finding software (at the time of GENSCANs release) assumed input sequences contained exactly one gene, GENSCAN solves a general case where partial, complete, or multiple genes (or even no gene at all) is present[12]. GENSCAN was shown to exactly predict exon location with 90% accuracy with 80% specificity compared to an annotated database[13].

Copy-Number Variation Detection[edit]

Copy-number variations (CNVs) are an abundant form of genome structure variation in humans. A discrete-valued bivariate HMM (dbHMM) was used assigning chromosomal regions to seven distinct states: unaffected regions, deletions, duplications and four transition states. Solving this model using Baum-Welch demonstrated the ability to predict the location of CNV breakpoint to approximately 300 bp from micro-array experiments[14]. This magnitude of resolution enables more precise correlations between different CNVs and across populations than previously possible, allowing the study of CNV population frequencies. It also demonstrated a direct inheritance pattern for a particular CNV.


Implementations[edit]

hmmtrain in MATLAB

References[edit]

  1. ^ Rabiner, Lawrence. "First Hand: The Hidden Markov Model". IEEE Global History Network. Retrieved 2 October 2013.
  2. ^ Jelinek, F (May 1975). "Design of a linguistic statistical decoder for the recognition of continuous speech". IEEE Transactions of Information Theory. 21 (3): 250–6. doi:10.1109/TIT.1975.1055384. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)CS1 maint: date and year (link)
  3. ^ Bishop, M (20). "Maximum likelihood alignment of DNA sequences". J Mol Biol. 190 (2): 159–65. doi:10.1016/0022-2836(86)90289-5. PMID 3641921. {{cite journal}}: Check date values in: |date= and |year= / |date= mismatch (help); Unknown parameter |coauthors= ignored (|author= suggested) (help); Unknown parameter |month= ignored (help)
  4. ^ Durbin, Richard (1998). Biological Sequence Analysis. Cambridge, UK: Cambridge University Press.
  5. ^ Bilmes, Jeff A. (1998). A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Berkley, CA: International Computer Science Institute. pp. 7–13.
  6. ^ "Baum-Welch and HMM applications" (PDF). John Hopkins Bloomberg School of Public Health. Retrieved 2 October 2013.
  7. ^ Frazzoli, Emilio. "Intro to Hidden Markov Models the Baum-Welch Algorithm" (PDF). Aeronautics and Astronautics, Massachusetts Institute of Technology. Retrieved 2 October 2013.
  8. ^ Salzberg, Steven (1998). "Microbial gene identification using interpolated Markov Models". Nucleic Acids Research. 26 (2): 544–548. doi:10.1093/nar/26.2.544. PMC 147303. PMID 9421513. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)
  9. ^ "Glimmer: Microbial Gene-Finding System". Johns Hopkins University - Center for Computational Biology.
  10. ^ Delcher, Arthur (2007). "Identifying bacterial genes and endosymbiont DNA with Glimmer". Bioinformatics. 23 (6): 673–679. doi:10.1093/bioinformatics/btm009. PMID 17237039. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)
  11. ^ Burge, Christopher. "The GENSCAN Web Server at MIT". Retrieved 2 October 2013.
  12. ^ Burge, Chris (1997). "Prediction of Complete Gene Structures in Human Genomic DNA". J. Mol. Bio. 268 (268): 78–94. doi:10.1006/jmbi.1997.0951. PMID 9149143. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)
  13. ^ Burge, Christopher (1998). "Finding the Genes in Genomic DNA". Current Opinion in Structural Biology. 8 (3): 346–354. doi:10.1016/S0959-440X(98)80069-9. PMID 9666331. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)
  14. ^ Korbel, Jan (12). "Systematic Prediction and Validation of Breakpoints asociated with Copy-Number Variations in the Human Genome". PNAS. 104 (24): 10110–5. doi:10.1073/pnas.0703834104. PMC 1891248. PMID 17551006. {{cite journal}}: Check date values in: |date= and |year= / |date= mismatch (help); Unknown parameter |coauthors= ignored (|author= suggested) (help); Unknown parameter |month= ignored (help)

External Links[edit]