Hidden Markov models are widely used in science, engineering and many other areas (speech recognition, optical character recognition, machine translation, bioinformatics, computer vision, finance and economics, and in social science).
Definition: The Hidden Markov Model (HMM) is a variant of a finite state machine having a set of hidden states, Q, an output alphabet (observations), O, transition probabilities, A, output (emission) probabilities, B, and initial state probabilities, Π. The current state is not observable. Instead, each state produces an output with a certain probability (B). Usually the states, Q, and outputs, O, are understood, so an HMM is said to be a triple, ( A, B, Π ).
Formal Definition:
Hidden states Q = { qi }, i = 1, . . . , N .
Transition probabilities A = {aij = P(qj at t +1 | qi at t)}, where P(a | b) is the conditional probability of a given b, t = 1, . . . , T is time, and qi in Q. Informally, A is the probability that the next state is qj given that the current state is qi.
Observations (symbols) O = { ok }, k = 1, . . . , M .
Emission probabilities B = { bik = bi(ok) = P(ok | qi) }, where ok in O. Informally, B is the probability that the output is ok given that the current state is qi.
Initial state probabilities Π = {pi = P(qi at t = 1)}.
HMM
The model is characterized by the complete set of parameters: Λ = {A, B, Π }.
There are 3 canonical problems to solve with HMMs:
Let αt(i) be the probability of the partial observation sequence Ot = {o(1), o(2), ... , o(t)} to be produced by all possible state sequences that end at the i-th state.
αt(i) = P(o(1), o(2), ... , o(t) | q(t) = qi ).
Then the unconditional probability of the partial observation sequence is the sum of αt(i) over all N states.
Observed and hidden sequences
The Forward Algorithm is a recursive algorithm for calculating αt(i) for the observation sequence of increasing length t . First, the probabilities for the single-symbol sequence are calculated as a product of initial i-th state probability and emission probability of the given symbol o(1) in the i-th state. Then the recursive formula is applied. Assume we have calculated αt(i) for some t. To calculate αt+1(j), we multiply every αt(i) by the corresponding transition probability from the i-th state to the j-th state, sum the products over all states, and then multiply the result by the emission probability of the symbol o(t+1). Iterating the process, we can eventually calculate αT(i), and then summing them over all states, we can obtain the required probability.
Formal Definition
Initialization:
α1(i) = pi bi(o(1)) , i =1, ... , N
Recursion:
here i =1, ... , N , t =1, ... , T - 1
Termination:
βt(i) = P(o(t+1), o(t+2), ... , o(T) | q(t) = qi ).
The Backward Algorithm calculates recursively backward variables going backward along the observation sequence. The Forward Algorithm is typically used for calculating the probability of an observation sequence to be emitted by an HMM, but, as we shall see later, both procedures are heavily used for finding the optimal state sequence and estimating the HMM parameters.
Formal Definition
Initialization:
βT (i) = 1 , i =1, ... , N
According to the above definition, βT(i) does not exist. This is a formal extension of the below recursion to t = T.
Recursion:
here i =1, ... , N , t = T - 1, T - 2 , . . . , 1
Termination:
Obviously both Forward and Backward algorithms must give the same results for total probabilities P(O) = P(o(1), o(2), ... , o(T) ).
There are several possible criteria for finding the most likely sequence of hidden states. One is to choose states that are individually most likely at the time when a symbol is emitted. This approach is called posterior decoding.
Let λ t(i) be the probability of the model to emit the symbol o(t) being in the i-th state for the given observation sequence O.
λ t(i) = P( q(t) = qi | O ).
It is easy to derive that
λ t(i) = αt(i) βt(i) / P( O ) , i =1, ... , N , t =1, ... , T
Then at each time we can select the state q(t) that maximizes λ t(i).
q(t) = arg max {λ t(i)}
Posterior decoding works fine in the case when HMM is ergodic, i.e. there is transition
from any state to any other state. If applied to an HMM of another architecture, this approach
could give a sequence that may not be a legitimate path because some transitions are not
permitted.
The Viterbi algorithm chooses the best state sequence that maximizes the likelihood of the state sequence for the given observation sequence.
Let δ t(i) be the maximal probability of state sequences of the length t that end in state i and produce the t first observations for the given model.
δ t(i) = max{P(q(1), q(2), ..., q(t-1) ; o(1), o(2), ... , o(t) | q(t) = qi ).}
The Viterbi algorithm is a dynamic programming algorithm that uses the same schema as the Forward algorithm except for two differences:
Initialization:
δ1(i) = pi bi(o(1)) |
ψ1(i) = 0 , i =1, ... , N |
According to the above definition, βT(i) does not exist. This is a formal extension of the below recursion to .
Recursion:
δt ( j) = max i [δt - 1(i) aij] b j (o(t)) |
ψt( j) = arg max i [δt - 1(i) aij] |
Termination:
p* = max i [δT( i )] |
q*T = arg max i [δT( i )] |
Path (state sequence) backtracking:
q*t = ψt+1( q*t+1) , t = T - 1, T - 2 , . . . , 1
Let us define ξ t(i, j), the joint probability of being in state q i at time t and state q j at time t +1 , given the model and the observed sequence:
ξ t(i, j) = P(q(t) = q i, q(t+1) = q j | O, Λ)
The figure below illustrates the calculation of ξ t(i, j).
Joint probability paths.
Therefore we get
The probability of output sequence can be expressed as
The probability of being in state q i at time t:
Initial probabilities:
Transition probabilities:
Emission probabilities:
In the above equation Σ * denotes the sum over t so that o(t) = ok.
© Nikolai Shokhirev, 2001 - 2025