Hidden Markov models are widely used in science, engineering and many other areas (speech recognition, optical character recognition, machine translation, bioinformatics, computer vision, finance and economics, and in social science).
Definition: The Hidden Markov Model (HMM) is a variant of a finite state machine having a set of hidden states, Q, an output alphabet (observations), O, transition probabilities, A, output (emission) probabilities, B, and initial state probabilities, Π. The current state is not observable. Instead, each state produces an output with a certain probability (B). Usually the states, Q, and outputs, O, are understood, so an HMM is said to be a triple, ( A, B, Π ).
Formal Definition:
Hidden states Q = { qi }, i = 1, . . . , N .
Transition probabilities A = {aij = P(qj at t +1 | qi at t)}, where P(a | b) is the conditional probability of a given b, t = 1, . . . , T is time, and qi in Q. Informally, A is the probability that the next state is qj given that the current state is qi.
Observations (symbols) O = { ok }, k = 1, . . . , M .
Emission probabilities B = { bik = bi(ok) = P(ok | qi) }, where ok in O. Informally, B is the probability that the output is ok given that the current state is qi.
Initial state probabilities Π = {pi = P(qi at t = 1)}.
[Figure: HMM]
The model is characterized by the complete set of parameters: Λ = {A, B, Π }.
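For illustration, here is a minimal sketch in Python of such a parameter set Λ = {A, B, Π}; the two-state, three-symbol model and all numbers below are invented purely as an example:

```python
import numpy as np

# Hypothetical two-state HMM over a three-symbol alphabet (numbers are made up).
# A[i, j] = P(q_j at t+1 | q_i at t)   -- transition probabilities
# B[i, k] = P(o_k | q_i)               -- emission probabilities
# pi[i]   = P(q_i at t = 1)            -- initial state probabilities
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

# Each row of A and B, and the vector pi, must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```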
There are 3 canonical problems to solve with HMMs:
1. Evaluation: given the model Λ and an observation sequence O, compute the probability P(O) that the model produced the sequence (the Forward and Backward algorithms).
2. Decoding: given the model and an observation sequence, find the most likely sequence of hidden states (posterior decoding and the Viterbi algorithm).
3. Learning: given an observation sequence, estimate the model parameters Λ = {A, B, Π} that best explain it (the re-estimation formulas at the end of this section).
Let αt(i) be the probability that the partial observation sequence Ot = {o(1), o(2), ... , o(t)} is produced by all possible state sequences that end at the i-th state.
αt(i) = P(o(1), o(2), ... , o(t), q(t) = qi ).
Then the unconditional probability of the partial observation sequence is the sum of αt(i) over all N states.
[Figure: Observed and hidden sequences]
The Forward Algorithm is a recursive algorithm for calculating αt(i) for observation sequences of increasing length t. First, the probabilities for the single-symbol sequence are calculated as a product of the initial i-th state probability and the emission probability of the given symbol o(1) in the i-th state. Then the recursive formula is applied. Assume we have calculated αt(i) for some t. To calculate αt+1(j), we multiply every αt(i) by the corresponding transition probability from the i-th state to the j-th state, sum the products over all states, and then multiply the result by the emission probability of the symbol o(t+1) in the j-th state. Iterating the process, we eventually calculate αT(i), and summing over all states gives the required probability.
Formal Definition
Initialization:
α1(i) = pi bi(o(1)) , i =1, ... , N
Recursion:
αt+1(j) = [ Σi=1..N αt(i) aij ] bj(o(t+1)) , here j =1, ... , N , t =1, ... , T - 1
Termination:
P(O) = Σi=1..N αT(i)
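A sketch of the Forward Algorithm in Python, assuming the toy arrays A, B, pi above and an observation sequence given as integer symbol indices:

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward variables alpha[t, i] and the total probability P(O)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    # Initialization: alpha_1(i) = pi_i * b_i(o(1))
    alpha[0] = pi * B[:, obs[0]]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o(t+1))
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Termination: P(O) = sum_i alpha_T(i)
    return alpha, alpha[-1].sum()
```

For long sequences the αt(i) underflow; practical implementations rescale each step or work with log probabilities, which is omitted here for clarity.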
Similarly, let βt(i) be the conditional probability of the partial observation sequence from o(t+1) to the end, given that the state at time t is qi:
βt(i) = P(o(t+1), o(t+2), ... , o(T) | q(t) = qi ).
The Backward Algorithm computes these backward variables recursively, moving backward along the observation sequence. The Forward Algorithm is typically used for calculating the probability that an observation sequence is emitted by an HMM, but, as we shall see later, both procedures are heavily used for finding the optimal state sequence and estimating the HMM parameters.
Formal Definition
Initialization:
βT (i) = 1 , i =1, ... , N
According to the above definition, βT(i) does not exist. This is a formal extension of the below recursion to t = T.
Recursion:
βt(i) = Σj=1..N aij bj(o(t+1)) βt+1(j) , here i =1, ... , N , t = T - 1, T - 2 , . . . , 1
Termination:
P(O) = Σi=1..N pi bi(o(1)) β1(i)
Obviously, both the Forward and Backward algorithms must give the same result for the total probability P(O) = P(o(1), o(2), ... , o(T) ).
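A matching sketch of the Backward Algorithm, together with a check that both procedures give the same P(O). It reuses the toy A, B, pi and the forward function from the sketches above; the observation sequence is again invented:

```python
import numpy as np

def backward(obs, A, B, pi):
    """Backward variables beta[t, i] and the total probability P(O)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    # Initialization: beta_T(i) = 1 (formal convention)
    beta[-1] = 1.0
    # Recursion: beta_t(i) = sum_j a_ij * b_j(o(t+1)) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # Termination: P(O) = sum_i pi_i * b_i(o(1)) * beta_1(i)
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()

obs = [0, 2, 1, 1]                 # made-up observation sequence (symbol indices)
_, p_fwd = forward(obs, A, B, pi)
_, p_bwd = backward(obs, A, B, pi)
assert np.isclose(p_fwd, p_bwd)    # Forward and Backward agree on P(O)
```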
There are several possible criteria for finding the most likely sequence of hidden states. One is to choose states that are individually most likely at the time when a symbol is emitted. This approach is called posterior decoding.
Let λt(i) be the probability that the model is in the i-th state at time t (when the symbol o(t) is emitted), given the observation sequence O.
λt(i) = P( q(t) = qi | O ).
It is easy to derive that
λt(i) = αt(i) βt(i) / P( O ) , i =1, ... , N , t =1, ... , T
Then at each time t we can select the state q(t) that maximizes λt(i):
q(t) = arg max i {λt(i)} , t =1, ... , T
Posterior decoding works fine in the case when the HMM is ergodic, i.e. there is a transition from any state to any other state. If applied to an HMM of another architecture, this approach could give a sequence that is not a legitimate path because some transitions are not permitted.
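With the forward and backward variables from the sketches above, posterior decoding reduces to a few lines (a sketch, not a robust implementation):

```python
def posterior_decode(obs, A, B, pi):
    """At each time t, pick the individually most likely state."""
    alpha, p_obs = forward(obs, A, B, pi)
    beta, _ = backward(obs, A, B, pi)
    lam = alpha * beta / p_obs       # lambda_t(i) = alpha_t(i) * beta_t(i) / P(O)
    return lam, lam.argmax(axis=1)   # state index maximizing lambda_t(i) at each t
```

As noted above, the returned state sequence is not guaranteed to be a valid path when some transition probabilities are zero.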
The Viterbi algorithm chooses the state sequence that maximizes the likelihood of the state sequence for the given observation sequence.
Let δt(i) be the maximal probability of state sequences of length t that end in state i and produce the first t observations for the given model.
δt(i) = max {P(q(1), q(2), ... , q(t-1), q(t) = qi ; o(1), o(2), ... , o(t))}, where the maximum is taken over all state sequences q(1), ... , q(t-1).
The Viterbi algorithm is a dynamic programming algorithm that uses the same schema as the Forward algorithm except for two differences:
1. It uses maximization over the previous states in place of summation.
2. It keeps track of the maximizing argument in ψt(i), which is then used to backtrack the best state sequence.
Initialization:
δ1(i) = pi bi(o(1))
ψ1(i) = 0 , i =1, ... , N
Recursion:
δt(j) = max i [δt-1(i) aij] bj(o(t))
ψt(j) = arg max i [δt-1(i) aij] , j =1, ... , N , t = 2, ... , T
Termination:
p* = max i [δT(i)]
q*T = arg max i [δT(i)]
Path (state sequence) backtracking:
q*t = ψt+1( q*t+1) , t = T - 1, T - 2 , . . . , 1
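A sketch of the Viterbi algorithm in the same style (delta and psi correspond to δt(i) and ψt(i); the toy arrays above are assumed):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state sequence q* and its probability p*."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    # Initialization: delta_1(i) = pi_i * b_i(o(1)), psi_1(i) = 0
    delta[0] = pi * B[:, obs[0]]
    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o(t))
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A      # trans[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # Termination and path backtracking
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```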
Finally, consider the third canonical problem: re-estimating the model parameters Λ = {A, B, Π} from an observed sequence (the Baum-Welch procedure). Let us define ξt(i, j), the joint probability of being in state qi at time t and in state qj at time t + 1, given the model and the observed sequence:
ξt(i, j) = P(q(t) = qi , q(t+1) = qj | O, Λ)
The figure below illustrates the calculation of ξ t(i, j).
[Figure: Joint probability paths]
Therefore we get
ξt(i, j) = αt(i) aij bj(o(t+1)) βt+1(j) / P(O)
The probability of the output sequence can be expressed as
P(O) = Σi=1..N Σj=1..N αt(i) aij bj(o(t+1)) βt+1(j) = Σi=1..N αt(i) βt(i) , for any t = 1, ... , T - 1
The probability of being in state qi at time t:
λt(i) = Σj=1..N ξt(i, j) = αt(i) βt(i) / P(O)
Initial probabilities:
pi = λ1(i)
Transition probabilities:
aij = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 λt(i)
Emission probabilities:
bik = Σ* λt(i) / Σt=1..T λt(i)
In the above equation Σ* denotes the sum over t such that o(t) = ok.
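One re-estimation pass over a single observation sequence, following the formulas above, might look like the sketch below. It reuses the forward and backward functions; a practical implementation would rescale the variables to avoid underflow and accumulate statistics over many training sequences:

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One re-estimation (EM) update of A, B, pi from a single sequence."""
    obs = np.asarray(obs)
    alpha, p_obs = forward(obs, A, B, pi)
    beta, _ = backward(obs, A, B, pi)
    # xi[t, i, j] = P(q(t) = q_i, q(t+1) = q_j | O, Lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / p_obs
    # lam[t, i] = P(q(t) = q_i | O, Lambda)
    lam = alpha * beta / p_obs
    pi_new = lam[0]                                            # initial probabilities
    A_new = xi.sum(axis=0) / lam[:-1].sum(axis=0)[:, None]     # transition probabilities
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):                                # emission probabilities:
        B_new[:, k] = lam[obs == k].sum(axis=0)                # sum over t with o(t) = o_k
    B_new /= lam.sum(axis=0)[:, None]
    return A_new, B_new, pi_new
```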
© Nikolai Shokhirev, 2001 - 2024