20

I am trying to use a hidden Markov model (HMM) for a problem where I have M different observed variables (Yti) and a single hidden variable (Xt) at each time point, t. For clarity, let us assume all observed variables (Yti) are categorical, where each Yti conveys different information and as such may have different cardinalities. An illustrative example is given in the figure below, where M=3.

[Figure: a single hidden variable Xt with M = 3 observed variables Yt1, Yt2, Yt3 at each time step t]

My goal is to train the transition, emission, and prior probabilities of an HMM, using the Baum-Welch algorithm, from my observed variable sequences (Yti). Let's say Xt will initially have 2 hidden states.

I have read a few tutorials (including the famous Rabiner paper) and went through the code of a few HMM software packages, namely the 'HMM Toolbox' in MATLAB and the 'hmmpytk' package in Python. Overall, I did an extensive web search, and all the resources that I could find only cover the case where there is a single observed variable (M=1) at each time point. This increasingly makes me think HMMs are not suitable for situations with multiple observed variables.

  • Is it possible to model the problem depicted in the figure as an HMM?
  • If it is, how can one modify the Baum-Welch algorithm to cater for training the HMM parameters based on the multi-variable observation (emission) probabilities?
  • If not, do you know of a methodology that is more suitable for the situation depicted in the figure?

Thanks.

Edit: In this paper, the situation depicted in the figure is described as a Dynamic Naive Bayes, which, in terms of the training and estimation algorithms, requires a slight extension of the Baum-Welch and Viterbi algorithms for a single-variable HMM.

Zhubarb
  • Are you talking about [something like this](http://vision.gel.ulaval.ca/~parizeau/Publications/P971225.pdf)? – Roger Rowland Jul 06 '13 at 08:16
  • 1
    Naïvely I'd say that you can model such a problem as a standard HMM with only one observation. If each observation `Yti` is an element of R, just make each multi-observation `Yt` an element of R^3. This will lead to n^3 possible multi-observations, if n is the number of possible single observations. However, this **might** (I am not sure/do not know) lead to the need for more training data compared to a more sophisticated approach. – aufziehvogel Jul 06 '13 at 08:35
  • @Aufziehvogel, well, that is a possibility, but as you point out, it inflates the probability space immensely and would require a lot of training data. Also, graphically, the model you suggest would not look like the one given in the picture. – Zhubarb Jul 06 '13 at 11:54
  • @Roger, thanks, I came across this paper before. From what I understand in Section 4.1, Equation 43, for k observation sequences: P(O|lambda) is the product P(O1|lambda) * P(O2|lambda) * ... * P(Ok|lambda) if the sequences are independent. But I believe this is still for the case where O1,...,Ok are all observation sequences that belong to a single variable (e.g. daily heart rate readings of a patient), whereas in my scenario I want to be able to use heart rate + oxygen saturation + other separate readings (variables). – Zhubarb Jul 06 '13 at 13:11
  • Ok I understand what you're getting at now but I can't intuitively see how it might work. The assumption I think you're making is that observations of different variables can all be explained by a single underlying state transition model, which is where I get anxious. If you train a number of HMMs, each on observations of a different variable, *do* you get similar HMM parameters as a result? If you do, then you only need to model the emission probabilities of each variable at each state change. If you don't then a single HMM is (probably) not suitable. – Roger Rowland Jul 07 '13 at 09:04
  • If I'm correct in my understanding above, and bearing in mind the previous comments from @Aufziehvogel about the problem of a very sparsely populated probability space, it might be worth looking at a dynamic Bayesian network (DBN), which is a generalisation of HMM anyway (a DBN can have any number of state variables and evidence variables). Chapter 15 in [Russell and Norvig's AI book](http://aima.cs.berkeley.edu/) gives a good description. – Roger Rowland Jul 07 '13 at 09:52
  • @Roger, yes, it may be the case that I have to use DBNs (which are infinitely more complex, I guess). I edited the question: the situation depicted in it is actually a Dynamic Naive Bayes. My efforts were towards handling the situation with the simplest approach possible, i.e. in this case HMMs, but I increasingly feel that HMMs are probably not fit for this problem definition. – Zhubarb Jul 08 '13 at 08:26
  • 1
    I agree about the HMM issue - they are very specific to single-observation cases, but DBNs can take advantage of the sparsely populated probability space, so they are well worth considering. I trawled through [Daphne Koller's PGM book](http://www.amazon.com/books/dp/0262013193) to see if there was any insight there, but it led me to DBNs again anyway, so I guess HMM is out. Sounds like an interesting application though - good luck. – Roger Rowland Jul 08 '13 at 08:31
  • FYI, it seems that the hmmlearn package in Python now allows the possibility of having several observed variables, but MATLAB does not allow it yet. – PierreE Sep 22 '17 at 16:38

6 Answers

7

The simplest way to do this, and have the model remain generative, is to make the y_is conditionally independent given the x_is. This leads to trivial estimators, and relatively few parameters, but is a fairly restrictive assumption in some cases (it's basically the HMM form of the Naive Bayes classifier).

EDIT: here is what this means. For each timestep i, you have a multivariate observation y_i = {y_i1, ..., y_in}. You treat the y_ij as conditionally independent given x_i, so that:

p(y_i|x_i) = \prod_j p(y_ij | x_i)

You're then effectively learning a naive Bayes classifier for each possible value of the hidden variable x. (Conditional independence is important here: there are dependencies in the unconditional distribution of the y's.) This can be learned with standard EM for an HMM.
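
To make that concrete, here is a minimal NumPy sketch of the factorized emission term (the toy tables and names like `emission_probs` are illustrative assumptions, not taken from any particular package):

```python
import numpy as np

# One emission table per observed variable j: shape (n_states, cardinality_j)
emission_probs = [
    np.array([[0.7, 0.3], [0.2, 0.8]]),            # y_1: 2 categories
    np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),  # y_2: 3 categories
    np.array([[0.9, 0.1], [0.4, 0.6]]),            # y_3: 2 categories
]

def emission_likelihood(obs, emission_probs):
    """p(y_i | x_i = k) = prod_j p(y_ij | x_i = k), returned for every state k.

    obs is a tuple (y_i1, ..., y_in) of category indices for one timestep."""
    n_states = emission_probs[0].shape[0]
    lik = np.ones(n_states)
    for j, y_ij in enumerate(obs):
        lik *= emission_probs[j][:, y_ij]   # conditional independence given x_i
    return lik

print(emission_likelihood((0, 2, 1), emission_probs))  # one value per hidden state
```

In the E-step you plug this product into the usual forward-backward recursions; in the M-step each table is re-estimated from the expected counts of its own variable.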

You could also, as one commenter said, treat the concatenation of the y_ijs as a single observation, but if the dimensionality of any of the j variables is beyond trivial, this will lead to a lot of parameters and you'll need far more training data.
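
For comparison, a sketch of that concatenation approach (variable names are again illustrative): each tuple of categories is flattened into a single symbol, which a standard single-observation discrete HMM could then be trained on. The symbol alphabet grows as the product of the cardinalities, which is where the parameter blow-up comes from.

```python
import numpy as np

cardinalities = (2, 3, 2)        # cardinality of each observed variable
Y = np.array([[0, 2, 1],
              [1, 0, 0],
              [1, 2, 1]])        # T x M matrix of categorical observations

# Flatten each row (y_i1, y_i2, y_i3) into one code in {0, ..., 2*3*2 - 1}
flat_obs = np.ravel_multi_index(Y.T, dims=cardinalities)
print(flat_obs)                  # [ 5  6 11]; 12 possible symbols in total
```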

Do you specifically need the model to be generative? If you're only looking for inference in the x_is, you'd probably be much better served with a conditional random field, which through its feature functions can have far more complex observations without the same restrictive assumptions of independence.

T D Nguyen
Ben Allison
  • I do not mind having an undirected model at all. The two reasons I tried using HMMs were: 1- They seemed simple, 2- I actually do not know the states of the hidden variable I want to predict. In other words, I did not initially model the problem as a supervised learning one. Aufziehvogel's suggestion is not ideal since I won't have enough training data. By the way, I did not quite understand (probably the notation of) your initial remark on "y_is conditionally independent given the x_is" – Zhubarb Jul 08 '13 at 09:02
  • OK, so by not knowing the states, I assume you mean that you're doing this unsupervised, i.e. you don't have any training data, rather than you're just not sure of the labels. I've updated my answer, but it looks like the paper you found is suggesting the same thing – Ben Allison Jul 08 '13 at 16:50
  • And yes, if you don't have any training data, then the CRF (which is the discriminative version of the HMM) won't be an option – Ben Allison Jul 08 '13 at 16:55
  • I have a stream of M observed variables (notation as given in the question). I want to introduce a hidden state, Xt, to my problem domain and model it as an (extension of an) HMM. This is motivated by the fact that the system acts differently at different intervals, and I want to be able to model these hidden 'state shifts' eventually by implementing Viterbi. So, yes, it is 'unsupervised', in that I don't have labels or even certain knowledge about the cardinality of Xt. I believe the Dynamic NB approach described in the paper is suitable for this. Would you agree? – Zhubarb Jul 09 '13 at 08:19
  • Yeah, I had a quick skim over the paper. I believe the dynamic naive bayes classifier is what I was proposing, and the standard HMM is the concatenated observation. The DNB thing will work---however, I realised you hadn't said what your features were. If they're continuous, rather than using the DNB you should look at the multi-variate Gaussian HMM, which is a standard model (observation distribution is multivariate normal, instead of the naive bayes distribution) – Ben Allison Jul 09 '13 at 11:19
3

I found that this can be achieved by modelling the system as a Dynamic Naive Bayes classifier (DNB), which is a slight extension of an ordinary (single-variable) HMM that can cater for multi-observation scenarios as shown in the figure.

Note that the DNB still has a hidden state and should therefore not be regarded as a direct sequential extension of the original Naive Bayes classifier. The 'naive' in the algorithm's name comes from the fact that all observed variables are independent of each other, given the hidden state variable.

As with an ordinary HMM, the parameters of this model can be estimated via the Baum-Welch (or EM, whichever you prefer to call it) algorithm. Since the emission distribution at each time step is now the product of P(Yti|Xt) over the observed variables Yti, the forward, backward, and joint variable equations need to be slightly modified, as described in Section 3 of this paper by Aviles-Arriaga et al.
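
As an illustration, here is a minimal sketch of that modification (my own notation, not the code or exact notation from the paper): only the emission term of the forward recursion changes, becoming a product over the M observed variables.

```python
import numpy as np

def forward(Y, pi, A, emission_probs):
    """Scaled forward pass for a Dynamic Naive Bayes / multi-observation HMM.

    Y: (T, M) integer observations; pi: (K,) prior; A: (K, K) transition matrix;
    emission_probs[j]: (K, card_j) table for observed variable j.
    Returns the scaled alphas and the log-likelihood of the sequence."""
    T, M = Y.shape
    K = len(pi)
    alpha = np.zeros((T, K))
    log_lik = 0.0
    for t in range(T):
        # Factorized emission term: b_k(Y_t) = prod_j P(Y_tj | X_t = k)
        b = np.ones(K)
        for j in range(M):
            b *= emission_probs[j][:, Y[t, j]]
        alpha[t] = pi * b if t == 0 else (alpha[t - 1] @ A) * b
        scale = alpha[t].sum()           # rescale to avoid numerical underflow
        alpha[t] /= scale
        log_lik += np.log(scale)
    return alpha, log_lik
```

The backward pass and the update of each P(Yti|Xt) table follow the same pattern: wherever the single-variable emission probability appears in Baum-Welch, substitute the product over the observed variables.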

Zhubarb
2

What you are looking for is called a structured perceptron. Take a look at the following slides, page 42: http://www.cs.umd.edu/class/fall2015/cmsc723/slides/inclass_09.pdf

Rouzbeh
0

You could model the problem using tensors: structure a tensor from the observed time series and then identify the HMM parameters from it. "Hidden Markov Model Identifiability via Tensors" is a good reference for this.

MATLAB provides a tensor toolbox.

FYI, I am working on a related problem, so feel free to email me if you want to discuss it in a more private manner.

user18568
0

You could try a hidden semi-Markov model, which is an extension of the HMM that allows each state to last for multiple time periods.

0

This paper proposes an algorithm to solve the problem.

Dandelion