
I'm playing around with Hidden Markov Models for a stock market prediction problem. My data matrix contains various features for a particular security:

01-01-2001, .025, .012, .01
01-02-2001, -.005, -.023, .02

I fit a simple GaussianHMM:

from hmmlearn.hmm import GaussianHMM
mdl = GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
mdl.fit(train[:, 1:])

Given the fitted model (λ), I can decode an observation sequence to find the most likely corresponding hidden state sequence:

print(mdl.decode(test[0:5, 1:]))
(72.75, array([2, 1, 2, 0, 0]))

Above, I've decoded the hidden state sequence for an observation sequence O = (O1, O2, ..., O5) containing the first five instances in a test set. I'd like to estimate the hidden state of the sixth instance. The idea is to iterate over a discrete set of possible feature values for the sixth instance and select the candidate O6 that maximizes the likelihood P(O1, O2, ..., O6 | λ). Once we observe the true feature values of O6, we can shift the length-5 sequence forward by one and do it all over again:

import numpy as np

l = 5
for i in range(len(test) - l):
    values = []
    for a in np.arange(-0.05, 0.05, .01):
        for b in np.arange(-0.05, 0.05, .01):
            for c in np.arange(-0.05, 0.05, .01):
                values.append(mdl.decode(np.vstack((test[i:i+l, 1:], np.array([a, b, c])))))
    # rank candidates by log-likelihood (first element of decode's return value)
    print(max(enumerate(values), key=lambda x: x[1][0]))

The problem is that when I decode the candidate sequences, the highest-likelihood prediction is almost always the same (e.g. the winning estimate always has feature values [ 0.04 0.04 0.04] for O6, with hidden state 0 in the sixth position):

(555, (74.71248518927949, array([2, 1, 2, 0, 0, 0]))) [ 0.04  0.04  0.04]
(555, (69.41963358191555, array([2, 2, 0, 0, 0, 0]))) [ 0.04  0.04  0.04]
(555, (77.11516871816922, array([2, 0, 0, 0, 0, 0]))) [ 0.04  0.04  0.04]

It's entirely possible that I'm misunderstanding the purpose of mdl.decode, and thus using it incorrectly. If that's the case, how best can I go about iterating over possible values of O6, and then maximizing P(O1, O2, ..., O6 | λ)?

datasci
  • Is it possible for you to post the complete code, and not just the snippet, please? It will make the question clearer. – Jaffer Wilson Aug 11 '17 at 06:55
  • If you keep resetting `values` on each iteration, your final `print` statement will only refer to `values` from the last loop. – andrew_reece Aug 13 '17 at 17:13
  • @JafferWilson: will do. @andrew_reece: I think that's the expected behavior. `i` represents the start of a 5-tick observation sequence `test[i:i+l]`. The idea is to iterate over a discrete set of possible feature values for the sixth (unobserved) instance [a,b,c], decode the new sequence with the estimate of the sixth instance included, and select as the final sequence the one with the highest probability. Once complete, we reset the values object, shift the sequence position by one, and estimate the next observation set. Maybe I just need to do this with the forward-backward algorithm ... – datasci Aug 14 '17 at 01:48

1 Answer


Are your actual values bounded in [-0.05, 0.05)?

When I initially built a sample dataset to look at your problem, I drew random floats in [0,1]. When I did that, I also got the same results you're observing: always the maximum value of (a,b,c) for the sixth entry, for every sequence, and always the same predicted class. But given that my data (uniformly distributed between 0 and 1) sat almost entirely above the grid-search values for the sixth entry (between -.05 and .05), it made sense that the HMM always picked the highest-valued three-tuple (.04,.04,.04), as that was closest to the main density of the distributions it had been trained on.

When I rebuilt the dataset with draws from a uniform distribution over the same range of possible values as we allow for the sixth entry ([-0.05,0.05)), the output was much more varied: both the O_t+1 selection and the class prediction showed reasonable differentiation across sequences. From your example data, it does seem like you have both positive and negative values, but you might try plotting the distribution of each feature to see whether your range of possible sixth-entry values is plausible, as sketched below.
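For example, here's a minimal sketch of that check, assuming the features live in a DataFrame like the df constructed below (the bin count and red marker positions are arbitrary choices):

import matplotlib.pyplot as plt

# histogram each feature and mark the grid-search bounds
axes = df.hist(bins=30, figsize=(9, 3), layout=(1, 3))
for ax in axes.ravel():
    ax.axvline(-0.05, color='red')  # lowest candidate value
    ax.axvline(0.04, color='red')   # highest value np.arange(-0.05, 0.05, .01) reaches
plt.show()

# or simply compare summary statistics against the grid range
print(df.describe())

If the bulk of a feature's mass falls outside the marked band, the grid search will keep snapping to whichever boundary is closest to it.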

Here's some sample data and evaluation code. It will print out a message each time there's a new optimal (a,b,c) sequence, or when the prediction for the 6th observation changes (just to show that they're not all the same). The highest likelihood for each 6-element sequence, along with predictions and the optimal 6th data point are stored in best_per_span.

First, construct a sample dataset:

import numpy as np
import pandas as pd

dates = pd.date_range(start="01-01-2001", end="12-31-2001", freq='D')
n_obs = len(dates)
n_feat = 3
values = np.random.uniform(-.05, .05, size=n_obs*n_feat).reshape((n_obs,n_feat))
df = pd.DataFrame(values, index=dates)

df.head()
                   0         1         2
2001-01-01  0.020891 -0.048750 -0.027131
2001-01-02  0.013571 -0.011283  0.041322
2001-01-03 -0.008102  0.034088 -0.029202
2001-01-04 -0.019666 -0.005705 -0.003531
2001-01-05 -0.000238 -0.039251  0.029307

Now split into training and test sets:

train_pct = 0.7
train_size = round(train_pct*n_obs)
train_ix = np.random.choice(range(n_obs), size=train_size, replace=False)
train_dates = df.index[train_ix]

train = df.loc[train_dates]
test = df.loc[~df.index.isin(train_dates)]

train.shape # (255, 3)
test.shape # (110, 3)

Fit 3-state HMM on training data:

# hmm throws a lot of deprecation warnings, we'll suppress them.
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore",category=DeprecationWarning)
    # in the most recent hmmlearn we can't import GaussianHMM directly anymore.
    from hmmlearn import hmm

mdl = hmm.GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
mdl.fit(train)
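
Once it's fit, it can be worth a quick sanity check of what the model learned; hmmlearn exposes the estimated parameters as attributes:

# per-state emission means, shape (n_components, n_features)
print(mdl.means_)
# diagonal emission covariances per state
print(mdl.covars_)
# state transition matrix; each row sums to 1
print(mdl.transmat_)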

Now grid search for optimal 6th (t+1) observation:

# length of O_t
span = 5

# final store of optimal configurations per O_t+1 sequence
best_per_span = []

# update these to demonstrate heterogeneous outcomes
current_abc = None
current_pred = None

for start in range(len(test)-span):
    flag = False
    end = start + span
    first_five = test.iloc[start:end].values
    output = []
    for a in np.arange(-0.05,0.05,.01):
        for b in np.arange(-0.05,0.05,.01):
            for c in np.arange(-0.05,0.05,.01):
                sixth = np.array([a, b, c])[:, np.newaxis].T
                all_six = np.append(first_five, sixth, axis=0)
                output.append((mdl.decode(all_six), (a,b,c)))

    best = max(output, key=lambda x: x[0][0])

    best_dict = {"start":start,
                 "end":end,
                 "sixth":best[1],
                 "preds":best[0][1],
                 "lik":best[0][0]}  
    best_per_span.append(best_dict)

    # below here is all reporting
    if best_dict["sixth"] != current_abc:
        current_abc = best_dict["sixth"]
        flag = True
        print("New abc for range {}:{} = {}".format(start, end, current_abc))

    if best_dict["preds"][-1] != current_pred:
        current_pred = best_dict["preds"][-1]
        flag = True
        print("New pred for 6th position: {}".format(current_pred))

    if flag:
        print("Test sequence StartIx: {}, EndIx: {}".format(start, end))
        print("Best 6th value: {}".format(best_dict["sixth"]))
        print("Predicted hidden state sequence: {}".format(best_dict["preds"]))
        print("Likelihood: {}\n".format(best_dict["nLL"]))

Example reporting output while the loop is running:

New abc for range 3:8 = [-0.01, 0.01, 0.0]
New pred for 6th position: 1
Test sequence StartIx: 3, EndIx: 8
Best 6th value: [-0.01, 0.01, 0.0]
Predicted hidden state sequence: [0 2 2 1 0 1]
Likelihood: 35.30144407374163

New abc for range 18:23 = [-0.01, -0.01, -0.01]
New pred for 6th position: 2
Test sequence StartIx: 18, EndIx: 23
Best 6th value: [-0.01, -0.01, -0.01]
Predicted hidden state sequence: [0 0 0 1 2 2]
Likelihood: 34.31813078939214

Example output from best_per_span:

[{'end': 5,
  'lik': 33.791537281734904,
  'preds': array([0, 2, 0, 1, 2, 2]),
  'sixth': [-0.01, -0.01, -0.01],
  'start': 0},
 {'end': 6,
  'lik': 33.28967307589143,
  'preds': array([0, 0, 1, 2, 2, 2]),
  'sixth': [-0.01, -0.01, -0.01],
  'start': 1},
 {'end': 7,
  'lik': 34.446813870838156,
  'preds': array([0, 1, 2, 2, 2, 2]),
  'sixth': [-0.01, -0.01, -0.01],
  'start': 2}]

Aside from the reporting element, this isn't a significant change to your initial approach, but it does seem to work as expected, without maxing out each time.
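
One last note, offered as a suggestion rather than a fix: decode returns the log-probability of the single best (Viterbi) state path, while the quantity in your question, P(O1, ..., O6 | λ), sums over all state paths. hmmlearn computes that total likelihood with score (the forward algorithm), so if that's the criterion you want to maximize, you could rank candidates with it instead. A minimal sketch, where best_sixth is a hypothetical helper name:

import numpy as np

def best_sixth(mdl, first_five, grid=np.arange(-0.05, 0.05, .01)):
    # rank each candidate 6th row by the forward log-likelihood
    # P(O | lambda) rather than the Viterbi path log-probability
    output = []
    for a in grid:
        for b in grid:
            for c in grid:
                all_six = np.vstack([first_five, [a, b, c]])
                output.append((mdl.score(all_six), (a, b, c)))
    return max(output, key=lambda x: x[0])

The two criteria can disagree, since the Viterbi path probability is only a lower bound on the total likelihood.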

andrew_reece