I'm working on a sequence forecasting problem and I don't have much experience in this area, so some of the questions below may be naive.
FYI: I've created a follow-up question with a focus on CRFs here
I have the following problem:
I would like to forecast a binary sequence for multiple, non-independent variables.
Inputs:
I have a dataset with the following variables:
- Timestamps
- Groups A and B
- Binary signal corresponding to each group at a particular timestamp
Additionally, suppose the following:
- We can extract additional attributes from the timestamps (e.g. hour of day) which can be used as external predictors
- We believe that groups A and B are not independent; it might therefore be best to model their behaviour jointly
binary_signal_group_A and binary_signal_group_B are the two non-independent variables that I would like to forecast using (1) their past behaviour and (2) additional information extracted from each timestamp.
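To make the "additional attributes" idea concrete, this is the sort of feature extraction I have in mind (a minimal sketch -- the frame and column names here are made up purely for illustration):

import pandas as pd

# hypothetical example of deriving external predictors from raw timestamps
raw = pd.DataFrame({'timestamps': pd.date_range('2019-01-04', periods=6, freq='H')})
raw['hour_of_day'] = raw['timestamps'].dt.hour                         # 0-23
raw['is_weekend'] = (raw['timestamps'].dt.dayofweek >= 5).astype(int)  # Sat/Sun -> 1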
What I've done so far:
# required libraries
import re
import numpy as np
import pandas as pd
from keras import Sequential
from keras.layers import LSTM
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame  # create a sample dataframe
      .from_records(np.random.randint(2, size=[data_length, 3]))
      .rename(columns={0: 'a', 1: 'b', 2: 'extra'}))
# NOTE: the 'extra' variable stands for a generic predictor, e.g. an 'is_weekend' indicator; what it is doesn't really matter
# shift so that our sequences are in rows (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):  # 'extra' is only ever a predictor, so it gets no 'next' columns
            charge = 'next' if s > 0 else 'last'  # the 'next' variables are what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(-s)  # NB: shift(-s), since a negative pandas shift is what pulls future values back
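# sanity check on the shift direction, since it is easy to get backwards:
# a positive pandas shift lags the series (past values), while a negative shift
# pulls future values back, which is why the 'next' targets above use shift(-s)
demo = pd.Series([10, 20, 30])
demo.shift(1)   # [NaN, 10.0, 20.0] -> past values ('last')
demo.shift(-1)  # [20.0, 30.0, NaN] -> future values ('next')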
# drop the original variables and trim the rows with missing values generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
#    a_last_03  a_last_02  ...  extra_last_02  extra_last_01
# 3          0          1  ...              0              1
# 4          1          0  ...              0              0
# 5          0          1  ...              1              0
# 6          0          0  ...              0              1
# 7          0          0  ...              1              0
# [5 rows x 15 columns]
# separate predictors and response
response_df_dict = {}
for g in ['a', 'b']:
    # match on the prefix so that substring collisions (e.g. the 'a' in 'extra') can never sneak in
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and c.startswith(g)]]
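# at this point response_df_dict['a'] holds the columns ['a_next_01', 'a_next_02', 'a_next_03'],
# i.e. the future values of a (and likewise for 'b')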
# reformat for LSTM
# the response for every row is a matrix with a depth of 2 (the number of groups) and width = shift_length
# the predictors have the same dimensions, except that the depth is the number of predictors we have
predictor_array_list = []
col_prefix = set([re.sub(r'_\d+$', '', c) for c in df.columns if 'next' not in c])
for c in sorted(col_prefix):  # sort for a deterministic channel order across runs
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    predictor_array_list.append(current_array)
# rearrange the axes into (samples, timesteps, channels/variables)
response_array = np.array([response_df_dict['a'].values, response_df_dict['b'].values])
response_array = np.transpose(response_array, (1, 2, 0))  # transpose, not reshape: a plain reshape would scramble the sample/channel mapping
predictor_array = np.array(predictor_array_list)
predictor_array = np.transpose(predictor_array, (1, 2, 0))
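# quick self-check that the arrays line up: 12 usable rows survive the dropna,
# with 3 timesteps, 3 predictor channels (a_last, b_last, extra_last) and 2 response channels
print(predictor_array.shape)  # (12, 3, 3) -> (samples, timesteps, predictor channels)
print(response_array.shape)   # (12, 3, 2) -> (samples, timesteps, groups)
assert predictor_array.shape[:2] == response_array.shape[:2]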
# feed into the model
model = Sequential()
model.add(LSTM(8, input_shape=(predictor_array.shape[1],predictor_array.shape[2]), return_sequences=True)) # the number of neurons here can be anything
model.add(LSTM(2, return_sequences=True)) # should I use an activation function here? the number of neurons here must be equal to the # of groups we are predicting
model.summary()
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# lstm_62 (LSTM)               (None, 3, 8)              384
# _________________________________________________________________
# lstm_63 (LSTM)               (None, 3, 2)              88
# =================================================================
# Total params: 472
# Trainable params: 472
# Non-trainable params: 0
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # is it valid to use crossentropy and accuracy as metric?
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict_classes(predictor_array) # not gonna worry about train/test split here
model_preds.shape # I expected (12, 3, 2), i.e. (# of records, # of timesteps, # of groups, which are a and b), but instead I get:
# (12, 3)
model_preds
# array([[1, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0]])
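I suspect the shape issue comes from predict_classes taking an argmax over the last axis, and that the fix is to threshold the raw output of model.predict instead -- but I'm not sure this is the right approach, especially since my last layer currently has no sigmoid squashing its outputs into [0, 1] (which is part of what I'm asking below):

# what I think the fix might look like, assuming the last layer gets a sigmoid activation
raw_preds = model.predict(predictor_array)    # shape (12, 3, 2): one score per group per timestep
binary_preds = (raw_preds > 0.5).astype(int)  # threshold each group's channel independently
binary_preds.shape                            # (12, 3, 2)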
Questions:
The main question here is this: how do I get this working so that the model would forecast the next N sequences for both groups?
Additionally, I would like to ask the following questions:
- Groups A and B are expected to be cross-correlated; is it valid to output both the A and B sequences with a single model, or should I fit two separate models, one predicting A and the other predicting B, with both using the historical A and B data as inputs?
- While the last layer in my model is an LSTM of shape (None, 3, 2), the prediction output has shape (12, 3) when I would have expected it to be (12, 3, 2) -- am I doing something wrong here, and if so, how would I fix it?
- As far as the output LSTM layer is concerned, would it be a good idea to use an activation function such as a sigmoid here? Why or why not?
- Is it valid to use a classification-type loss (binary cross-entropy) and metric (accuracy) for optimising a sequence?
- Is an LSTM model an optimal choice here? Would a CRF or some HMM-type model work better?
Many thanks!