I want to use an LSTM to predict the next element in a sequence, but I am having trouble wrapping my head around how to formulate the problem and, more specifically, how to structure the data and present it to the model. I should say that I am fairly new to LSTMs: I followed the Coursera course on sequence models a while ago, but beyond the exercises there I never actually got around to trying things out. As a refresher I have been reading some of the (generally very helpful) Machine Learning Mastery blog posts, such as this one, this one, and this one, but they have only added to my confusion. This question seems related to mine, but the same applies: I keep getting confused when I try to translate it to my own problem.
Let me start out by describing my problem. I have 150 sequences of items of varying length (the shortest sequence contains about 100 items; the longest about 700). Each item can be represented by up to four features. I want to experiment with which features would work best; for the sake of the example, let's say I have two features representing each item.
What I want to do is to predict the next item given an n-long sequence of previous items: i.e., predict the item at time step t given the items at time steps [t-n, ..., t-2, t-1].
As mentioned, there are two (related) things that I am struggling with:
- How to structure the data.
- How to feed the data to the LSTM.
Let's start with point 1. As mentioned above, my initial data consists of 150 sequences of varying length. Each item i in each sequence is represented by two features: [f0_i, f1_i]. Let's say that the first sequence contains 100 items; this gives the following picture:
[[f0_0, f1_0], [f0_1, f1_1], [f0_2, f1_2], [f0_3, f1_3], [f0_4, f1_4], ... [f0_99, f1_99]]
So far, so good. Now suppose that I want to use a history of three time steps (previous items) to predict the next item. Does this mean that I have to restructure my training data to accommodate this? I.e., do I need to cut up each of the 150 sequences into multiple subsequences of time_steps items each, which I then collect in one superlist, like this (example for the first sequence):
X = [
# sequence 0, cut up
[[f0_0, f1_0], [f0_1, f1_1], [f0_2, f1_2]], # items 0-2 = training sample 0
[[f0_1, f1_1], [f0_2, f1_2], [f0_3, f1_3]], # items 1-3 = training sample 1
[[f0_2, f1_2], [f0_3, f1_3], [f0_4, f1_4]], # items 2-4 = training sample 2
...
[[f0_96, f1_96], [f0_97, f1_97], [f0_98, f1_98]] # items 96-98 = training sample 96
# sequence 1, cut up
...
# sequence 149, cut up
]
y = [
# labels for sequence 0, cut up
[f0_3, f1_3], # item 3 = target for training sample 0
[f0_4, f1_4], # item 4 = target for training sample 1
[f0_5, f1_5], # item 5 = target for training sample 2
...
[f0_99, f1_99] # item 99 = target for training sample 96
# labels for sequence 1, cut up
...
# labels for sequence 149, cut up
]
... where each element in X is a sample, and each element in y its target? (The whole list is later split into the training and the test set.) Or can I just add each complete sequence at once to this superlist and give that as input (leaving out the last item in X, as it has no y), like this (example for the first sequence):
X = [
# sequence 0 = training sample 0
[[f0_0, f1_0], [f0_1, f1_1], [f0_2, f1_2], ..., [f0_98, f1_98]]
# sequence 1 = training sample 1
# sequence 2 = training sample 2
...
# sequence 149 = training sample 149
]
y = [
# sequence 0 = targets 0
[[f0_1, f1_1], [f0_2, f1_2], [f0_3, f1_3], ..., [f0_99, f1_99]]
# sequence 1 = targets 1
# sequence 2 = targets 2
...
# sequence 149 = targets 149
]
In the second case I of course get far fewer sample sequences (150), but they are longer. However, the second case breaks the code downstream (see below), as model.fit() expects a 2D array for the y data. But how can y be a 2D array here?
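To make the first option concrete, this is roughly how I would generate the windowed data myself. This is just a sketch: make_windows is a name I made up, and sequences stands for my list of 150 item lists.

import numpy as np

def make_windows(sequences, time_steps=3):
    X, y = [], []
    for seq in sequences:
        for t in range(len(seq) - time_steps):
            X.append(seq[t:t + time_steps])  # items t .. t+time_steps-1
            y.append(seq[t + time_steps])    # the item to predict
    return np.array(X), np.array(y)

# X then has shape (total_num_subsamples, time_steps, 2),
# y has shape (total_num_subsamples, 2).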
Now to point 2. Here is my code, using X and y as described above (this case is for a stateful model, hence the for loop around the call to fit()):
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

batch_size = 1
time_steps = 3
num_features = len(train_X[0][0])  # = 2
epochs = 50
dropout = 0.2  # dropout rate (arbitrary value here)

model = Sequential()
model.add(LSTM(10, batch_input_shape=(batch_size, time_steps, num_features), stateful=True))
model.add(Dropout(dropout, seed=1))
model.add(Dense(num_features))
model.compile(loss='mean_squared_error', optimizer='Adam', metrics=['MSE'])

# stateful model: reset states manually between epochs
for i in range(epochs):
    history = model.fit(X, y, epochs=1, validation_split=0.2, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
It seems straightforward that I just set time_steps and num_features to 3 and 2, respectively. But what do I give as X and y?
- Should I use the preformatted data, which has the time steps 'encoded' in it, and where X has shape (total_num_subsamples, time_steps, num_features) and y has shape (total_num_subsamples, num_features)?
- Or should I use the simpler format, where X has shape (150, sequence_length, 2) and y some 2D shape? Will the setting of the time_steps option in the LSTM layer take care of looking back the specified number of time steps here?
Or... am I doing it wrong altogether?
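To make the two options concrete in terms of shapes, this is roughly what I think I would be passing in each case (all numbers are made up just for illustration):

import numpy as np

# dummy numbers, only to illustrate the shapes I have in mind
total_num_subsamples = 10000  # sum over all sequences of (len(seq) - time_steps)
sequence_length = 100         # length of one sequence (padded to a common length?)

# option 1: pre-windowed data
X1 = np.zeros((total_num_subsamples, 3, 2))  # (samples, time_steps, num_features)
y1 = np.zeros((total_num_subsamples, 2))     # (samples, num_features) -> 2D, which fit() accepts

# option 2: one sample per complete sequence
X2 = np.zeros((150, sequence_length, 2))     # (sequences, sequence_length, num_features)
y2 = np.zeros((150, sequence_length, 2))     # 3D -> this is what breaks fit() in my setup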
Lastly, suppose that I am mainly interested in predicting feature 0 at time step t. This feature has a limited number of values, i.e., it is one of multiple classes. In this case, would it make sense to represent each element of y as a one-hot vector encoding the class? Or would the LSTM not be able to learn this (because a different format is used for the item at time step t than for the items at the previous time steps), and would I have to represent all items, also in X, as one-hot vectors?
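To show what I mean for the one-hot case, something like this (the data is made up; I am assuming feature 0 is an integer class label in 0..4):

from keras.utils import to_categorical
import numpy as np

targets = np.array([[3, 0.7], [1, 0.2], [4, 0.9]])  # stand-in for my real y items
num_classes = 5                                      # made-up number of classes for feature 0

y_onehot = to_categorical(targets[:, 0], num_classes=num_classes)
# y_onehot has shape (num_samples, num_classes), one row per target item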