I want to use an LSTM to predict the next element in a sequence, but I am having trouble wrapping my head around how to formulate the problem and, more specifically, how to structure the data and present it to the model. I should say that I am fairly new to LSTMs: I followed the Coursera course on sequence models a while ago, but beyond the exercises there I never actually got around to trying things out. As a refresher I have been reading some of the (generally very helpful) Machine Learning Mastery blog posts, such as this one, this one, and this one, but they have only added to my confusion. This question seems related to mine, but the same applies: I keep getting confused when I try to translate it to my own problem.
Let me start out by describing my problem. I have 150 sequences of items of varying length (the shortest sequence contains about 100 items; the longest about 700). Each item can be represented by up to four features. I want to experiment with which features would work best; for the sake of the example, let's say I have two features representing each item.
What I want to do is to predict the next item given an n-long sequence of previous items: i.e., predict the item at time step t given the items at time steps [t-n, ..., t-2, t-1].
As mentioned, there are two (related) things that I am struggling with:
- How to structure the data.
- How to feed the data to the LSTM.
Let's start with point 1. As mentioned above, my initial data consists of 150 sequences of varying length. Each item i in each sequence is represented by two features: [f0_i, f1_i]. Let's say that the first sequence contains 100 items; this gives the following picture:
[[f0_0, f1_0], [f0_1, f1_1], [f0_2, f1_2], [f0_3, f1_3], [f0_4, f1_4], ... [f0_99, f1_99]]
So far, so good. Now suppose that I want to use a history of three time steps (previous items) to predict the next item. Does this mean that I have to restructure my training data to accommodate this? I.e., do I need to cut up each of the 150 sequences into multiple subsequences of time_steps items each, which I then collect in one superlist, like this (example for the first sequence):
X = [
# sequence 0, cut up
[[f0_0, f1_0], [f0_1, f1_1], [f0_2, f1_2]], # items 0-2 = training sample 0
[[f0_1, f1_1], [f0_2, f1_2], [f0_3, f1_3]], # items 1-3 = training sample 1
[[f0_2, f1_2], [f0_3, f1_3], [f0_4, f1_4]], # items 2-4 = training sample 2
...
[[f0_96, f1_96], [f0_97, f1_97], [f0_98, f1_98]] # items 96-98 = training sample 96
# sequence 1, cut up
...
# sequence 149, cut up
]
y = [
# labels for sequence 0, cut up
[f0_3, f1_3], # item 3 = target for training sample 0
[f0_4, f1_4], # item 4 = target for training sample 1
[f0_5, f1_5], # item 5 = target for training sample 2
...
[f0_99, f1_99] # item 99 = target for training sample 96
# labels for sequence 1, cut up
...
# labels for sequence 149, cut up
]
... where each element in X is a sample, and each element in y its target? (The whole list is later split into the training and the test set.) Or can I just add each complete sequence at once to this superlist and give that as input (leaving out the last item in X, as it has no y), like this (example for the first sequence):
X = [
# sequence 0 = training sample 0
[[f0_0, f1_0], [f0_1, f1_1], [f0_2, f1_2], ..., [f0_98, f1_98]]
# sequence 1 = training sample 1
# sequence 2 = training sample 2
...
# sequence 149 = training sample 149
]
y = [
# sequence 0 = targets 0
[[f0_1, f1_1], [f0_2, f1_2], [f0_3, f1_3], ..., [f0_99, f1_99]]
# sequence 1 = targets 1
# sequence 2 = targets 2
...
# sequence 149 = targets 149
]
In the second case I of course get far fewer sample sequences (150), but they are longer. However, the second case breaks the code downstream (see below), as model.fit() expects a 2D array for the y data. But how can y be a 2D array here?
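To make the first option concrete, this is roughly how I would generate the windowed data myself. This is just a sketch: make_windows is a name I made up, and sequences stands for my list of 150 item lists.

import numpy as np

def make_windows(sequences, time_steps=3):
    X, y = [], []
    for seq in sequences:
        for t in range(len(seq) - time_steps):
            X.append(seq[t:t + time_steps])  # items t .. t+time_steps-1
            y.append(seq[t + time_steps])    # the item to predict
    return np.array(X), np.array(y)

# X then has shape (total_num_subsamples, time_steps, 2),
# y has shape (total_num_subsamples, 2).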
Now to point 2. Here is my code, using X and y as described above (this case is for a stateful model, hence the for loop around the call to fit()):
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

batch_size = 1
time_steps = 3
num_features = len(train_X[0][0])  # = 2
epochs = 50
dropout = 0.2  # dropout rate (arbitrary value here)

model = Sequential()
model.add(LSTM(10, batch_input_shape=(batch_size, time_steps, num_features), stateful=True))
model.add(Dropout(dropout, seed=1))
model.add(Dense(num_features))
model.compile(loss='mean_squared_error', optimizer='Adam', metrics=['MSE'])

# stateful model: reset states manually between epochs
for i in range(epochs):
    history = model.fit(X, y, epochs=1, validation_split=0.2, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
It seems straightforward that I just set time_steps and num_features to 3 and 2, respectively. But what do I give as X and y?
- Should I use the preformatted data, which has the time steps 'encoded' in it, and where X has shape (total_num_subsamples, time_steps, num_features) and y has shape (total_num_subsamples, num_features)?
- Or should I use the simpler format, where X has shape (150, sequence_length, 2) and y some 2D shape? Will the setting of the time_steps option in the LSTM layer take care of looking back the specified number of time steps here?
Or... am I doing it wrong altogether?
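To make the two options concrete in terms of shapes, this is roughly what I think I would be passing in each case (all numbers are made up just for illustration):

import numpy as np

# dummy numbers, only to illustrate the shapes I have in mind
total_num_subsamples = 10000  # sum over all sequences of (len(seq) - time_steps)
sequence_length = 100         # length of one sequence (padded to a common length?)

# option 1: pre-windowed data
X1 = np.zeros((total_num_subsamples, 3, 2))  # (samples, time_steps, num_features)
y1 = np.zeros((total_num_subsamples, 2))     # (samples, num_features) -> 2D, which fit() accepts

# option 2: one sample per complete sequence
X2 = np.zeros((150, sequence_length, 2))     # (sequences, sequence_length, num_features)
y2 = np.zeros((150, sequence_length, 2))     # 3D -> this is what breaks fit() in my setup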
Lastly, suppose that I am mainly interested in predicting feature 0 at time step t. This feature has a limited number of values, i.e., it is one of multiple classes. In this case, would it make sense to represent each element of y as a one-hot vector encoding the class? Or would the LSTM not be able to learn this (because a different format is used for the item at time step t than for the items at the previous time steps), and would I have to represent all items, also in X, as one-hot vectors?
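To show what I mean for the one-hot case, something like this (the data is made up; I am assuming feature 0 is an integer class label in 0..4):

from keras.utils import to_categorical
import numpy as np

targets = np.array([[3, 0.7], [1, 0.2], [4, 0.9]])  # stand-in for my real y items
num_classes = 5                                      # made-up number of classes for feature 0

y_onehot = to_categorical(targets[:, 0], num_classes=num_classes)
# y_onehot has shape (num_samples, num_classes), one row per target item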