
I would like to develop a time-series approach to binary classification with a stateful LSTM in Keras.

Here is how my data look. I have many recordings, say N of them. Each recording consists of 22 time series of length M_i (i = 1, ..., N). I want to use a stateful model in Keras, but I don't know how to reshape my data, especially how I should define my batch_size.

Here is how I proceeded for the stateless LSTM. I created sequences of length look_back for all the recordings, so that I ended up with data of shape (sum_i(M_i - look_back), look_back, n_features=22).

Here is the function I used for that purpose:

import numpy as np

def create_dataset(feat, targ, look_back=1):
    """Build sliding windows of length look_back, each labelled with its last step's target."""
    dataX, dataY = [], []
    for i in range(len(targ) - look_back):
        a = feat[i:(i + look_back), :]           # window of shape (look_back, n_features)
        dataX.append(a)
        dataY.append(targ[i + look_back - 1])    # target of the window's last time step
    return np.array(dataX), np.array(dataY)

where feat is the 2-D data array of shape (n_samples, n_features) for one recording and targ is the corresponding target vector.
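For illustration, here is roughly how I apply this function per recording and stack the results (the names `recordings` and `targets` are just placeholders for my own lists of per-recording arrays):

import numpy as np

def build_stateless_dataset(recordings, targets, look_back=1):
    # recordings: list of N arrays of shape (M_i, 22)
    # targets:    list of N vectors of length M_i
    xs, ys = [], []
    for feat, targ in zip(recordings, targets):
        X, Y = create_dataset(feat, targ, look_back=look_back)
        xs.append(X)
        ys.append(Y)
    # final shapes: (sum_i(M_i - look_back), look_back, 22) and (sum_i(M_i - look_back),)
    return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)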

So, my question is: based on the data described above, how should I reshape the data for a stateful model while taking the batch notion into account? Are there any precautions to take?

What I want is to be able to classify each time step of each recording as seizure/not seizure.

EDIT: Another problem I thought about: my recordings contain sequences of different lengths. My stateful model could learn long-term dependencies on each recording, which means the batch_size would differ from one recording to another... How do I deal with that? Won't it cause generalization trouble when the model is tested on completely different sequences (test set)?

Thanks

MysteryGuy
  • Any special reason to use stateful? (It's not different from a regular layer in its calculations, it's also not suited for lookback windows). – Daniel Möller Aug 29 '18 at 11:53
  • Mainly to be able to learn long-term dependencies for each recording – MysteryGuy Aug 29 '18 at 11:56
  • My major problem is how to reshape the data so that I get coherent batches, as in your brilliant answer here: https://stackoverflow.com/questions/38714959/understanding-keras-lstms. There is another problem I just thought about; I have edited my question. – MysteryGuy Aug 29 '18 at 12:00
  • A `stateful=False` layer learns exactly the same thing a `stateful=True` one does. There is no difference. The only difference is: `stateful=False` thinks "this batch of inputs is independent from the previous batch of inputs", while `stateful=True` thinks "this batch of inputs contains the same sequences as the last batch and I'll assume there was no interruption". – Daniel Möller Aug 29 '18 at 12:00
  • Stateful is useful in two cases. One: if you can't fit an entire sequence (in terms of length in steps) in a single batch. Two: if you want something like predicting the future by taking the outputs as inputs, as if you never stopped processing. – Daniel Möller Aug 29 '18 at 12:07
  • I think the name "stateful" was chosen very badly. The word "stateless" suggests there is no memory state, which is not true. It should have been called something like `keep_states=True` (as in: don't reset the states after each batch). – Daniel Möller Aug 29 '18 at 12:10
  • But stateless requires introducing redundancy into the sequences (for example [1,2,3,4,5], then [2,3,4,5,6], [3,4,5,6,7]...), while stateful allows splitting the time steps across multiple batches without that kind of redundancy, right? – MysteryGuy Aug 29 '18 at 12:14
  • Could you elaborate an answer based on my question, just so I can properly try stateful and get an idea of how to reshape the data (especially the batch_size) for it? I guess I would have to loop over `train_on_batch`, as the number of batches is likely to differ from one recording to another. Thanks very much in advance! – MysteryGuy Aug 29 '18 at 12:17
  • That redundancy (sliding windows) is not necessary, I have no idea why they teach that. That is totally against the "long term dependencies" an LSTM can learn. – Daniel Möller Aug 29 '18 at 12:22
  • I'm really sure that both types of LSTM layers do and learn "exactly the same". The only difference is really "discarding or not" the states between batches. If you discard the states, each new batch will be seen as new sequences. If not, each new batch will be seen as new steps of the same sequences. – Daniel Möller Aug 29 '18 at 12:25
  • "That redundancy (sliding windows) is not necessary", that means I could cut the signal in non-overlapping segments (say length 50) and feed them inside a stateless LSTM of size `(n_samples/50, 50,22)` (divided n_samples by 50 as create non overlapping sequences) and get good results, right ? – MysteryGuy Aug 29 '18 at 12:31
  • I need to understand your data better to answer; see my answer below. You don't need to "cut the signal". But if you do cut it, then you need stateful=True. (If you don't have a reason to cut, don't cut.) – Daniel Möller Aug 29 '18 at 12:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179027/discussion-between-mysteryguy-and-daniel-moller). – MysteryGuy Aug 29 '18 at 12:45

1 Answer


I don't think you need a stateful layer for your purpose.

If you want long term learning, simply don't create these sliding windows. Have your data shaped as:

(number_of_independent_sequences, length_or_steps_of_a_sequence, variables_or_features_per_step)

I'm not sure I understand the wording in your question correctly. If a "recording" is like a "movie", a "song", a "voice clip" or something like that, then:

  • number of sequences = number of recordings

Following that idea of "recording", the time steps will be the "frames in a video", or the "samples" (time x sample_rate for 1 channel) in an audio file. (Be careful: "samples" in Keras are "sequences/recordings", while "samples" in audio processing are "steps" in Keras.)

  • time_steps = number of frames or audio samples

Finally, there is the number of features/variables. In a movie, it's like the RGB channels (3 features); in audio, it's the number of channels (2 in stereo). In other kinds of data they may be temperature, pressure, etc.

  • features = number of variables measured in each step

Having your data shaped like this will work for both stateful = True and False.
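As a minimal sketch, assuming for a moment that all recordings happen to have the same length M (handling different lengths is discussed further down), the stacking would look like this (the names `recordings` and `labels` are hypothetical):

import numpy as np

# recordings: list of N arrays of shape (M, 22), one per recording
# labels:     list of N vectors of length M, one class per step (see "Classifying every step" below)
X = np.stack(recordings)            # (N, M, 22)
Y = np.stack(labels)[..., None]     # (N, M, 1)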

These two methods of training are equivalent:

#with stateful=False
model.fit(X, Y, batch_size=batch_size)

#with stateful=True
for start in range(0, len(X), batch_size):
    model.train_on_batch(X[start:start+batch_size], Y[start:start+batch_size])
    model.reset_states()

There might be differences only in the way the optimizer updates are applied across batches.

For your case, if you can create such input data shaped as mentioned and you're not going to recursively predict the future, I don't see a reason to use stateful=True.

Classifying every step

For classifying every step, you don't need to create sliding windows, and it's also not necessary to use stateful=True.

Recurrent layers have an option to output all time steps, by setting return_sequences=True.

If you have an input with shape (batch, steps, features), you will need targets with shape (batch, steps, 1), which is one class per step.

In short, you need:

  • LSTM layers with return_sequences=True
  • X_train with shape (files, total_eeg_length, 22)
  • Y_train with shape (files, total_eeg_length, 1)

Hint: as LSTMs never classify the beginning very well, you can try using Bidirectional(LSTM(....)) layers.
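A minimal sketch of such a model (the layer size and the optimizer here are arbitrary choices, not recommendations):

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

model = Sequential()
# steps dimension left as None so sequences of any length are accepted
model.add(Bidirectional(LSTM(32, return_sequences=True), input_shape=(None, 22)))
# the Dense layer is applied to every time step -> one seizure probability per step
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])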

Inputs with different lengths

For using inputs with different lengths, you need to set input_shape=(None, features). Considering our discussion in the chat, features = 22.

You can then:

  • Load each EEG individually:

    • X_train as (1, eeg_length, 22)
    • Y_train as (1, eeg_length, 1)
    • Train each EEG separately with model.train_on_batch(array, targets).
    • You will need to manage epochs manually and use test_on_batch for validation data.
  • Pad the shorter EEGs with zeros or another dummy value until they all reach the max_eeg_length and use:

    • a Masking layer at the beginning of the model to discard the steps with the dummy value.
    • X_train as (eegs, max_eeg_length, 22)
    • Y_train as (eegs, max_eeg_length, 1)
    • You can train with a regular `model.fit(X_train, Y_train, ...)` (a rough sketch of both options follows this list).
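A rough sketch of both options, assuming `model` is built as above and `eegs` / `labels` are hypothetical lists holding one `(eeg_length, 22)` array and one `(eeg_length,)` target vector per file:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Option 1: one EEG per batch, epochs managed manually (10 epochs just as an example)
for epoch in range(10):
    for feat, targ in zip(eegs, labels):
        x = feat[np.newaxis, ...]        # (1, eeg_length, 22)
        y = targ[np.newaxis, :, None]    # (1, eeg_length, 1)
        model.train_on_batch(x, y)

# Option 2: pad everything to max_eeg_length and use a single fit()
# (the model should then start with a Masking(mask_value=0.) layer)
X = pad_sequences(eegs, padding='post', dtype='float32')                           # (eegs, max_eeg_length, 22)
Y = pad_sequences([t[:, None] for t in labels], padding='post', dtype='float32')   # (eegs, max_eeg_length, 1)
model.fit(X, Y, epochs=10, batch_size=8)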
Daniel Möller
  • What if I have multiclass classification? Do I just need to set the last dimension of `Y_train` to `nb_classes`, or is that last dimension (set to 1 in your answer) just the number of target variables, whatever the number of classes available for each target? Moreover, I guess that the validation and test data should have the same size as the training data? – MysteryGuy Aug 30 '18 at 07:12
  • Moreover, at the end of my model, how can I get metrics such as a confusion matrix? I am not sure how to aggregate them over all the test data... – MysteryGuy Aug 30 '18 at 08:21
  • For multiclass, the target shape will be (batch, length, classes). There is no need to have the "same size". You do need the "same number of classes", though. – Daniel Möller Aug 30 '18 at 12:25
  • The rest is custom. – Daniel Möller Aug 30 '18 at 12:25
  • I have a little problem: my dataset is very imbalanced, in the sense that each recording generally contains far more `0`s than `1`s. So when I apply the method you proposed, where the whole sequence is passed at once, I get poor results when predictions are made on new data... Is there a way of better balancing the data while still benefiting from the long sequences? – MysteryGuy Aug 30 '18 at 14:55
  • Fit with `class_weight={0: weight_for_0, 1: weight_for_1}`. – Daniel Möller Aug 30 '18 at 14:57