Train and predict on variable length sequences

Question

Sensors (of the same type) scattered across my site report manually to my backend at irregular intervals. Between reports, the sensors aggregate events and then report them as a batch.

The following dataset is a collection of event sequences, collected in batches. For example, sensor 1 reported twice, with 2 events in the first batch and 3 events in the second, while sensor 2 reported once with 3 events.

I would like to use this data as my training data X:

sensor_id	batch_id	timestamp	feature_1	feature_n
1	1	2020-12-21T00:00:00+00:00	0.54	0.33
1	1	2020-12-21T01:00:00+00:00	0.23	0.14
1	2	2020-12-21T03:00:00+00:00	0.51	0.13
1	2	2020-12-21T04:00:00+00:00	0.23	0.24
1	2	2020-12-21T05:00:00+00:00	0.33	0.44
2	1	2020-12-21T00:00:00+00:00	0.54	0.33
2	1	2020-12-21T01:00:00+00:00	0.23	0.14
2	1	2020-12-21T03:00:00+00:00	0.51	0.13

My target y is a score calculated from all the events collected by a sensor,
i.e. score_sensor_1 = f([[batch1...],[batch2...]])

sensor_id	final_score
1	0.8
2	0.6

I would like to predict y each time a batch is collected, i.e. 2 predictions for a sensor with 2 reports.
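One way to turn the table above into one training sample per reported batch is sketched below. It rests on an assumption that the question does not state: the input for the prediction at batch k is every event the sensor has reported up to and including batch k, and each such prefix is labelled with the sensor's final score. The function name `build_prefix_samples` and the tuple layout of `rows` are hypothetical.

```python
from collections import defaultdict

# Rows copied from the example table: (sensor_id, batch_id, feature_1, feature_n).
# Timestamps are left out; rows are assumed to be sorted by time already.
rows = [
    (1, 1, 0.54, 0.33),
    (1, 1, 0.23, 0.14),
    (1, 2, 0.51, 0.13),
    (1, 2, 0.23, 0.24),
    (1, 2, 0.33, 0.44),
    (2, 1, 0.54, 0.33),
    (2, 1, 0.23, 0.14),
    (2, 1, 0.51, 0.13),
]
final_score = {1: 0.8, 2: 0.6}


def build_prefix_samples(rows, final_score):
    """Return one (sequence, label) pair per reported batch."""
    # Group events by sensor, then by batch, preserving row order.
    per_sensor = defaultdict(lambda: defaultdict(list))
    for sensor_id, batch_id, *features in rows:
        per_sensor[sensor_id][batch_id].append(features)

    samples = []
    for sensor_id, batches in per_sensor.items():
        seen = []  # all events reported by this sensor so far
        for batch_id in sorted(batches):
            seen = seen + batches[batch_id]
            samples.append((list(seen), final_score[sensor_id]))
    return samples


samples = build_prefix_samples(rows, final_score)
print([(len(seq), label) for seq, label in samples])
# [(2, 0.8), (5, 0.8), (3, 0.6)]
```

Sensor 1 yields two samples (2 events, then 5 events) and sensor 2 yields one, which matches the "2 predictions for a sensor with 2 reports" requirement.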

LSTM model:
I've started with an LSTM model, since I'm trying to predict on a time series of events. My first thought was to select a fixed input size and to zero-pad the input when the number of events collected is smaller than the input size, then mask the padded values:

model.add(Masking(mask_value=0., input_shape=(num_samples, num_features)))

For example:

sensor_id	batch_id	timestamp	feature_1	feature_n
1	1	2020-12-21T00:00:00+00:00	0.54	0.33
1	1	2020-12-21T01:00:00+00:00	0.23	0.14

Would produce the following input if selected length is 5:

[
 [0.54, 0.33],
 [0.23, 0.14],
 [0,0],
 [0,0],
 [0,0]
]
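For reference, the pad-or-truncate step described above could look roughly like this in plain Python. The helper name `pad_or_truncate` is made up for this illustration. Keep in mind that `Masking(mask_value=0.)` skips every time step whose features all equal 0, so a genuine all-zero event would be masked as well.

```python
def pad_or_truncate(events, length, num_features):
    """Zero-pad on the right, or cut off the tail, to reach a fixed length."""
    clipped = [list(e) for e in events[:length]]
    padding = [[0.0] * num_features for _ in range(length - len(clipped))]
    return clipped + padding


batch = [[0.54, 0.33], [0.23, 0.14]]
fixed = pad_or_truncate(batch, length=5, num_features=2)
for row in fixed:
    print(row)
```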

However, the variance in the number of events per sensor report in my training data is large: one report could contain 1000 events while another contains only 10. So if I select the average size (let's say 200), some inputs would consist mostly of padding, while others would be truncated and data would be lost.

I've heard about ragged tensors, but I'm not sure they fit my use case. How would one approach such a problem?
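To put numbers on that trade-off, here is a small back-of-the-envelope calculation using the figures from this paragraph (10 events, 1000 events, fixed length 200). The function name `padding_cost` is just for illustration.

```python
def padding_cost(num_events, fixed_length):
    """Return (fraction of the input that is padding, fraction of events dropped)."""
    padded = max(fixed_length - num_events, 0)
    dropped = max(num_events - fixed_length, 0)
    return padded / fixed_length, dropped / num_events


for n in (10, 1000):
    pad_frac, drop_frac = padding_cost(n, fixed_length=200)
    print(f"{n} events: {pad_frac:.0%} padding, {drop_frac:.0%} of events dropped")
# 10 events: 95% padding, 0% of events dropped
# 1000 events: 0% padding, 80% of events dropped
```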

You could re-frame your problem to have a fixed sequence length. Instead of trying to fit your network on `(batch, seq (between 10 and 1000), features)` you could try `(batch, 1, features)`. Your variable number of events would be passed in the batch dimension and would no longer impact your model during training. — Yoan B. M.Sc, Dec 21 '20 at 15:47
@YoanB.M.Sc thank you for your reply, can you please elaborate or maybe add code? — Shlomi Schwartz, Dec 21 '20 at 20:31
See answer below. Don't forget to reshape your labels as well to make sure each time step has a matching label. — Yoan B. M.Sc, Dec 21 '20 at 20:56
Variable-sized input sequences are quite common and can be handled by specifying the sequence dimension of the LSTM input shape as `None`. You just have to ensure that all sequences within a given batch have the same length; that's the trick. So if you pass each set of events with a batch size of 1 to such a network, you can handle variable-sized sequences without the pain of padding / truncating. Check my answer for more details. — Akshay Sehgal, Dec 23 '20 at 21:26
Also, just a quick question: is your output variable length as well, or is it fixed length irrespective of the length of the input sequences? I could modify my code example for that case as well, but IIUC you have fixed-shape outputs and variable-length inputs. — Akshay Sehgal, Dec 23 '20 at 21:56

score 1 · Answer 1 · answered Dec 21 '20 at 20:55

I don't have the specifics of your model, but the TF implementation of LSTM usually expects (batch, seq, features) as input.

Now let's assume this is one of your batch_id groups:

import numpy as np
data = np.zeros((15,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

You could reshape it with (1, 15, 5) and feed it to the model, but anytime your batch_id length vary your sequence length will vary too and your model expect a fix sequence.

Instead you could reshape your data before training so that the batch_id length is passed as the batch size:

data = data[:,np.newaxis,:] 

array([[[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.]]])

Same data, now with shape (15, 1, 5), but your model would be looking at a fixed length of 1 and the number of samples would vary.

Make sure to reshape your label as well.
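The answer does not spell out what the reshaped labels look like. One reading, consistent with the comment that each time step needs a matching label, is to repeat the sensor's score once per event. A minimal sketch under that assumption (the score value 0.8 is only a placeholder):

```python
import numpy as np

events = np.zeros((15, 5))   # one batch_id: 15 events, 5 features
score = 0.8                  # placeholder target for this sensor

# (15, 1, 5): each event becomes its own length-1 sequence.
x = events[:, np.newaxis, :]
# (15, 1): the same score repeated once per event.
y = np.full((len(events), 1), score)

print(x.shape, y.shape)
# (15, 1, 5) (15, 1)
```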

To my knowledge, since RNNs and LSTMs are applied at each time step and the state is only reset between batches, this should not impact the model's behavior.

Thanks, that is one direction I can take. I'm leaving the question open for now to see some alternatives. — Shlomi Schwartz, Dec 22 '20 at 16:44
@Yoan, interesting method. Just one correction in this line - `but anytime your batch_id length vary your sequence length will vary too and your model expect a fix sequence.` The model doesn't expect a fixed sequence length; only a batch does. This is because gradient updates need to be made once per batch. If you have a single sample per batch, the model (LSTM) doesn't expect any particular sequence length unless you specify it explicitly in the input shape. — Akshay Sehgal, Dec 23 '20 at 22:32

Akshay Sehgal · Answer 2 · 2020-12-23T23:54:30.273

Working with variable-sized input sequences is quite simple. While there is a restriction of having the same sized sequence within each batch, there is NO RESTRICTION of having variable-sized sequences between the batches. Using this to your advantage, you can simply set the input shape for the LSTM to (None, features) and use a batch_size of 1.

Let's create a generator that generates variable-length sequences of 2 features and a random float score that you seek as a function of these sequences, similar to your input data for the sensors.

import numpy as np

#Infinitely creates batches of dummy data
def generator():
    while True:
        length = np.random.randint(2, 10) #Variable length sequences
        x_train = np.random.random((1, length, 2)) #batch, seq, features
        y_train = np.random.random((1,1)) #batch, score
        yield x_train, y_train

next(generator())

#x.shape = (1,4,2), y.shape = (1,1)
(array([[[0.63841991, 0.91141833],
         [0.73131801, 0.92771373],
         [0.61298585, 0.6455549 ],
         [0.25893925, 0.40202978]]]),
 array([[0.05934613]]))

Above is an example of a length-4 sequence created by the generator; the next one has length 9.

next(generator())

#x.shape = (1,9,2), y.shape = (1,1)
(array([[[0.76006158, 0.27457503],
         [0.57739596, 0.75416962],
         [0.03029365, 0.29339812],
         [0.93866829, 0.79137367],
         [0.52739961, 0.11475738],
         [0.85832651, 0.19247399],
         [0.37098216, 0.48703114],
         [0.95846681, 0.15507787],
         [0.86945015, 0.70949593]]]),
 array([[0.02560889]]))

Now, let's create an LSTM based neural net that can work with these variable-sized sequences for each batch.

from tensorflow.keras import layers, Model, utils

inp = layers.Input((None, 2))
x = layers.LSTM(10, return_sequences=True)(inp)
x = layers.LSTM(10)(x)
out = layers.Dense(1)(x)

model = Model(inp, out)
utils.plot_model(model, show_layer_names=False, show_shapes=True)

Training this model with a batch size of 1:

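#A regression loss such as 'mse' may suit a real-valued score better; the log below was produced with binary_crossentropy.
model.compile(loss='binary_crossentropy', optimizer='adam')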
model.fit(generator(), steps_per_epoch=100, epochs=10, batch_size=1)
#Steps_per_epoch is to stop the generator from generating infinite batches of data per epoch.

Epoch 1/10
100/100 [==============================] - 1s 5ms/step - loss: 1.5145
Epoch 2/10
100/100 [==============================] - 0s 5ms/step - loss: 0.7435
Epoch 3/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7885
Epoch 4/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7384
Epoch 5/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7139
Epoch 6/10
100/100 [==============================] - 0s 5ms/step - loss: 0.7462
Epoch 7/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7173
Epoch 8/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7116
Epoch 9/10
100/100 [==============================] - 0s 4ms/step - loss: 0.6875
Epoch 10/10
100/100 [==============================] - 0s 4ms/step - loss: 0.7153

This is how you can work with variable-sized sequences as inputs. Padding/masking is only necessary for sequences that are part of the same batch.

Now, you could create a generator for your input data that generates one sequence of events as input to the model at one time, in which case you do not need to specify the batch_size explicitly since you are generating one sequence at a time already.

Do not specify the batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).
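A generator over real data could follow the same pattern as the dummy one. The sketch below assumes the events have already been grouped into one list of feature rows per sequence, with one score per sequence. The name `sequence_generator` is hypothetical, and the two sequences and scores are example values, not a prescribed way of labelling the data.

```python
import numpy as np


def sequence_generator(sequences, scores):
    """Yield (x, y) forever, one variable-length sequence per step."""
    while True:
        for seq, score in zip(sequences, scores):
            x = np.asarray(seq, dtype="float32")[np.newaxis, ...]  # (1, length, features)
            y = np.asarray([[score]], dtype="float32")             # (1, 1)
            yield x, y


sequences = [
    [[0.54, 0.33], [0.23, 0.14]],
    [[0.51, 0.13], [0.23, 0.24], [0.33, 0.44]],
]
scores = [0.8, 0.6]  # example targets, one per sequence

gen = sequence_generator(sequences, scores)
x, y = next(gen)
print(x.shape, y.shape)
# (1, 2, 2) (1, 1)
```

Because every yielded array already carries a leading batch dimension of 1, such a generator can be handed to `model.fit` together with `steps_per_epoch`, as in the dummy example.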

Or you could use the ragged tensors you were mentioning and provide a batch_size of 1 for each sequence. Personally, I prefer working with generators for training data as it gives you a lot more flexibility in pre-processing as well.

Interestingly, you could optimize this code further by bundling sequences of the same length together into batches and then passing a variable batch size. This would help if you have tons of data and can't afford to run a batch_size of 1 for each gradient update!
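A rough sketch of that bundling idea: group sequences by their exact length so that each group stacks into a dense array without any padding. The function name `bucket_by_length` and the sample values are made up for illustration.

```python
from collections import defaultdict

import numpy as np


def bucket_by_length(sequences, scores):
    """Group sequences of equal length into dense (x, y) batches."""
    buckets = defaultdict(list)
    for seq, score in zip(sequences, scores):
        buckets[len(seq)].append((seq, score))

    batches = {}
    for length, items in buckets.items():
        x = np.asarray([seq for seq, _ in items], dtype="float32")  # (n, length, features)
        y = np.asarray([[s] for _, s in items], dtype="float32")    # (n, 1)
        batches[length] = (x, y)
    return batches


sequences = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],
    [[0.2, 0.1], [0.4, 0.3]],
]
scores = [0.8, 0.6, 0.7]  # made-up targets

batches = bucket_by_length(sequences, scores)
for length, (x, y) in sorted(batches.items()):
    print(length, x.shape, y.shape)
# 2 (2, 2, 2) (2, 1)
# 3 (1, 3, 2) (1, 1)
```

With lengths ranging from 10 to 1000, exact-length buckets may end up holding very few sequences each, so how much this helps depends on the length distribution of the data.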

A word of caution: if your sequences are extremely long, I would recommend using truncated backpropagation through time (TBPTT).
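TBPTT limits how many time steps the gradient flows back through. A common preparation step is to cut each long sequence into consecutive chunks, as sketched below. This only covers the chunking. How the recurrent state is carried from one chunk to the next (for example with a stateful LSTM) is left out, and the chunk length of 200 is an arbitrary choice for the example.

```python
def split_into_chunks(sequence, chunk_length):
    """Cut one long sequence into consecutive chunks of at most chunk_length steps."""
    return [sequence[i:i + chunk_length] for i in range(0, len(sequence), chunk_length)]


long_sequence = [[float(t), 0.0] for t in range(1000)]  # 1000 dummy events, 2 features
chunks = split_into_chunks(long_sequence, chunk_length=200)
print(len(chunks), [len(c) for c in chunks])
# 5 [200, 200, 200, 200, 200]
```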

Hope this solves what you are looking for.

score 0 · Answer 3 · answered Dec 23 '20 at 23:40

There's no need to feed everything to the same LSTM, and when "variance of number of events per sensor report in my train data is large", a subnetwork approach should work better.

If you have N sensors with large variance on sampling interval, make N Inputs and LSTMs (in parallel), then concatenate features at a later stage. To avoid making many LSTMs, group sensors by expected sample lengths, e.g. 100-150, 900-1100, etc, and pad within each group to maximum respective length.
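The grouping-and-padding part of this suggestion could be sketched as follows. The upper bounds in `boundaries` are arbitrary example values, the helper name `group_and_pad` is hypothetical, and the sketch only prepares the data; building one input branch per group is a separate step.

```python
def group_and_pad(sequences, boundaries, num_features):
    """Assign each sequence to a length range, then right-pad to the group's max."""
    groups = {}
    for seq in sequences:
        # Pick the smallest upper bound that the sequence length fits under
        # (boundaries are assumed to be sorted in ascending order).
        key = next((b for b in boundaries if len(seq) <= b), None)
        if key is None:
            raise ValueError(f"sequence of length {len(seq)} exceeds all boundaries")
        groups.setdefault(key, []).append(seq)

    padded = {}
    for key, members in groups.items():
        longest = max(len(seq) for seq in members)
        padded[key] = [
            [list(e) for e in seq]
            + [[0.0] * num_features for _ in range(longest - len(seq))]
            for seq in members
        ]
    return padded


sequences = [
    [[1.0, 1.0]] * 3,
    [[2.0, 2.0]] * 5,
    [[3.0, 3.0]] * 40,
    [[4.0, 4.0]] * 50,
]
padded = group_and_pad(sequences, boundaries=(10, 100), num_features=2)
for upper, members in padded.items():
    print(upper, [len(seq) for seq in members])
# 10 [5, 5]
# 100 [50, 50]
```

Padding to the longest member of each group, rather than to one global maximum, keeps the amount of padding per sequence small.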

Avoid padding too much, as the typical right-padding comes with the serious disadvantage of shrinking the "learning signal" (BPTT unrolls right-to-left, so if the right side is mostly zeros, most of the learning goes into ignoring them rather than into feature extraction). Your batch_size is as permissive as your ability to group sequences by length as described above; it's thus about finding the right balance for your data (just do not use batch normalization with batch_size<32; prefer batch renormalization or other small-batch alternatives instead).

Lastly, for such sparse data, I'd recommend an attention mechanism (various implementations available).

Multi-branch LSTM example:

import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense, concatenate
from tensorflow.keras.models import Model

ipt1 = Input(shape=(None, 2))  # 2 sensors grouped, variable input length
x1   = LSTM(4)(ipt1)
ipt2 = Input(shape=(None, 3))  # 3 sensors grouped, variable input length
x2   = LSTM(6)(ipt2)

xc   = concatenate([x1, x2])
out  = Dense(1, activation='sigmoid')(xc)

model = Model([ipt1, ipt2], out)
model.compile('adam', 'binary_crossentropy')

x1 = np.random.randn(2, 4, 2)
x2 = np.random.randn(2, 5, 3)
y  = np.random.randint(0, 2, 2)

model.fit([x1, x2], y)

score 0 · Answer 4 · answered Dec 27 '20 at 08:16

Ragged tensors are the way to go:

Ragged tensors are the TensorFlow equivalent of nested variable-length lists. They make it easy to store and process data with non-uniform shapes.

You can create your ragged tensor in many ways, one is from nested lists.

import tensorflow as tf
# your first sensor data from your example above
data = [[[0.54 , 0.33],[0.23 , 0.14]],[[0.51,0.13],[0.23,0.24],[0.33, 0.44]]]
X = tf.ragged.constant(data)

<tf.RaggedTensor [[[0.5400000214576721, 0.33000001311302185],   [0.23000000417232513, 0.14000000059604645]], [[0.5099999904632568, 0.12999999523162842], [0.23000000417232513, 0.23999999463558197], [0.33000001311302185, 0.4399999976158142]]]>

Then in your model your first layer should be an input with ragged=True and a shape of [None , number_of_features]:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM

model = Sequential()
model.add(Input(shape=[None,2], dtype=tf.float32, ragged=True))
model.add(LSTM(16, activation='tanh'))
...
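One detail that may matter here: when `tf.ragged.constant` is given a nested Python list without further arguments, it treats every nested level as ragged, so the feature axis ends up with unknown size too. Passing `ragged_rank=1` keeps only the sequence axis ragged and leaves the feature axis dense, which is closer to what an input of shape `[None, 2]` describes. A small sketch to inspect the result (whether your Keras version accepts ragged inputs is a separate question and is not covered here):

```python
import tensorflow as tf

data = [
    [[0.54, 0.33], [0.23, 0.14]],
    [[0.51, 0.13], [0.23, 0.24], [0.33, 0.44]],
]

# ragged_rank=1 keeps only the sequence axis ragged; the feature axis stays dense.
X = tf.ragged.constant(data, ragged_rank=1, dtype=tf.float32)

print(X.shape)          # batch, variable sequence length, features
print(X.row_lengths())  # number of events in each sequence
```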
