
I'm still figuring out LSTMs and trying to come up with an appropriate training routine and data shape.

A time series represents musical notes; let's call it a song. So I have data in the following form: the series consists of notes that are one-hot encoded, so it has shape (timesteps, features). Twelve copies of this series are made by transposing it (shifting its notes up), so one song then has shape (12, timesteps, features). Each of these twelve series should be trained on independently. In addition, there are multiple songs, and they vary in length.

I'd like to train an LSTM such that a prediction is made at every step of a series. The training data for one of the twelve series would then be X = series[:-1, :] and Y = series[1:, :], and similarly for the other versions.

# Example data, numbers not one-hot encoded for brevity
series = [1, 3, 2, 4, 7, 7, 10]
X = [1, 3, 2, 4, 7, 7]
Y = [3, 2, 4, 7, 7, 10]   # X shifted one step ahead: the next note at each step
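
For concreteness, here is a minimal numpy sketch of how X and Y could be built for all twelve variants at once. The sizes are placeholders, and implementing transposition as np.roll over the feature axis is my own assumption (it treats the features as a cyclic pitch axis):

import numpy as np

timesteps, n_features = 100, 20  # hypothetical sizes

# A random one-hot encoded song of shape (timesteps, features)
song = np.eye(n_features)[np.random.randint(0, n_features, size=timesteps)]

# Twelve transpositions: rolling the one-hot index shifts every note up by k steps
variants = np.stack([np.roll(song, k, axis=-1) for k in range(12)])  # (12, timesteps, features)

# Many-to-many pairs: Y holds the next note for every timestep of X
X = variants[:, :-1, :]  # (12, timesteps - 1, features)
Y = variants[:, 1:, :]   # (12, timesteps - 1, features)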

The twelve variations form a natural batch, since they all have the same length. But my question to you is: can the training be arranged so that these variants are fed to the network as a batch of twelve, while the training is still performed many-to-many (one prediction per timestep)?

Currently I have what seems to be a naïve approach for a single example. It feeds the timesteps to the network one by one, preserving state in between:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# X = (12 * timesteps, 1, features), Y = (12 * timesteps, features)
model = Sequential()
# Stateful LSTM fed one timestep per batch; state is carried across batches
model.add(LSTM(256, input_shape=(None, X.shape[-1]), batch_size=1, stateful=True))
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

for epoch in range(10):
    model.fit(X, Y, epochs=1, batch_size=1, shuffle=False)
    model.reset_states()  # clear the carried state between epochs

How might the mentioned training regime be achieved for a single song of twelve variations?

  • Let me make sure I understand you correctly: you want the predictions for all 12 variations to be generated at once, right? You don't want to generate 12 predictions independently and then aggregate them after prediction? – today Aug 15 '18 at 09:32
  • @today I'm sorry if the explanation is lacking. What I mean is that I want to be able to train on the 12 variations simultaneously (as a batch), and for each variation produce a prediction at each time step. Does that answer the question? – Felix Aug 15 '18 at 09:36
  • So you want to feed the model with a tensor of shape `(12, n_timesteps, n_features)` and get an output of shape `(12, n_timesteps, n_features)`, right? Essentially, given the first `t` timesteps you want to predict the `t+1` timestep? And each song may have a different length, i.e. `n_timesteps` is different for each song but `n_features` is the same for all of them? – today Aug 15 '18 at 09:41
  • @today Yes! Previously I've struggled with the `TimeDistributed` wrapper; it might have something to do with the training, but I'm pretty inexperienced. – Felix Aug 15 '18 at 09:48

1 Answer


As you mentioned in your comment, you need to wrap the LSTM layer inside TimeDistributed. This way, each of the 12 variations is processed individually. Further, since each feature vector is one-hot encoded, we add a Dense layer with a softmax activation as the last layer of our network:

from keras import models, layers

n_features = 20

# Input shape: (12 variants, timesteps, features); timesteps left as None for variable-length songs
model_input = layers.Input(shape=(12, None, n_features))
# TimeDistributed applies the same LSTM to each of the 12 variants independently;
# return_sequences=True yields one output per timestep (many-to-many)
x = layers.TimeDistributed(layers.LSTM(64, return_sequences=True))(model_input)
# Softmax over the feature axis turns each timestep's output into a note distribution
model_output = layers.Dense(n_features, activation='softmax')(x)

model = models.Model([model_input], [model_output])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.summary()

Here is the model summary:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 12, None, 20)      0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 12, None, 64)      21760     
_________________________________________________________________
dense_1 (Dense)              (None, 12, None, 20)      1300      
=================================================================
Total params: 23,060
Trainable params: 23,060
Non-trainable params: 0
_________________________________________________________________
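
To sanity-check the shapes, a quick smoke test with dummy data might look like this (the single song, the 50 timesteps, and the random one-hot notes are placeholders of my own):

import numpy as np

# One dummy song: batch of 1 song, 12 variants, 50 timesteps, one-hot features
X_dummy = np.eye(n_features)[np.random.randint(0, n_features, size=(1, 12, 50))]
Y_dummy = np.roll(X_dummy, -1, axis=2)  # next-step targets (the last step wraps around; fine for a shape test)

model.fit(X_dummy, Y_dummy, epochs=1)
print(model.predict(X_dummy).shape)  # expected: (1, 12, 50, 20)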

Note that this model may be too simple for your problem. You may want to stack more LSTM layers on top of each other and tune the parameters to get better accuracy, depending on the specific problem you are trying to solve (in the end, you must experiment!); but it gives you a rough idea of what a model might look like in this scenario. Although it may seem slightly off-topic, I also suggest you read the Seq2Seq tutorial on the official Keras blog to get more ideas in this regard.
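
For instance, a deeper variant with two stacked recurrent layers could be sketched like this (the layer sizes are arbitrary):

# Two TimeDistributed LSTMs stacked; both return full sequences to stay many-to-many
x = layers.TimeDistributed(layers.LSTM(128, return_sequences=True))(model_input)
x = layers.TimeDistributed(layers.LSTM(64, return_sequences=True))(x)
model_output = layers.Dense(n_features, activation='softmax')(x)
model = models.Model([model_input], [model_output])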

As a side note, if you are using a GPU, you can use the CuDNNLSTM layer instead of LSTM; it gives much better performance on GPUs.
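
The swap is a one-line change; note that CuDNNLSTM drops a few of LSTM's options (e.g. custom activations), but for the defaults used here it should behave as a drop-in replacement:

# cuDNN-backed variant of the recurrent layer above
x = layers.TimeDistributed(layers.CuDNNLSTM(64, return_sequences=True))(model_input)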

  • Thank you very much for the answer! This seems very promising, but I'll have to wait until I get home. To pick your brain a bit more: I've seen the `TimeDistributed` wrapper being used for the Dense layer. How does that differ from wrapping the LSTM layer? Many thanks! – Felix Aug 15 '18 at 10:18
  • @Felix [`TimeDistributed`](https://keras.io/layers/wrappers/#timedistributed) is not specific to `Dense` layer. It can be used with any layer like Conv2D or LSTM. It essentially applies the wrapped layer on temporal slices of the input (i.e. third dimension onwards). – today Aug 15 '18 at 10:26
  • Okay, so are wrapping the LSTM and wrapping the Dense equal operations, affecting only the training? This seems like a different question altogether, but *damn* there's much to know. – Felix Aug 15 '18 at 10:29
  • @Felix Of course NO! One applies the LSTM layer on the input and the other applies the Dense layer. Don't infer this only by looking at the output shapes! Don't forget that I added the `return_sequences=True` argument to the LSTM layer to get the output of all the timesteps. That's why the output shape of the LSTM layer is `(None, 12, None, 64)`. If we hadn't done so, the output shape would be `(None, 12, 64)`, i.e. only the last output of the LSTM would be returned. – today Aug 15 '18 at 10:35
  • Maybe I'll look into it before asking more stupid questions :D In any case, this was very helpful. Thank you. – Felix Aug 15 '18 at 10:39