
For a model that is intended to spontaneously generate sequences, I find that training it sample by sample and keeping state in between feels most natural. I've managed to construct this in Keras after reading many helpful resources. (SO: Q and two fantastic answers, Machine Learning Mastery 1, 2, 3)

First a sequence is constructed (in my case one-hot encoded as well). X and Y are produced from this sequence by shifting Y forward one time step relative to X. Training is done in batches of one sample and one time step.

For Keras this looks something like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense

data = get_some_data()   # Shape (samples, features)
Y = data[1:, :]          # Shape (samples-1, features)
X = data[:-1, :].reshape((-1, 1, data.shape[-1])) # Shape (samples-1, 1, features)

model = Sequential()
model.add(LSTM(256, batch_input_shape=(1, 1, X.shape[-1]), stateful=True))
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

for epoch in range(10):
    model.fit(X, Y, batch_size=1, shuffle=False)  # one pass over the sequence, in order
    model.reset_states()                          # clear h and c before the next pass
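
For context, here is roughly how the trained model is then used to generate new output one step at a time (a minimal sketch, not part of the code above; the seed and the generated length are arbitrary):

import numpy as np

model.reset_states()
x = X[:1]                                     # seed with one timestep, shape (1, 1, features)
generated = []
for _ in range(100):                          # generate 100 steps; length chosen arbitrarily
    probs = model.predict(x, batch_size=1)[0]
    one_hot = np.zeros_like(probs)
    one_hot[np.argmax(probs)] = 1.0           # greedy choice; sampling from probs also works
    generated.append(one_hot)
    x = one_hot.reshape((1, 1, -1))           # feed the prediction back in as the next input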

This setup does work. However, according to Task Manager it only uses ~10% of my GPU resources, which are already quite limited. I'd like to improve this to speed up training. Increasing the batch size would allow for parallel computation.

The network in its current state presumably "remembers" things even from the start of the training sequence. To train in batches one would instead need to set up a window of context and then predict one value, and do this for many values. Training on the full sequence this way would require data of shape (samples-steps, steps, features). I imagine it wouldn't be uncommon to have a sequence spanning at least a couple of hundred time steps, so this would mean a huge increase in the amount of data.
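
To make the memory concern concrete, here is a minimal sketch of how such windowed data would be built (the helper name and window length are purely illustrative):

import numpy as np

def make_windows(data, steps):
    # data: (samples, features); each window of `steps` values predicts the next value
    X = np.stack([data[i:i + steps] for i in range(len(data) - steps)])  # (samples-steps, steps, features)
    Y = data[steps:]                                                     # (samples-steps, features)
    return X, Y

# With e.g. steps=200, every value in the sequence is stored roughly 200 times over.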

So, torn between reframing the problem and storing far more data in memory on one hand, and utilising only a small fraction of my processing resources on the other, I must ask:

  • Is my intuition of the natural way of training and statefulness correct?
  • Are there other downsides to this training with one sample per batch?
  • Could the utilisation issues be resolved any other way?
  • Finally, is there an accepted way of performing this kind of training to generate long sequences?

Any help is greatly appreciated; I'm fairly new to LSTMs.

Felix

1 Answer


I do not know your specific application; however, sending in only one timestep of data at a time is surely not a good idea. You should instead give the LSTM the entire sequence of previously seen one-hot vectors (presumably words), pre-padding with zeros if necessary, since it appears you are working with sequences of varying length. Consider also using an embedding layer before your LSTM if these are indeed words. Read the documentation carefully.
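
A minimal sketch of what this could look like, assuming integer-encoded tokens and made-up values for the vocabulary size, maximum length, and the get_integer_sequences helper (none of which come from the question):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

vocab_size, max_len = 1000, 200               # illustrative values only

sequences = get_integer_sequences()           # hypothetical: list of token-id lists, id 0 reserved for padding
X = pad_sequences([s[:-1] for s in sequences], maxlen=max_len)   # pre-pads with zeros by default
Y = to_categorical([s[-1] for s in sequences], num_classes=vocab_size)

model = Sequential()
model.add(Embedding(vocab_size, 64, input_length=max_len, mask_zero=True))
model.add(LSTM(256))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, Y, batch_size=64)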

Your GPU utilization being low is not a problem in itself; you simply do not have enough data per batch to fully utilize all resources. Training over batches is a sequential process, and there is no real way to parallelize it, at least not in a way that is introductory and beneficial to what your goals appear to be. If you do give the LSTM more data per step, however, this will certainly increase your utilization.

stateful in an LSTM does not do what you think it does. An LSTM always remembers the sequence it is iterating over as it updates its internal hidden states, h and c. Furthermore, the weight transformations that "build" those internal states are learned during training. What stateful does is preserve the hidden state from the previous batch, index by index: the final hidden state of, say, the third element of one batch is used as the initial hidden state of the third element of the next batch, and so on. I do not believe this is useful for your application.
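
To illustrate the point, here is a minimal sketch (toy dimensions, untrained weights, not from the question) showing that with a batch size of 1, running a stateful LSTM one timestep per call gives the same final output as a non-stateful LSTM run once over the full sequence; stateful only carries h and c across calls, it adds no extra "memory":

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

timesteps, features, units = 5, 3, 4          # toy sizes for illustration
seq = np.random.rand(1, timesteps, features).astype("float32")

full = Sequential([LSTM(units, input_shape=(timesteps, features))])                  # sees the whole sequence
step = Sequential([LSTM(units, batch_input_shape=(1, 1, features), stateful=True)])  # one step per call
step.set_weights(full.get_weights())          # identical weights in both models

out_full = full.predict(seq)
for t in range(timesteps):                    # h and c carry over between these calls
    out_step = step.predict(seq[:, t:t + 1, :])

print(np.allclose(out_full, out_step, atol=1e-5))   # True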

There are downsides to training an LSTM with one sample per batch. In general, training with mini-batches increases stability. However, you are not so much training with one sample per batch as with one timestep per sample.


Edit (from comments)

If you use stateful and send the next 'character' of your sequence at the same index of the next batch, this is analogous to sending the full sequence of timesteps per sample. I would still recommend the approach described above, to improve the speed of the application and to be more in line with other LSTM applications. I see no disadvantage to sending the full sequence per sample instead of spreading it across batches; the advantages of speed, of being able to shuffle your input data per batch, and of being more readable/consistent would be worth the change IMO.
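
One way to send the full shifted sequence as a single sample while keeping roughly one copy of the data in memory is to predict at every timestep. This is a minimal sketch of my own (return_sequences and TimeDistributed are not mentioned above), reusing the shapes from the question:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

data = get_some_data()                            # (samples, features), as in the question
X = data[:-1].reshape((1, -1, data.shape[-1]))    # one sample containing the whole sequence
Y = data[1:].reshape((1, -1, data.shape[-1]))     # target: the same sequence shifted one step

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(None, data.shape[-1])))
model.add(TimeDistributed(Dense(data.shape[-1], activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, Y, epochs=10)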

modesitt
  • Thanks for the answer! I tried to be clear, but it seems I can't yet articulate these things very well. My specific application is - not exactly but analogous to - generating text one character at a time. To pick your brain a bit more: reading that github article on stateful, I'm not sure if I don't understand or just don't have the words. To have a stateful network is to preserve state between batches. I use this to feed the network one sample and label at a time (and yes, one timestep, because the time steps are in this case the batches). Correct? To Be Continued.. – Felix Aug 08 '18 at 19:51
  • ... Right Now. What I didn't get in your answer was "third element is sent as the initial hidden state in the third element of the next batch". Why not so that the last index is the first state of the next batch? That's how I imagined it. But in the case of a batch size of one, that would be equivalent, right? Many thanks for bearing with a newbie. – Felix Aug 08 '18 at 19:53
  • But I do have different, independent sequences that could be batched! That would be an idea worth looking into. – Felix Aug 08 '18 at 19:55
  • @Felix answer updated to try to answer your questions. – modesitt Aug 08 '18 at 19:56
  • Makes sense. Let me try to clarify one last thing. My understanding is that I have to slice up the sequence into shapes `X: (batches, steps, features) Y: (batches, 1, features)`. This will increase memory consumption heavily as I outlined in the original question, since each predicted Y must have a whole array of X to train on. Can this be rephrased so that the approach of having X and Y shifted one step, with all predictions made simultaneously for one sequence, is achieved inside one batch? Wow, that's a load of questions for one SO entry.. I'm almost embarrassed. – Felix Aug 08 '18 at 20:02