For a model that I intend to use for spontaneously generating sequences, training it sample by sample and keeping its state between samples feels most natural to me. I've managed to set this up in Keras after reading many helpful resources (SO: Q and two fantastic answers, Machine Learning Mastery 1, 2, 3).
First a sequence is constructed (in my case one-hot encoded as well). X and Y are produced from this sequence by shifting Y forward one time step. Training is done in batches of one sample and one time step.
For Keras this looks something like this:
from keras.models import Sequential
from keras.layers import LSTM, Dense

data = get_some_data()  # Shape (samples, features), one-hot encoded
Y = data[1:, :]  # Shape (samples-1, features)
X = data[:-1, :].reshape((-1, 1, data.shape[-1]))  # Shape (samples-1, 1, features)

model = Sequential()
model.add(LSTM(256, batch_input_shape=(1, 1, X.shape[-1]), stateful=True))
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

for epoch in range(10):
    model.fit(X, Y, epochs=1, batch_size=1, shuffle=False)  # one time step per batch, state carried across batches
    model.reset_states()  # clear the carried-over state between passes over the sequence
It does work. However, according to Task Manager, it only uses ~10% of my GPU resources, which are already quite limited. I'd like to improve this to speed up training, and increasing the batch size would allow for parallel computation.
In its current state the network presumably "remembers" things all the way back to the start of the training sequence. To train in batches, one would instead have to set up a window of the sequence and predict the single value that follows it, and do this for every position. Covering the full sequence this way means generating data of shape (samples-steps, steps, features), as sketched below. Since it wouldn't be uncommon for a sequence to span at least a couple of hundred time steps, that would mean a huge increase in the amount of data.
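Roughly what I imagine that windowed, larger-batch alternative would look like (a minimal sketch I have not actually run; the window length steps, the make_windows helper and the batch size are just placeholders, and the model becomes stateless because each window now carries its own context):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def make_windows(data, steps):
    # Build inputs of shape (samples-steps, steps, features) and the next-step targets
    X = np.stack([data[i:i + steps] for i in range(len(data) - steps)])
    Y = data[steps:]  # each target is the value that follows its window
    return X, Y

steps = 100                       # assumed window length
X, Y = make_windows(data, steps)  # data as above, shape (samples, features)

model = Sequential()
model.add(LSTM(256, input_shape=(steps, X.shape[-1])))  # stateless: context comes from the window
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit(X, Y, batch_size=64, epochs=10, shuffle=True)  # larger batches keep the GPU busier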
Between having to frame the problem a bit differently, the extra data that would have to be held in memory, and the small share of processing resources currently being utilised, I must ask:
- Is my intuition about the natural way of training and the role of statefulness correct?
- Are there other downsides to training with one sample per batch?
- Could the utilisation issues be resolved any other way?
- Finally, is there an accepted way of performing this kind of training to generate long sequences?
Any help is greatly appreciated; I'm fairly new to LSTMs.