I am trying to write a Keras model (using the TensorFlow backend) that uses an LSTM to predict labels for sequences, as you would in a part-of-speech tagging task. The model I have written returns nan as a loss for all training epochs and for all label predictions. I suspect I have my model configured incorrectly, but I can't figure out what I'm doing wrong.
The full program is below.
from random import shuffle, sample
from typing import Tuple, Callable

from numpy import arange, zeros, array, argmax, newaxis


def sequence_to_sequence_model(time_steps: int, labels: int, units: int = 16):
    from keras import Sequential
    from keras.layers import LSTM, TimeDistributed, Dense

    # An LSTM whose per-time-step outputs feed a dense layer applied at every time step.
    model = Sequential()
    model.add(LSTM(units=units, input_shape=(time_steps, 1), return_sequences=True))
    model.add(TimeDistributed(Dense(labels)))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model


def labeled_sequences(n: int, sequence_sampler: Callable[[], Tuple[array, array]]) -> Tuple[array, array]:
    """
    Create training data for a sequence-to-sequence labeling model.

    The features are an array of size samples * time steps * 1.
    The labels are a one-hot encoding of time step labels of size samples * time steps * number of labels.

    :param n: number of sequence pairs to generate
    :param sequence_sampler: a function that returns two numeric sequences of equal length
    :return: feature and label sequences
    """
    from keras.utils import to_categorical

    # Sample once just to find out how long the sequences are.
    xs, ys = sequence_sampler()
    assert len(xs) == len(ys)
    x = zeros((n, len(xs)), int)
    y = zeros((n, len(ys)), int)
    for i in range(n):
        xs, ys = sequence_sampler()
        x[i] = xs
        y[i] = ys
    x = x[:, :, newaxis]
    y = to_categorical(y)
    return x, y


def digits_with_repetition_labels() -> Tuple[array, array]:
    """
    Return a random list of 10 digits from 0 to 9. Two of the digits will be repeated. The rest will be unique.
    Along with this list, return a list of 10 labels, where the label is 0 if the corresponding digit is unique
    and 1 if it is repeated.

    :return: digits and labels
    """
    n = 10
    xs = arange(n)
    ys = zeros(n, int)
    shuffle(xs)
    # Overwrite the digit at position j with the digit at position i, then mark both positions as repeated.
    i, j = sample(range(n), 2)
    xs[j] = xs[i]
    ys[i] = ys[j] = 1
    return xs, ys


def main():
    # Train
    x, y = labeled_sequences(1000, digits_with_repetition_labels)
    model = sequence_to_sequence_model(x.shape[1], y.shape[2])
    model.summary()
    model.fit(x, y, epochs=20, verbose=2)
    # Test
    x, y = labeled_sequences(5, digits_with_repetition_labels)
    y_ = model.predict(x, verbose=0)
    x = x[:, :, 0]
    for i in range(x.shape[0]):
        print(' '.join(str(n) for n in x[i]))
        print(' '.join([' ', '*'][int(argmax(n))] for n in y[i]))
        print(y_[i])


if __name__ == '__main__':
    main()
My feature sequences are arrays of 10 digits from 0 to 9. My corresponding label sequences are arrays of 10 zeros and ones where zero indicates a unique digit and one indicates a repeated digit. (The idea is to create a simple classification task that incorporates long-distance dependencies.)
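For example, a single generated pair might look like this (the specific values are only illustrative; which digit is repeated, and where, is random):

    xs: 3 0 7 1 5 7 9 2 8 6    # the digit 7 appears twice
    ys: 0 0 1 0 0 1 0 0 0 0    # 1 marks the positions of the repeated digit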
Training looks like this:
Epoch 1/20
- 1s - loss: nan
Epoch 2/20
- 0s - loss: nan
Epoch 3/20
- 0s - loss: nan
And all the label array predictions look like this:
[[nan nan]
[nan nan]
[nan nan]
[nan nan]
[nan nan]
[nan nan]
[nan nan]
[nan nan]
[nan nan]
[nan nan]]
So clearly something is wrong.
The features matrix passed to model.fit is of dimensionality samples × time steps × 1. The labels matrix is of dimensionality samples × time steps × 2, where the 2 comes from a one-hot encoding of the labels 0 and 1.
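To make that concrete, here is the shape check I would expect to pass for the 1,000 training samples generated in main (an illustrative interactive session, assuming the functions above have been imported):

    >>> x, y = labeled_sequences(1000, digits_with_repetition_labels)
    >>> x.shape
    (1000, 10, 1)
    >>> y.shape
    (1000, 10, 2)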
I'm using a time-distributed dense layer to predict sequences, following the Keras documentation and posts like this and this. To the best of my knowledge, the model topology defined in sequence_to_sequence_model above is correct. The model summary looks like this:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 10, 16) 1152
_________________________________________________________________
time_distributed_1 (TimeDist (None, 10, 2) 34
=================================================================
Total params: 1,186
Trainable params: 1,186
Non-trainable params: 0
_________________________________________________________________
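As far as I can tell, the parameter counts in the summary are consistent with these layer shapes. A back-of-the-envelope check, using the standard LSTM parameter formula (the variable names here are just for illustration):

    units, input_dim, labels = 16, 1, 2
    lstm_params = 4 * (units * (input_dim + units) + units)   # 4 gates: 4 * (16 * 17 + 16) = 1152
    dense_params = units * labels + labels                    # weights + biases: 16 * 2 + 2 = 34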
Stack Overflow questions like this make it sound like nan results are an indicator of numeric problems: runaway gradients and the like. However, since I am working on a tiny data set and every number that comes back from my model is a nan, I suspect I'm not seeing a numeric problem, but rather a problem with how I have constructed the model.
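As a sanity check (a sketch of what one could run against the functions above, not something the program itself does), the generated data can be tested for bad values directly:

    from numpy import isnan, isinf

    x, y = labeled_sequences(1000, digits_with_repetition_labels)
    assert not isnan(x).any() and not isinf(x).any()  # features are plain integers 0-9
    assert not isnan(y).any() and not isinf(y).any()  # labels are a 0/1 one-hot encoding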
Does the code above have the right model/data shape for sequence-to-sequence learning? If so, why do I get nans everywhere?