
I am trying to write a Keras model (using the Tensorflow backend) that uses an LSTM to predict labels for sequences like you would in a part-of-speech labeling task. The model I have written returns nan as a loss for all training epochs and for all label predictions. I suspect I have my model configured incorrectly, but I can't figure out what I'm doing wrong.

The full program is here.

from random import shuffle, sample
from typing import Tuple, Callable

from numpy import arange, zeros, array, argmax, newaxis


def sequence_to_sequence_model(time_steps: int, labels: int, units: int = 16):
    from keras import Sequential
    from keras.layers import LSTM, TimeDistributed, Dense

    model = Sequential()
    model.add(LSTM(units=units, input_shape=(time_steps, 1), return_sequences=True))
    model.add(TimeDistributed(Dense(labels)))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model


def labeled_sequences(n: int, sequence_sampler: Callable[[], Tuple[array, array]]) -> Tuple[array, array]:
    """
    Create training data for a sequence-to-sequence labeling model.

    The features are an array of size samples * time steps * 1.
    The labels are a one-hot encoding of time step labels of size samples * time steps * number of labels.

    :param n: number of sequence pairs to generate
    :param sequence_sampler: a function that returns two numeric sequences of equal length
    :return: feature and label sequences
    """
    from keras.utils import to_categorical

    xs, ys = sequence_sampler()
    assert len(xs) == len(ys)
    x = zeros((n, len(xs)), int)
    y = zeros((n, len(ys)), int)
    for i in range(n):
        xs, ys = sequence_sampler()
        x[i] = xs
        y[i] = ys
    x = x[:, :, newaxis]
    y = to_categorical(y)
    return x, y


def digits_with_repetition_labels() -> Tuple[array, array]:
    """
    Return a random list of 10 digits from 0 to 9. One digit will appear twice. The rest will be unique.
    Along with this list, return a list of 10 labels, where the label is 0 if the corresponding digit is unique and 1
    if it is repeated.

    :return: digits and labels
    """
    n = 10
    xs = arange(n)
    ys = zeros(n, int)
    shuffle(xs)
    i, j = sample(range(n), 2)
    xs[j] = xs[i]
    ys[i] = ys[j] = 1
    return xs, ys


def main():
    # Train
    x, y = labeled_sequences(1000, digits_with_repetition_labels)
    model = sequence_to_sequence_model(x.shape[1], y.shape[2])
    model.summary()
    model.fit(x, y, epochs=20, verbose=2)
    # Test
    x, y = labeled_sequences(5, digits_with_repetition_labels)
    y_ = model.predict(x, verbose=0)
    x = x[:, :, 0]
    for i in range(x.shape[0]):
        print(' '.join(str(n) for n in x[i]))
        print(' '.join([' ', '*'][int(argmax(n))] for n in y[i]))
        print(y_[i])


if __name__ == '__main__':
    main()

My feature sequences are arrays of 10 digits from 0 to 9. My corresponding label sequences are arrays of 10 zeros and ones where zero indicates a unique digit and one indicates a repeated digit. (The idea is to create a simple classification task that incorporates long-distance dependencies.)

Training looks like this

Epoch 1/20
 - 1s - loss: nan
Epoch 2/20
 - 0s - loss: nan
Epoch 3/20
 - 0s - loss: nan

And all the label array predictions look like this

[[nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]]

So clearly something is wrong.

The features matrix passed to model.fit is of dimensionality samples × time steps × 1. The labels matrix is of dimensionality samples × time steps × 2, where the 2 comes from a one-hot encoding of the labels 0 and 1.
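As a sanity check on those shapes, here is a minimal NumPy-only sketch that builds a few samples the same way and inspects the resulting arrays. The helper name `sample_sequence` and the use of `np.eye` for the one-hot step (in place of `to_categorical`) are assumptions made for illustration, not part of the program above.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_sequence(n=10):
    """Hypothetical stand-in for digits_with_repetition_labels."""
    xs = rng.permutation(n)                      # digits 0..n-1 in random order
    i, j = rng.choice(n, size=2, replace=False)  # two distinct positions
    xs[j] = xs[i]                                # position j now repeats the digit at i
    ys = np.zeros(n, dtype=int)
    ys[[i, j]] = 1                               # mark both occurrences as repeated
    return xs, ys


pairs = [sample_sequence() for _ in range(4)]
labels = np.stack([ys for _, ys in pairs])       # (samples, time steps)

# Features: samples x time steps x 1
x = np.stack([xs for xs, _ in pairs])[:, :, np.newaxis]
# Labels: samples x time steps x 2, one-hot over the labels 0 and 1
y = np.eye(2, dtype=int)[labels]

print(x.shape, y.shape)
```

Every row of `y` along the last axis should contain exactly one 1, and each sequence should have exactly two time steps labeled as repeated.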

I'm using a time-distributed dense layer to predict sequences, following the Keras documentation and posts like this and this. To the best of my knowledge, the model topology defined in sequence_to_sequence_model above is correct. The model summary looks like this

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 10, 16)            1152      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 10, 2)             34        
=================================================================
Total params: 1,186
Trainable params: 1,186
Non-trainable params: 0
_________________________________________________________________

Stack Overflow questions like this make it sound like nan results are an indicator of numeric problems: runaway gradients and whatnot. However, since I am working on a tiny data set and every number that comes back from my model is a nan, I suspect I'm not seeing a numeric problem, but rather a problem with how I have constructed the model.

Does the code above have the right model/data shape for sequence-to-sequence learning? If so, why do I get nans everywhere?

W.P. McNeill

3 Answers

1

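By default, a Dense layer has no activation, so the model emits unbounded linear outputs, while categorical_crossentropy expects a probability distribution over the labels at each time step. If you specify a softmax activation, the nans go away. Change the following line in the code above: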

model.add(TimeDistributed(Dense(labels, activation='softmax')))
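To illustrate why the missing activation matters, here is a simplified NumPy sketch of cross-entropy applied to raw linear outputs versus softmax outputs. It is not Keras's actual implementation (which also rescales and clips its inputs); the functions `softmax` and `cross_entropy` and the sample values are assumptions written for this example.

```python
import numpy as np


def softmax(logits):
    # Subtract the per-step maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy(y_true, y_pred):
    # Naive cross-entropy: assumes y_pred holds probabilities in (0, 1].
    with np.errstate(invalid="ignore", divide="ignore"):
        per_step = -(y_true * np.log(y_pred)).sum(axis=-1)
    return float(per_step.mean())


# One sample, two time steps, two labels: what a linear Dense layer might emit.
raw = np.array([[[-1.3, 0.4], [2.0, -0.5]]])
y_true = np.array([[[1.0, 0.0], [0.0, 1.0]]])

loss_raw = cross_entropy(y_true, raw)               # log of a negative number
loss_softmax = cross_entropy(y_true, softmax(raw))  # log of valid probabilities

print(loss_raw, loss_softmax)
```

With raw outputs the logarithm sees negative values and the loss becomes nan; after softmax every step is a valid distribution and the loss is finite.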
W.P. McNeill
    But what about a regression model? I sometimes see the same issue when I run my model (and sometimes not). The nans appear from the first epoch, so this is not caused by exploding or vanishing gradients. Mine is a regression model. – Allohvk Jun 09 '21 at 05:58
0

If the model weights and the loss become NaN quickly, this indicates exploding gradients. I would add batch normalization after the LSTM layer and check whether it helps.

from keras.layers import BatchNormalization

# [...]
model.add(LSTM(units=units, input_shape=(time_steps, 1), return_sequences=True))
model.add(BatchNormalization())

For me (on a categorical classification problem) batch normalization solved the issue.

0

First of all, check the predictions before training. If the model already gives you NaNs, then there could be something wrong in your data too:

  • Check the dtype, try double precision (i.e. tf.float64).
  • Make sure your data is sound.
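A data soundness check along these lines could be sketched as follows. The function `check_training_data` is a hypothetical helper, and it assumes 3D feature and one-hot label arrays shaped like the ones in the question (samples × time steps × features or labels).

```python
import numpy as np


def check_training_data(x, y):
    """Return a list of human-readable problems found in x and y."""
    problems = []
    if x.shape[:2] != y.shape[:2]:
        problems.append("x and y disagree on samples/time steps")
    if not np.all(np.isfinite(x)):
        problems.append("x contains NaN or inf")
    if not np.all(np.isfinite(y)):
        problems.append("y contains NaN or inf")
    elif not np.allclose(y.sum(axis=-1), 1.0):
        # One-hot labels must sum to 1 at every time step.
        problems.append("y is not one-hot along the last axis")
    return problems


good_x = np.zeros((3, 10, 1), dtype=np.float64)
good_y = np.eye(2)[np.zeros((3, 10), dtype=int)]

bad_x = good_x.copy()
bad_x[0, 0, 0] = np.nan          # inject a NaN feature
bad_y = np.zeros((3, 10, 2))     # all-zero rows are not one-hot

print(check_training_data(good_x, good_y))
print(check_training_data(bad_x, bad_y))
```

Running the same finiteness check on `model.predict(x)` before calling `fit` would show whether the NaNs are present before any training step.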

Otherwise, you could:

  • Add LayerNormalization after the LSTM.
  • Try a different kernel initializer, and/or one with smaller variance.
  • Use a lower learning rate (Adam's default is 0.001).
  • Gradient clipping: when you compile the model, do model.compile(..., optimizer=Adam(clipnorm=1.0)). Note that clipnorm clips the norm of each weight's gradient separately; to clip by the global norm across all gradients, recent Keras versions provide global_clipnorm. A threshold of 1 is usually a good default.
  • Change the optimizer's epsilon value.
  • Change the optimizer to something simpler, such as SGD.
  • Try defining or editing the loss function; it's possible the default categorical_crossentropy doesn't handle 3D tensors well.
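For the gradient clipping suggestion, the difference between clipping each gradient tensor on its own and clipping by a global norm can be sketched in plain NumPy. The functions `clip_by_norm` and `clip_by_global_norm` are illustrative assumptions, not Keras APIs; as far as I understand the optimizer arguments, clipnorm corresponds to the first and global_clipnorm to the second.

```python
import numpy as np


def clip_by_norm(grad, max_norm):
    # Rescale one gradient tensor if its own L2 norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad


def clip_by_global_norm(grads, max_norm):
    # Rescale all tensors by one shared factor based on their joint L2 norm.
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / global_norm) if global_norm > 0 else 1.0
    return [g * scale for g in grads]


grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # norms 5.0 and 0.5

per_tensor = [clip_by_norm(g, 1.0) for g in grads]
global_clipped = clip_by_global_norm(grads, 1.0)

print(per_tensor)
print(global_clipped)
```

Per-tensor clipping leaves the small gradient untouched and shrinks only the large one, which changes the overall gradient direction; global clipping shrinks both by the same factor and preserves the direction.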
Luca Anzalone