Tensorflow loss is diverging in my RNN

Question

I'm trying to get my feet wet with TensorFlow by solving this challenge: https://www.kaggle.com/c/integer-sequence-learning.

My work is based on these blog posts:

A complete working example, including my data, can be found at https://github.com/bottiger/Integer-Sequence-Learning (requires tensorflow and pandas). Run rnn-lstm-my.py to execute it; it prints a lot of debug information.

The approach is pretty straightforward. I load all of my training sequences, store their lengths in a vector, and store the length of the longest one in a variable I call "max_length".

In my training data I strip out the last element of every sequence and store it in a vector called "train_solutions".

Then I store all the sequences, padded with zeros, in a matrix with the shape [n_seq, max_length].
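The preparation described above can be sketched in plain Python as follows. This is only an illustration, not the code from the repository: the function name `prepare_sequences` and the toy data are hypothetical, and it assumes the lengths are measured after the last element has been stripped.

```python
def prepare_sequences(sequences):
    """Split off the last element of each sequence and zero-pad the rest.

    Assumes every sequence has at least two elements.
    """
    # The last element of each sequence becomes the prediction target.
    train_solutions = [seq[-1] for seq in sequences]
    stripped = [seq[:-1] for seq in sequences]

    # Remember each sequence's true length before padding.
    lengths = [len(seq) for seq in stripped]
    max_length = max(lengths)

    # Pad on the right with zeros, giving an n_seq x max_length matrix.
    padded = [seq + [0] * (max_length - len(seq)) for seq in stripped]
    return padded, lengths, train_solutions, max_length


# Toy data (hypothetical, not taken from the Kaggle set).
toy = [[1, 2, 3, 4], [2, 4, 8], [5, 10]]
padded, lengths, train_solutions, max_length = prepare_sequences(toy)
print(padded)
print(lengths)
print(train_solutions)
```

Keeping the true lengths next to the padded matrix is what lets later code tell real zeros in a sequence apart from padding.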

Since I want to predict the next number in a sequence my output should be a single number, and my input should be a sequence.

I use an RNN (tf.nn.rnn) with a BasicLSTMCell as the cell, with 24 hidden units. The output is fed into a basic linear model (xW+B), which should produce my prediction.

My cost function simply compares the number predicted by my model with the true solution. I calculate the cost like this:

    cost = tf.nn.l2_loss(tf_result - prediction)
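For reference, `tf.nn.l2_loss(t)` is documented as `sum(t ** 2) / 2`, with no averaging over the batch. The plain-Python sketch below mirrors that formula to show how the cost scales; the helper names and the numbers are made up for illustration.

```python
def l2_loss(values):
    """Half the sum of squares, mirroring the documented tf.nn.l2_loss formula."""
    return sum(v * v for v in values) / 2.0


def l2_cost(targets, predictions):
    """Cost in the same form as the line above: l2_loss(target - prediction)."""
    errors = [t - p for t, p in zip(targets, predictions)]
    return l2_loss(errors)


# Small errors give a small cost: ((1)^2 + (-2)^2) / 2
small = l2_cost([3.0, 5.0], [2.0, 7.0])

# The cost grows with the square of the error (hypothetical large target).
large = l2_cost([3.0e9], [0.0])

print(small, large)
```

Because the loss is an unnormalized sum of squares, both the cost and its gradient grow quickly when targets and predictions are far apart.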

The basic dimensions seem to be correct, because the code actually runs. However, after only one or two iterations some NaN values start to occur, which quickly spread until everything becomes NaN.

Here is the important part of the code, where I define and run the graph. I have omitted the loading and preparation of the data. Please look at the git repo for details about that, but I'm pretty sure that part is correct.

    import tensorflow as tf

    # NOTE: num_hidden, batch_size, max_length, frame_size, the train/val/test
    # arrays and get_input_dict() come from the omitted data-preparation code.

    cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True)

    num_inputs = tf.placeholder(tf.int32, name='NumInputs')
    seq_length = tf.placeholder(tf.int32, shape=[batch_size], name='SeqLength')

    # Define the input as a list (num elements = batch_size) of sequences
    inputs = [tf.placeholder(tf.float32, shape=[1, max_length], name='InputData') for _ in range(batch_size)]

    # Result should be a batch_size x 1 vector
    result = tf.placeholder(tf.float32, shape=[batch_size, 1], name='OutputData')

    tf_seq_length = tf.Print(seq_length, [seq_length, seq_length.get_shape()], 'SequenceLength: ')

    outputs, states = tf.nn.rnn(cell, inputs, dtype=tf.float32)

    # Print the output. The NaN first shows up here
    outputs2 = tf.Print(outputs, [outputs], 'Last: ', name="Last", summarize=800)

    # Define the model
    tf_weight = tf.Variable(tf.truncated_normal([batch_size, num_hidden, frame_size]), name='Weight')
    tf_bias = tf.Variable(tf.constant(0.1, shape=[batch_size]), name='Bias')

    # Debug the model parameters
    weight = tf.Print(tf_weight, [tf_weight, tf_weight.get_shape()], "Weight: ")
    bias = tf.Print(tf_bias, [tf_bias, tf_bias.get_shape()], "bias: ")

    # More debug info
    print('bias: ', bias.get_shape())
    print('weight: ', weight.get_shape())
    print('targets ', result.get_shape())
    print('RNN input ', type(inputs))
    print('RNN input len()', len(inputs))
    print('RNN input[0] ', inputs[0].get_shape())

    # Calculate the prediction
    tf_prediction = tf.batch_matmul(outputs2, weight) + bias
    prediction = tf.Print(tf_prediction, [tf_prediction, tf_prediction.get_shape()], 'prediction: ')

    tf_result = result

    # Calculate the cost
    cost = tf.nn.l2_loss(tf_result - prediction)

    #optimizer = tf.train.AdamOptimizer()
    learning_rate = 0.05
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)

    minimize = optimizer.minimize(cost)

    mistakes = tf.not_equal(tf.argmax(result, 1), tf.argmax(prediction, 1))
    error = tf.reduce_mean(tf.cast(mistakes, tf.float32))

    init_op = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(init_op)

    # Integer division so the value can be passed to range()
    no_of_batches = len(train_input) // batch_size
    epoch = 1

    val_dict = get_input_dict(val_input, val_output, train_length, inputs, batch_size)

    for i in range(epoch):
        ptr = 0
        for j in range(no_of_batches):
            print('eval w: ', weight.eval(session=sess))

            # inputs batch
            t_i = train_input[ptr:ptr+batch_size]

            # output batch
            t_o = train_output[ptr:ptr+batch_size]

            # sequence lengths
            t_l = train_length[ptr:ptr+batch_size]

            sess.run(minimize, feed_dict=get_input_dict(t_i, t_o, t_l, inputs, batch_size))

            ptr += batch_size

        print("result: ", tf_result)
        print("result len: ", tf_result.get_shape())
        print("prediction: ", prediction)
        print("prediction len: ", prediction.get_shape())

        c_val = sess.run(error, feed_dict=val_dict)
        print("Validation cost: {}, on Epoch {}".format(c_val, i))

        print("Epoch {}".format(i))

    print('test input: ', type(test_input))
    print('test output: ', type(test_output))

    incorrect = sess.run(error, get_input_dict(test_input, test_output, test_length, inputs, batch_size))

    sess.close()

And here are the first lines of the output it produces. You can see that everything becomes NaN: http://pastebin.com/TnFFNFrr (I could not post it here due to the body limit).

The first time I see the NaN is here:

I tensorflow/core/kernels/logging_ops.cc:79] Last: [0 0.76159418 0 0 0 0 0 -0.76159418 0 -0.76159418 0 0 0 0.76159418 0.76159418 0 -0.76159418 0.76159418 0 0 0 0.76159418 0 0 0 nan nan nan nan 0 0 nan nan 1 0 nan 0 0.76159418 nan nan nan 1 0 nan 0 0.76159418 nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]

I hope I made my problem clear. Thanks in advance

score 4 · Accepted Answer · edited May 23 '17 at 11:53

RNNs can suffer from exploding gradients, so you should clip the gradients of the RNN parameters. Look at this post:

How to effectively apply gradient clipping in tensor flow?
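As a rough illustration of what clipping does, here is a plain-Python sketch of clipping by global norm. It is not the answerer's code and not TensorFlow's implementation; the function and the gradient values are made up. In the TensorFlow API used in the question, the matching building blocks would be `optimizer.compute_gradients`, `tf.clip_by_global_norm` and `optimizer.apply_gradients` in place of the single `optimizer.minimize(cost)` call.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= clip_norm:
        # Already small enough: return an unchanged copy.
        return [list(vec) for vec in grads], global_norm
    # Shrink every component by the same factor so the direction is preserved.
    scale = clip_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm


# Hypothetical exploding gradients for two parameter groups.
grads = [[3.0e4, -4.0e4], [0.0]]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=5.0)
print(norm_before)
print(clipped)
```

Clipping by the global norm keeps the update direction and only limits its size, which is why it is a common choice for recurrent networks.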

answered Aug 05 '16 at 13:46

Vincent Renkens

score 0 · Answer 2 · answered Oct 14 '17 at 13:02

Use AdamOptimizer instead:

    optimizer = tf.train.AdamOptimizer()
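One possible reason this can help: Adam divides each update by a running estimate of the gradient's magnitude, so a single huge gradient does not translate into a huge step. The sketch below uses the textbook Adam update with the default hyperparameters of `tf.train.AdamOptimizer` (learning rate 0.001, beta1 0.9, beta2 0.999, epsilon 1e-8); TensorFlow's own implementation may differ in detail, and the gradient value is made up.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def sgd_step(param, grad, lr=0.05):
    """Plain gradient descent with the learning rate from the question."""
    return param - lr * grad


huge_grad = 1.0e6  # hypothetical exploding gradient
sgd_param = sgd_step(0.0, huge_grad)
adam_param, _, _ = adam_step(0.0, huge_grad, m=0.0, v=0.0, t=1)
print(sgd_param)   # a very large step
print(adam_param)  # a step of roughly the learning rate
```

This does not remove the underlying large gradients; it only limits how far a single update moves the parameters.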

mhbashari

score 0 · Answer 3 · answered Nov 17 '22 at 10:31

Try using an LSTM, which is a better-optimized version of the RNN, or use ReLU as the activation function. The plain RNN architecture has a drawback when used in a large network: the loss either stays flat or fluctuates wildly, so the model does not train properly. In an RNN this happens because of activation functions such as sigmoid or tanh. The problem is called the vanishing gradient problem when the loss stays constant, and the exploding gradient problem when it shows huge swings.
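To make the two failure modes concrete, here is a minimal sketch that is not from the answer: it assumes a scalar linear recurrence h_t = w * h_{t-1}, for which backpropagating through T steps multiplies the gradient by w a total of T times. The weights and step count are arbitrary.

```python
def backprop_factor(recurrent_weight, num_steps):
    """Gradient scale after backpropagating through num_steps of h_t = w * h_{t-1}."""
    factor = 1.0
    for _ in range(num_steps):
        factor *= recurrent_weight
    return factor


vanishing = backprop_factor(0.5, 50)  # |w| < 1: the gradient shrinks toward zero
exploding = backprop_factor(1.5, 50)  # |w| > 1: the gradient blows up
print(vanishing, exploding)
```

Real networks have matrices and nonlinearities instead of a single weight, but the same repeated multiplication is what drives both effects.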

The code for the LSTM is given below. [code snippet originally posted as an image]

Shreyas Shirdhankar

Welcome to Stack Overflow. [Please don't post screenshots of text](https://meta.stackoverflow.com/a/285557/354577). They can't be searched or copied, or even consumed by users of adaptive technologies like screen readers. Instead, paste the code as text directly into your question. If you select it and click the `{}` button or Ctrl+K the code block will be indented by four spaces, which will cause it to be rendered as code. – ChrisGPT was on strike Nov 19 '22 at 22:16
