9

I can't get TensorFlow ReLU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.

I believe I'm following all the right general advice. For example I initialize my weights with

weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))

and use a slow training rate, e.g.,

tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)

But any network of appreciable depth results in NaN for the cost and at least some weights (at least in the summary histograms for them). In fact, the cost is often NaN right from the start (before training).

I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).

Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!

orome
  • There is nothing magical about ReLU. The error is in your code, so you should provide it. Why do you initialize the bias to 0.1 instead of 0? Why not simply tf.Variable + tf.zeros? – lejlot May 25 '16 at 22:42
  • @lejlot: The idea for 0.1 comes [from Google](https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html#weight-initialization). – orome May 25 '16 at 23:00
  • How many layers are there in your network? It seems like a gradient explosion problem. – Lifu Huang May 26 '16 at 02:01
  • @LifuHuang: The problem seems to appear with >4 layers. But even when I can avoid my NaN issue, ReLUs don't actually seem to work that well. – orome May 26 '16 at 12:14
  • @lejlot: Actually, the problem vanishes if I simply change stddev=0.1 to stddev=0.01. But as you say (and despite quite a bit of what I've read) ReLU is not magical. In fact, training is no faster and much more erratic. I'm not sure why there's so much hype about them. Is there a general set of changes I need to make to a successful model with sigmoid activations to get it working with ReLUs? Clearly (or at least it seems from my experience here) one is making sure all my weights are ["slightly positive"](https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html#weight-initialization). – orome May 26 '16 at 12:19
  • The "hype" is about many things. In particular, for actually deep networks (let's say of at least 10-20 hidden layers), ReLUs behave far better than sigmoids: they converge faster and to better solutions, and they are easier to implement (and faster to compute, which matters if you put this on GPUs). There are also some new initialization heuristics specifically suited to ReLUs (and different from the old sigmoid-based ones), which you can find in NIPS papers. – lejlot May 26 '16 at 19:38
  • @lejlot: Can you point me to particular NIPS papers on RELU initialization? – orome May 26 '16 at 19:53
  • https://arxiv.org/pdf/1502.01852.pdf – lejlot May 26 '16 at 19:58
  • @lejlot: So something like `stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))` – orome May 26 '16 at 21:47
  • @lejlot: If that's right (and it does seem to work much better, though I still get occasional explosions), I'd take it (your article link and some TF code illustrating an implementation) as the answer. – orome Jun 02 '16 at 12:59

3 Answers

7

Following He et al. (as suggested in lejlot's comment), initializing the weights of the l-th layer with a zero-mean Gaussian distribution whose standard deviation is sqrt(2 / nl), where nl is the flattened length of that layer's input vector, i.e., in TensorFlow,

stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))

results in weights that generally do not diverge.
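As a sanity check on why sqrt(2 / nl) is the right scale, here is a small NumPy sketch (not TensorFlow; the layer and batch sizes are arbitrary choices for the demo) showing that He-initialized weights roughly preserve the mean square of activations through a ReLU layer, so signals neither vanish nor explode with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, batch = 1024, 1024, 512  # arbitrary sizes for the demo

# He et al. initialization: zero-mean Gaussian, stddev = sqrt(2 / fan_in).
w = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

x = rng.normal(0.0, 1.0, size=(batch, n_in))   # unit-variance input
y = np.maximum(0.0, x @ w)                     # ReLU activation

# The mean square of the output stays close to that of the input (~1.0):
# the factor of 2 in the stddev exactly compensates for ReLU zeroing out
# (on average) half of each pre-activation's second moment.
print(np.mean(x**2), np.mean(y**2))
```

With stddev=0.1 and wide layers, by contrast, the same product grows or shrinks geometrically with each layer, which is consistent with the NaN blow-ups described in the question.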

orome
5

If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
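A quick NumPy illustration of this point (the sizes are hypothetical): with std=1e-4 the initial logits are nearly zero, so the softmax starts out close to uniform and the first cross-entropy gradients are correspondingly small:

```python
import numpy as np

rng = np.random.default_rng(0)

features, classes, batch = 256, 10, 32   # hypothetical sizes

# Very small initial weights for the layer feeding the softmax.
w_small = rng.normal(0.0, 1e-4, size=(features, classes))
x = rng.normal(0.0, 1.0, size=(batch, features))

logits = x @ w_small
# Numerically stable softmax.
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Every class probability starts near 1/classes = 0.1 ("high temperature"),
# so the initial loss is about log(10) and no single update is huge.
print(probs.min(), probs.max())
```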

5

Have you tried gradient clipping and/or a smaller learning rate?

Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):

# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)

# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())

# grads_and_vars is a list of (gradient, variable) tuples.  Do whatever you
# need to the 'gradient' part, e.g. cap it (skipping variables with no gradient):
capped_grads_and_vars = [(tf.clip_by_value(g, -5., 5.), v)
                         for g, v in grads_and_vars if g is not None]

# Ask the optimizer to apply the capped gradients.
train_op = opt.apply_gradients(capped_grads_and_vars)

Also, the discussion in this question might help.
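For reference, the two common clipping schemes can be sketched in plain NumPy (this illustrates the math only, not TensorFlow's implementation): tf.clip_by_value caps each element independently, while tf.clip_by_global_norm rescales all gradients by a shared factor so their combined norm stays below a threshold, preserving the update direction:

```python
import numpy as np

def clip_by_value(grads, lo=-5.0, hi=5.0):
    # Element-wise cap, like tf.clip_by_value applied to each gradient.
    return [np.clip(g, lo, hi) for g in grads]

def clip_by_global_norm(grads, clip_norm=5.0):
    # Rescale all gradients by one shared factor, like tf.clip_by_global_norm,
    # so the relative magnitudes across parameters are preserved.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, -8.0]), np.array([[10.0, 0.5]])]

print(clip_by_value(grads))            # entries outside [-5, 5] are capped
clipped, norm = clip_by_global_norm(grads)
print(norm)                            # original global norm (> 5 here)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # clipped norm <= 5
```

Global-norm clipping is often preferred for exploding gradients because it leaves the gradient direction intact; element-wise capping can distort it.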

topkara