
In Caffe we have a decay_ratio, usually set to 0.0005. All trainable parameters, e.g., the W matrix in FC6, are then decayed by W = W * (1 - 0.0005) after the gradient has been applied.
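In NumPy terms (a toy sketch with made-up numbers, not Caffe's actual implementation), that update is:

```python
import numpy as np

lr, decay_ratio = 0.01, 0.0005
W = np.array([[1.0, -2.0], [3.0, 0.5]])     # stand-in for the FC6 weight matrix
grad = np.array([[0.1, 0.2], [-0.3, 0.4]])  # gradient from backprop

W = W - lr * grad           # first apply the gradient update
W = W * (1 - decay_ratio)   # then decay: W = W * (1 - 0.0005)
```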

I have gone through many TensorFlow tutorial codes, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).

In my experience, I often run into numerical problems after 100k iterations of training.

I have also gone through related questions on Stack Overflow, e.g., How to set weight cost strength in TensorFlow? However, the solution there seems somewhat different from the Caffe implementation.

Does anyone have similar concerns? Thank you.

user2868512

2 Answers


The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.

When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.

Weight decay (I don't know how to TeX here, so excuse my pseudo-notation):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

L2-regularization:

loss = actual_loss + lambda * 1/2 sum(||w||_2 for w in network_params)

Computing the gradient of the extra term in L2-regularization gives lambda * w, so inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw

gives the same update as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)
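As a sanity check, here is a toy NumPy sketch of the claim above (a single scalar weight with made-up numbers, not TensorFlow code): with pure SGD the two rules coincide when weight_decay = learning_rate * lambda, while with momentum they diverge after the first step.

```python
import numpy as np

lr, lam = 0.1, 0.01   # learning rate and L2 coefficient (made-up values)
w0, dw = 2.0, 0.5     # a single weight and its gradient from the actual loss

# Weight decay: w[t+1] = w[t] - lr*dw - weight_decay*w[t], weight_decay = lr*lam
w_decay = w0 - lr * dw - (lr * lam) * w0

# L2-regularization: the extra gradient term is lam*w, then a plain SGD step
w_l2 = w0 - lr * (dw + lam * w0)

assert np.isclose(w_decay, w_l2)  # identical for pure SGD

# With momentum, the lam*w term gets folded into the velocity and smoothed,
# so after two steps (gradient frozen for simplicity) the rules disagree.
mu, v_a, v_b, w_a, w_b = 0.9, 0.0, 0.0, w0, w0
for _ in range(2):
    v_a = mu * v_a + (dw + lam * w_a)        # L2 term inside the gradient
    w_a = w_a - lr * v_a
    v_b = mu * v_b + dw                      # raw gradient only...
    w_b = w_b - lr * v_b - (lr * lam) * w_b  # ...decay applied separately

assert not np.isclose(w_a, w_b)
```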

That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is to write an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude work-arounds, though. My current code:

# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.
    ...

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(
            tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then sum up and optimize using SGD, which, as shown above, corresponds to actual weight decay.

Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.

Edit: see also this PR which just got merged into TF.

LucasB

This is a duplicate question:

How to define weight decay for individual layers in TensorFlow?

# Create your variables; register them in a custom 'weights' collection
# (in addition to the defaults) so they can be gathered below.
weights = tf.get_variable(
    'weights', shape=[256, 256],  # example shape
    collections=['weights', tf.GraphKeys.GLOBAL_VARIABLES,
                 tf.GraphKeys.TRAINABLE_VARIABLES])

with tf.variable_scope('weights_norm') as scope:
  weights_norm = tf.reduce_sum(
      input_tensor=WEIGHT_DECAY_FACTOR * tf.pack(
          [tf.nn.l2_loss(w) for w in tf.get_collection('weights')]
      ),
      name='weights_norm'
  )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses     
# ...

# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')

You can set whatever lambda value you want for the weight decay; the above just adds the L2 norm of the weights to the loss.
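For what it's worth, tf.nn.l2_loss(w) computes sum(w**2) / 2, so the stacked sum above is exactly the lambda * 1/2 * sum(||w||_2) penalty from the other answer. A quick NumPy equivalent (made-up weight tensors):

```python
import numpy as np

WEIGHT_DECAY_FACTOR = 0.0005
weights = [np.array([1.0, 2.0]), np.array([[3.0], [4.0]])]  # toy 'weights' collection

# tf.nn.l2_loss(w) is sum(w**2) / 2 for each tensor
l2_losses = [np.sum(w ** 2) / 2 for w in weights]
weights_norm = WEIGHT_DECAY_FACTOR * np.sum(l2_losses)
# equals WEIGHT_DECAY_FACTOR * 0.5 * (1 + 4 + 9 + 16) = 0.0075
```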

Steven
  • Wouldn't `tf.reduce_mean` make more sense than sum? Then the weight decay will be (more) invariant wrt network size – Toke Faurby Nov 10 '17 at 20:51
  • It doesn't really make too much sense to use reduce_mean, since it's being computed over the l2 of the weights. It would suggest that each weight vector should contribute just as much as every other, but some weights might correspond to a really large vector while others might correspond to small ones. Feel free to use it though; it might improve performance, as I haven't tested both approaches to compare. – Steven Nov 12 '17 at 01:13
  • I still feel that it is strange that the weight decay parameter should depend on the number of weight vectors, but I agree that using the mean makes even less sense. – Toke Faurby Nov 12 '17 at 17:34
  • This is wrong (as in: not the same as caffe) for any optimizer other than pure SGD. See the formula in OP, the loss you propose is the same with raw SGD, but when momentum and other advanced optimizers come into play, your loss and weight-decay in caffe do very different things. – LucasB Jun 06 '18 at 08:02