
I am trying to implement the SGD weight-update functionality manually in Python (pycaffe) instead of using the solver.step() function. The goal is for the weight updates obtained from solver.step() to match the ones obtained by updating the weights manually.

The setup is as follows: use MNIST data. Set the random seed in solver.prototxt as random_seed: 52. Make sure momentum: 0.0, base_lr: 0.01 and lr_policy: "fixed". This is done so that I can implement the plain SGD update equation (without momentum, regularization, etc.), which is simply: W_{t+1} = W_t - lr * W_t_diff
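As a sanity check, that plain SGD step is just a couple of lines of NumPy (a minimal sketch; `W` and `W_diff` here are random stand-ins for a layer's weight blob and its gradient, not actual Caffe blobs):

    import numpy as np

    lr = 0.01                        # base_lr from solver.prototxt
    W = np.random.randn(20, 10)      # stand-in for a layer's weight blob (illustration only)
    W_diff = np.random.randn(20, 10) # stand-in for the gradient filled in by backward()

    W_next = W - lr * W_diff         # W_{t+1} = W_t - lr * W_t_diff (no momentum, no decay)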

Following are the two tests:

Test1: Use caffe's forward() and backward() to compute the forward and backward propagation. For each layer that contains weights I do:

    for k in weight_layer_idx:
        solver.net.layers[k].blobs[0].diff[...] *= lr # weights
        solver.net.layers[k].blobs[1].diff[...] *= lr # biases

Next, update the weights/biases as:

        solver.net.layers[k].blobs[0].data[...] -= solver.net.layers[k].blobs[0].diff
        solver.net.layers[k].blobs[1].data[...] -= solver.net.layers[k].blobs[1].diff

I run this for 5 iterations.

Test2: Run caffe's solver.step(5).

Now, what I expect is that the two tests should yield exactly the same weights after the five iterations.

I save the weight values after each of the above tests and compute the norm of the difference between the weight vectors produced by the two tests, and I see that they are not bit-exact. Can someone spot something that I might be missing?

Following is the entire code for reference:

import caffe
caffe.set_device(0)
caffe.set_mode_gpu()
import numpy as np
from copy import copy

niter = 5
solver = None
solver = caffe.SGDSolver('solver.prototxt')

# Automatic SGD: TEST2
solver.step(niter)
# save the weights to compare later
w_solver_step = copy(solver.net.layers[1].blobs[0].data.astype('float64'))
b_solver_step = copy(solver.net.layers[1].blobs[1].data.astype('float64'))

# Manual SGD: TEST1
solver = None
solver = caffe.SGDSolver('solver.prototxt')
lr = 0.01
momentum = 0.

# Get layer types
layer_types = []
for ll in solver.net.layers:
    layer_types.append(ll.type)

# Get the indices of layers that have weights in them
weight_layer_idx = [idx for idx,l in enumerate(layer_types) if 'Convolution' in l or 'InnerProduct' in l]

for it in range(1, niter+1):
    solver.net.forward()  # fprop
    solver.net.backward()  # bprop
    for k in weight_layer_idx:
        solver.net.layers[k].blobs[0].diff[...] *= lr
        solver.net.layers[k].blobs[1].diff[...] *= lr
        solver.net.layers[k].blobs[0].data[...] -= solver.net.layers[k].blobs[0].diff
        solver.net.layers[k].blobs[1].data[...] -= solver.net.layers[k].blobs[1].diff

# save the weights to compare later
w_fwdbwd_update = copy(solver.net.layers[1].blobs[0].data.astype('float64'))
b_fwdbwd_update = copy(solver.net.layers[1].blobs[1].data.astype('float64'))

# Compare
print "after iter", niter, ": weight diff: ", np.linalg.norm(w_solver_step - w_fwdbwd_update), "and bias diff:", np.linalg.norm(b_solver_step - b_fwdbwd_update)

The last line that compares the weights with the two tests produces:

after iter 5 : weight diff: 0.000203027766144 and bias diff: 1.78390789051e-05, whereas I expect this difference to be 0.0

Any ideas?

Aniket
  • Have you set [`weight_decay`](http://stackoverflow.com/q/32177764/1714410) to zero in solver.prototxt? – Shai Apr 06 '16 at 20:58
  •
    Yes, I forgot to mention previously but `weight_decay: 0.0` is set. What is happening is, if I run these two tests for only 1 iteration, I get exactly matching weight vectors from all layers, but not the subsequent iterations. – Aniket Apr 07 '16 at 00:09
  • Might be the momentum in the gradients. Try setting momentum to zero. – mkuse Aug 09 '16 at 15:10

1 Answer


You got it almost right; you just need to set the diffs to zero after each update. Caffe won't do this automatically, in order to give you the opportunity to implement batch accumulation (accumulating the gradients over multiple batches for one weight update, which can be helpful if your memory is not large enough for your desired batch size).
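Concretely, in your Test1 loop that just means resetting the diff blobs once they have been applied, along these lines (a sketch based on your code):

    for k in weight_layer_idx:
        solver.net.layers[k].blobs[0].diff[...] *= lr  # weights
        solver.net.layers[k].blobs[1].diff[...] *= lr  # biases
        solver.net.layers[k].blobs[0].data[...] -= solver.net.layers[k].blobs[0].diff
        solver.net.layers[k].blobs[1].data[...] -= solver.net.layers[k].blobs[1].diff
        # zero the diffs, otherwise the next backward() adds the new gradients on top
        solver.net.layers[k].blobs[0].diff[...] = 0
        solver.net.layers[k].blobs[1].diff[...] = 0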

Another possible problem is the use of cuDNN: its convolution implementation is non-deterministic (or, to be precise, the way Caffe is set up to use it is). In general this is not a problem, but in your case it causes slightly different results on each run and therefore different weights. If you compiled Caffe with cuDNN, you can simply set the mode to CPU to prevent this while testing.
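For the comparison it is enough to switch to CPU mode while running the test, e.g.:

    import caffe
    caffe.set_mode_cpu()  # avoids cuDNN's non-deterministic GPU convolutions during the test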

Also, the solver parameters have an impact on the weight updates. As you noted, you should make sure of the following (a minimal solver.prototxt sketch with these values follows the list):

  • lr_policy: "fixed"
  • momentum: 0
  • weight_decay: 0
  • random_seed: 52 # or any other constant
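
Putting these together, the relevant part of solver.prototxt could look like this (the net file name is just a placeholder):

    net: "train_val.prototxt"  # placeholder net definition
    base_lr: 0.01
    lr_policy: "fixed"
    momentum: 0.0
    weight_decay: 0.0
    random_seed: 52
    solver_mode: CPU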

In the net definition, be sure not to use learning-rate multipliers: often the biases are learned twice as fast as the weights, but that is not the behaviour you implemented. So make sure both multipliers are set to one in the layer definitions:

param {
  lr_mult: 1 # weight lr multiplier
}
param {
  lr_mult: 1 # bias lr multiplier
}

Last but not least, here is an example of what your code could look like with momentum, weight decay and lr_mult. In CPU mode, this produces the expected output (no differences):

import caffe
caffe.set_device(0)
caffe.set_mode_cpu()
import numpy as np

niter = 5
solver = None
solver = caffe.SGDSolver('solver.prototxt')

# Automatic SGD: TEST2
solver.step(niter)
# save the weights to compare later
w_solver_step = solver.net.layers[1].blobs[0].data.copy()
b_solver_step = solver.net.layers[1].blobs[1].data.copy()

# Manual SGD: TEST1
solver = None
solver = caffe.SGDSolver('solver.prototxt')
base_lr = 0.01
momentum = 0.9
weight_decay = 0.0005
lr_w_mult = 1
lr_b_mult = 2

# initialize the momentum history (one entry per parameter blob)
momentum_hist = {}
for layer in solver.net.params:
    m_w = np.zeros_like(solver.net.params[layer][0].data)
    m_b = np.zeros_like(solver.net.params[layer][1].data)
    momentum_hist[layer] = [m_w, m_b]

for i in range(niter):
    solver.net.forward()
    solver.net.backward()
    for layer in solver.net.params:
        # SGD with momentum and weight decay:
        # v_{t+1} = momentum * v_t + base_lr * lr_mult * (diff + weight_decay * data)
        momentum_hist[layer][0] = momentum_hist[layer][0] * momentum + \
            (solver.net.params[layer][0].diff + weight_decay * solver.net.params[layer][0].data) * base_lr * lr_w_mult
        momentum_hist[layer][1] = momentum_hist[layer][1] * momentum + \
            (solver.net.params[layer][1].diff + weight_decay * solver.net.params[layer][1].data) * base_lr * lr_b_mult
        # apply the update
        solver.net.params[layer][0].data[...] -= momentum_hist[layer][0]
        solver.net.params[layer][1].data[...] -= momentum_hist[layer][1]
        # zero the diffs so the next backward() does not accumulate on top of them
        solver.net.params[layer][0].diff[...] *= 0
        solver.net.params[layer][1].diff[...] *= 0

# save the weights to compare later
w_fwdbwd_update = solver.net.layers[1].blobs[0].data.copy()
b_fwdbwd_update = solver.net.layers[1].blobs[1].data.copy()

# Compare
print "after iter", niter, ": weight diff: ", np.linalg.norm(w_solver_step - w_fwdbwd_update), "and bias diff:", np.linalg.norm(b_solver_step - b_fwdbwd_update)
Erik B.