
This is a really weird error, partly a follow-up to my previous question (Deconvolution layer FCN initialization - loss drops too fast).

However I initialize the Deconv layers (bilinear or Gaussian), I get the same situation:

1) Weights are updated; I checked this over multiple iterations. The weight blobs of the deconvolution/upsample layers all have the same shape: (2,2,8,8).

First of all, net_mcn.layers[idx].blobs[0].diff returns matrices of floats. For the last Deconv layer (upscore5) it produces two arrays with the same numbers but opposite signs, i.e. the weights should be moving at the same rate in opposite directions, yet the resulting weights are in fact almost identical!
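For reference, this is roughly the check I run (a minimal sketch; the prototxt/caffemodel paths here are placeholders for my actual files):

import numpy as np
import caffe

# Placeholder paths -- substitute the actual prototxt / snapshot.
net = caffe.Net('train_val.prototxt', 'snapshot_iter_55000.caffemodel', caffe.TRAIN)
net.forward()
net.backward()

# Weight gradients can be read either via net.layers[idx].blobs[0].diff
# or, by layer name, via net.params[name][0].diff.
for name in ['upscore', 'upscore2', 'upscore3', 'upscore4', 'upscore5']:
    w, dw = net.params[name][0].data, net.params[name][0].diff
    print(name, 'weights', w.shape,
          'min/max:', w.min(), w.max(),
          'grad min/max:', dw.min(), dw.max())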

Quite surprisingly, the remaining four deconv layers do not have this issue. So when I compare models, for example at iter=5000 and iter=55000, the deconv layer weights are very different.

Even more surprisingly, other layers (convolutional) change much less!
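The snapshot comparison itself is along these lines (the file names are again placeholders):

import numpy as np
import caffe

# Placeholder snapshot names for the two iterations being compared.
net_a = caffe.Net('deploy.prototxt', 'snapshot_iter_5000.caffemodel', caffe.TEST)
net_b = caffe.Net('deploy.prototxt', 'snapshot_iter_55000.caffemodel', caffe.TEST)

# Mean absolute weight change per parameterized layer between the two snapshots.
for name in net_a.params:
    wa = net_a.params[name][0].data
    wb = net_b.params[name][0].data
    print(name, 'mean |delta w| =', np.abs(wb - wa).mean())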

Here's the relevant bit of the initialization log, confirming that the deconv layers take part in backward computation (i.e. their weights are updated):

I0724 03:10:30.451787 32249 net.cpp:198] loss needs backward computation.
I0724 03:10:30.451792 32249 net.cpp:198] score_final needs backward computation.
I0724 03:10:30.451797 32249 net.cpp:198] upscore5 needs backward computation.
I0724 03:10:30.451802 32249 net.cpp:198] upscore4 needs backward computation.
I0724 03:10:30.451804 32249 net.cpp:198] upscore3 needs backward computation.
I0724 03:10:30.451807 32249 net.cpp:198] upscore2 needs backward computation.
I0724 03:10:30.451810 32249 net.cpp:198] upscore needs backward computation.
I0724 03:10:30.451814 32249 net.cpp:198] score_fr3 needs backward computation.
I0724 03:10:30.451818 32249 net.cpp:198] score_fr2 needs backward computation.
I0724 03:10:30.451822 32249 net.cpp:198] score_fr needs backward computation.

2) Blob diffs are all zeros for the deconvolution layers

The data diffs (in the sense of Finding gradient of a Caffe conv-filter with regards to input) are all zeros for almost ALL deconv layers for the entire duration of training, with a few exceptions (which are also near 0, e.g. -2.28945263e-09).

Convolution layer diffs look OK.

I see this as a paradox: the weights in the deconv layers are updated, but the diffs w.r.t. the neurons are all 0's (constant?).
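This is how I read those data diffs (same placeholder paths as in the first snippet; the blob names follow my prototxt):

import numpy as np
import caffe

net = caffe.Net('train_val.prototxt', 'snapshot_iter_55000.caffemodel', caffe.TRAIN)
net.forward()
net.backward()

# net.blobs[name].diff holds the gradient of the loss w.r.t. that blob's data.
for name in ['upscore', 'upscore2', 'upscore3', 'upscore4', 'upscore5']:
    d = net.blobs[name].diff
    print(name, d.shape,
          'non-zero entries:', np.count_nonzero(d),
          'max |diff|:', np.abs(d).max())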

3) Deconv features grow really large quickly

Far larger than in FCN and CRFasRNN, up to 5.4e+03; at the same time nearby pixels can have wildly varying values (e.g. 5e+02 and -300) for the same class.
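The magnitudes above come from checks like this (blob names again follow my prototxt, paths are placeholders):

import numpy as np
import caffe

net = caffe.Net('train_val.prototxt', 'snapshot_iter_55000.caffemodel', caffe.TEST)
net.forward()

# Value range of each upsampling blob after a forward pass.
for name in ['upscore', 'upscore2', 'upscore3', 'upscore4', 'upscore5']:
    x = net.blobs[name].data
    print(name, 'min:', x.min(), 'max:', x.max(), 'mean |x|:', np.abs(x).mean())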

4) Training and validation error go down, often very quickly

As I pointed out in the question linked above.

So, putting it all together, I don't understand what to make of it. If it is overfitting, then why does the validation error go down too?

The architecture of the network is

fc7->relu1->dropout->conv2048->conv1024->conv512->deconv1->deconv2->deconv3->deconv4->deconv5->crop->softmax_with_loss
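For what it's worth, a rough NetSpec sketch of that topology looks like this (every num_output, kernel_size and stride below is a placeholder rather than my real setting, and the fc7 stand-in is just a 1x1 convolution):

import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data  = L.Input(input_param=dict(shape=dict(dim=[1, 3, 500, 500])))
n.label = L.Input(input_param=dict(shape=dict(dim=[1, 1, 500, 500])))
n.fc7   = L.Convolution(n.data, num_output=4096, kernel_size=1)  # stand-in for the real fc7
n.relu1 = L.ReLU(n.fc7, in_place=True)
n.drop1 = L.Dropout(n.relu1, dropout_ratio=0.5, in_place=True)
n.score_fr  = L.Convolution(n.drop1,     num_output=2048, kernel_size=1)
n.score_fr2 = L.Convolution(n.score_fr,  num_output=1024, kernel_size=1)
n.score_fr3 = L.Convolution(n.score_fr2, num_output=512,  kernel_size=1)
prev = n.score_fr3
for name in ['upscore', 'upscore2', 'upscore3', 'upscore4', 'upscore5']:
    setattr(n, name, L.Deconvolution(
        prev, convolution_param=dict(num_output=2, kernel_size=8,
                                     stride=2, bias_term=False)))
    prev = getattr(n, name)
n.score_final = L.Crop(prev, n.data)
n.loss = L.SoftmaxWithLoss(n.score_final, n.label)

with open('net_sketch.prototxt', 'w') as f:
    f.write(str(n.to_proto()))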

EDIT: I was wrong, not all entries in net.blobs[...].diff are 0's; the non-zero entries form a sub-region whose size grows with the layer's spatial size. This seems to depend on the input data size.
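The edit is based on checks like this one, which locates the non-zero region of each blob's diff (same placeholder paths as before):

import numpy as np
import caffe

net = caffe.Net('train_val.prototxt', 'snapshot_iter_55000.caffemodel', caffe.TRAIN)
net.forward()
net.backward()

# For each 4-D blob, find the spatial region whose diff entries are non-zero.
for name, blob in net.blobs.items():
    d = blob.diff
    if d.ndim != 4:
        continue
    mask = np.abs(d).sum(axis=(0, 1)) > 0   # HxW mask of non-zero positions
    ys, xs = np.nonzero(mask)
    if ys.size:
        print(name, d.shape, 'non-zero diff rows %d-%d, cols %d-%d'
              % (ys.min(), ys.max(), xs.min(), xs.max()))
    else:
        print(name, d.shape, 'all-zero diff')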

Alex
  • Are you trying to do figure/ground segmentation? Is it possible that most pixels in your examples are labeled "background"? Is it possible you are facing a severe imbalance of the labels? – Shai Jul 24 '17 at 20:03
  • This is quite weird. The input is 4096x8x8 (fc7 in FCN). The non-zero elements of net.blobs[...].diff depend only on the HxW of the layer: for a BxCx11x11 layer, a 9x9 block in the upper left corner is non-0; for BxCx538x538, a 122x122 block somewhere in the middle. The network itself is very bad, I just wanted to check these gradients. Does this really somehow depend on input size? – Alex Jul 26 '17 at 04:06
  • Hence I added an edit. I'm sure Caffe explains it somewhere in detail, but I didn't find it. Also I can't find where and how data diffs are used. – Alex Jul 26 '17 at 05:00
  • Data diffs are used to update weights: to get the derivative w.r.t. a weight in an intermediate layer, you need the derivatives of the loss w.r.t. the layer's "top" in order to compute the derivatives of the weights. – Shai Jul 26 '17 at 05:05
  • SO doesn't allow LaTeX, but the partial derivative equation for a weight is dE/dw = dE/dy * dy/ds * ds/dw, i.e. the derivative of the loss w.r.t. the output, times the derivative of the output w.r.t. the linear sum of the inputs, times the derivative of that sum w.r.t. the weight. Which of these is blobs[...].diff? – Alex Jul 26 '17 at 05:16

0 Answers