
When facing difficulties during training (NaNs, loss does not converge, etc.) it sometimes helps to look at a more verbose training log by setting `debug_info: true` in the `solver.prototxt` file.
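
If you prefer to flip the flag from Python instead of editing the file by hand, here is a minimal sketch (assuming pycaffe with its compiled protobuf bindings is installed; the file name `solver.prototxt` is a placeholder):

    # toggle the verbose per-layer logging flag in an existing solver definition
    from caffe.proto import caffe_pb2
    from google.protobuf import text_format

    solver_param = caffe_pb2.SolverParameter()
    with open('solver.prototxt') as f:              # hypothetical path to the solver file
        text_format.Merge(f.read(), solver_param)

    solver_param.debug_info = True                  # same effect as writing "debug_info: true"

    with open('solver.prototxt', 'w') as f:
        f.write(text_format.MessageToString(solver_param))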

The training log then looks something like:

I1109 ...]     [Forward] Layer data, top blob data data: 0.343971    
I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
. 
.
.
I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
.
.
.
I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

What does it mean?

Shai

1 Answer


At first glance you can see this log section divided into two parts: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation:
  • A training example (batch) is fed to the net and a forward pass outputs the current prediction.
  • Based on this prediction a loss is computed. The loss is then differentiated, and the gradient is estimated and propagated backward using the chain rule.

Caffe Blob data structure
Just a quick recap: Caffe uses the Blob data structure to store data/weights/parameters etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
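
As a quick illustration of both the forward-backward pass and the data/diff split, here is a minimal pycaffe sketch (the file name `train_val.prototxt` and the layer/blob name `conv1` are placeholders taken from this example and will differ in your net):

    import caffe
    caffe.set_mode_cpu()

    # load the training net definition (hypothetical file name)
    net = caffe.Net('train_val.prototxt', caffe.TRAIN)

    net.forward()    # fills the data part of every top blob with activations
    net.backward()   # fills the diff part with gradients (requires a loss layer)

    blob = net.blobs['conv1']     # a top blob: activations and their gradients
    print(blob.data.shape, blob.diff.shape)

    w, b = net.params['conv1']    # param blobs 0 and 1: the filters and the bias
    print(w.data.shape, w.diff.shape, b.data.shape)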

Forward pass

You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:

I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data

 I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114

That is, the current L2 norm of the convolution filter weights is 0.00899.
The current bias (param blob 1):

 I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

meaning that currently the bias is set to 0.

Last but not least, the "conv1" layer has an output, a "top" named "conv1" (how original...). The L2 norm of the output is

 I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037

Note that all L2 values for the [Forward] pass are reported on the data part of the Blobs in question.
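
If you want to reproduce these numbers yourself, here is a short sketch continuing with the `net` object from above (the blob/layer names come from this example, and it assumes the logged value is the L2 norm of the corresponding data array):

    import numpy as np

    # L2 norms of the data parts, matching the three [Forward] lines for "conv1"
    print(np.linalg.norm(net.blobs['conv1'].data))       # top blob conv1 data
    print(np.linalg.norm(net.params['conv1'][0].data))   # param blob 0 (filters)
    print(np.linalg.norm(net.params['conv1'][1].data))   # param blob 1 (bias)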

Loss and gradient
At the end of the [Forward] pass comes the loss layer:

I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

In this example the batch loss is 2031.85. The gradient of the loss w.r.t. fc1 is computed and written to the diff part of the fc1 Blob. The L2 magnitude of this gradient is 0.1245.
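
Purely as a sketch, the same two quantities can be read off after a forward-backward pass (the blob names 'loss' and 'fc1' come from this example log):

    import numpy as np

    net.forward()
    net.backward()
    print(float(net.blobs['loss'].data))           # the batch loss
    print(np.linalg.norm(net.blobs['fc1'].diff))   # L2 magnitude of dLoss/dfc1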

Backward pass
All the rest of the layers are listed in this part from top to bottom. You can see that the L2 magnitudes reported now are of the diff part of the Blobs (params and layers' inputs).

Finally
The last log line of this iteration:

[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

reports the total L1 and L2 magnitudes of both data and gradients.
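
To get a feel for what that summary aggregates, here is a rough sketch over all parameter blobs (again using the pycaffe `net` from above; this is my reading of the line, not code taken from Caffe, so the exact aggregation may differ):

    import numpy as np

    blobs = [p for ps in net.params.values() for p in ps]   # every param blob in the net
    l1_data = sum(np.abs(p.data).sum() for p in blobs)
    l1_diff = sum(np.abs(p.diff).sum() for p in blobs)
    l2_data = np.sqrt(sum((p.data ** 2).sum() for p in blobs))
    l2_diff = np.sqrt(sum((p.diff ** 2).sum() for p in blobs))
    print('L1 norm = ({:.2f}, {:.2f}); L2 norm = ({:.2f}, {:.2f})'.format(
        l1_data, l1_diff, l2_data, l2_diff))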

What should I look for?

  1. If you have NaNs in your loss, see at what point your data or diff turns into NaN: at which layer? At which iteration?

  2. Look at the gradient magnitudes; they should be reasonable. If you are starting to see values on the order of e+8, your data/gradients are starting to blow up. Decrease your learning rate!

  3. Check that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with higher variance.

  4. Look for activations (rather than gradients) going to zero. If you are using "ReLU" this means your inputs/weights lead you to regions where the ReLU gates are "not active", leading to "dead neurons". Consider normalizing your inputs to have zero mean, adding "BatchNorm" layers, or setting negative_slope in ReLU. A rough sketch automating these checks follows this list.
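
A rough diagnostic sketch along these lines (pycaffe again; purely illustrative, not part of Caffe, and the thresholds are arbitrary):

    import numpy as np

    def inspect(net):
        # 1 + 2: look for NaNs and exploding magnitudes in activations and gradients
        for name, blob in net.blobs.items():
            if np.isnan(blob.data).any() or np.isnan(blob.diff).any():
                print('NaN in blob', name)
            if np.abs(blob.diff).max() > 1e8:
                print('exploding diff in blob', name)
        # 3: look for all-zero gradients on the learnable parameters
        for name, params in net.params.items():
            for i, p in enumerate(params):
                if not np.any(p.diff):
                    print('zero diff for layer', name, 'param blob', i)
        # 4: look for activations that are (almost) all zero, e.g. "dead" ReLUs
        for name, blob in net.blobs.items():
            zero_frac = np.mean(blob.data == 0)
            if zero_frac > 0.95:
                print('blob', name, 'is', 100 * zero_frac, '% zeros')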

Shai
  • What would you change if diffs are zero? –  Nov 09 '16 at 17:14
  • @thigi it depends on the architecture: for instance, if you are using `ReLU` activations and the input `data` is zero, the gradient will be zero. Then you might consider changing the "working point" by e.g., subtracting the mean. Alternatively, you can replace `ReLU` with `PReLU`... – Shai Nov 09 '16 at 17:18
  • @Shai: Why is the L1/L2 norm used here? – Hossein Dec 22 '16 at 09:54
  • @Hossein what else should caffe report? – Shai Dec 22 '16 at 12:10
  • @Shai: I have no idea! I'm not coming from an AI/ML background, and thus a lot of things don't make sense to me. That's why I ask (newbish?) questions like these – Hossein Dec 22 '16 at 14:26
  • @Hossein well, L1 represents the sum of absolute values of the parameters, while L2 is the sum of squared values. L2 represents the "energy" of the weights. Both quantities represent the "magnitude" of the weights. – Shai Dec 22 '16 at 14:29
  • @Shai: are you sure about what you mentioned in the answer, that "if you are starting to see values with e+8 your data/gradients are starting to blow up"? (I mean regarding "e+8"). In my case I got something on the other extreme for the loss, like 1.47348e-08, i.e. e- instead of e+. – alfa_80 Mar 02 '17 at 10:32
  • `e-8` means something on the order of `0.00000001`: your gradients are vanishing, not exploding – Shai Mar 02 '17 at 10:35
  • @Shai: Thanks. If the gradient is vanishing, it's not learning either, right? What can I do about it? I think decreasing the learning rate is not helpful. What do you think? Perhaps the training set does not have enough variance (is not diverse enough)? – alfa_80 Mar 02 '17 at 10:51
  • what are 'bottom blob diff's? Blobs are updated during the feedforward step, diffs are derivatives w.r.t. the weights. Is it the ds/dw term in the product of partial derivatives dE/dw = dE/dy * dy/ds * ds/dw? – Alex Jun 27 '17 at 16:03
  • @alfa_80 are the gradients vanishing for all layers? – Shai Jun 27 '17 at 16:17
  • @Shai: Yes, for all layers. But it's resolved by now. – alfa_80 Jun 28 '17 at 10:27
  • @Shai: FYI, I was getting those vanishing gradients before because I didn't use a pre-trained model; getting it to work on my data from scratch would have needed a lot of fine-tuning effort. After using a pre-trained model, the issue was resolved. – alfa_80 Jun 28 '17 at 10:46
  • @alfa_80 FYI, are you using `"ReLU"` activations? you may get zero gradients if most of your outputs are in the negative part of the `"ReLU"` - there is no gradient there. Using `"PReLU"` or "leaky" `"ReLU"` may resolve this issue. – Shai Jun 28 '17 at 10:49
  • @Shai: Yes, I had employed ReLU activations as well before, but it was still problematic. I haven't tried "PReLU" or "leaky" "ReLU" yet, but it's a good point. – alfa_80 Jun 28 '17 at 10:54
  • @alfa_80 you might also want to look into [`SELU`](https://arxiv.org/pdf/1706.02515) activation. – Shai Jun 28 '17 at 10:56
  • @Shai: Thanks Shai. I have actually resolved the issue by using the pre-trained model for that kind of problem. But your point might be useful for others though. – alfa_80 Jun 28 '17 at 11:10
  • @Shai Can we set debug info during the TEST only? Also, is it possible to get per layer tensor output? – algoProg May 01 '18 at 01:11
  • @algoProg afaik the debug_info flag is global and cannot be changed between the train and test phases. You can use the Python interface for debugging; look at the "net surgery" example for more details – Shai May 01 '18 at 01:24
  • @Shai Thanks for the immediate reply. I asked the wrong question. I want to print out the layer outputs during classification. Is it possible? `debug_info` prints weights and biases, but not the layer outputs themselves, and it cannot be set during classification if I understand right. – algoProg May 01 '18 at 01:42
  • @algoProg do you mean you want to see the actual vector of class probabilities during test? – Shai May 01 '18 at 04:16
  • @Shai I want to print the "top" of the Blob (which I think is the output of each layer) during the classification. Specifically, looking at the cpp_classification (classification.cpp) example in Caffe, I want to print output vector of each layer (and not the features -- weights and biases, etc.). I do not want to mess with the Train/Test sequence, as I deal with the PreTrained model. So yes, for the last layer that would be the class probability vector, but that is not difficult as Classification yields that for me. But what about intermediate layers (Conv, etc.)? Is there an API for this? – algoProg May 01 '18 at 14:30
  • @algoProg do it in Python; look at the "net surgery" tutorial. If you have further questions, please post them as questions and not as comments – Shai May 01 '18 at 15:56
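
For reference, here is a minimal pycaffe sketch of that (the deploy prototxt/caffemodel file names and the input blob name `data` are placeholders; this goes through Python rather than the C++ classification.cpp route):

    import caffe
    import numpy as np

    caffe.set_mode_cpu()
    # hypothetical file names for a pre-trained model
    net = caffe.Net('deploy.prototxt', 'model.caffemodel', caffe.TEST)

    # feed one (already preprocessed) input and run the forward pass
    net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)
    net.forward()

    # print the output ("top") blob of every layer, including intermediate ones
    for name, blob in net.blobs.items():
        print(name, blob.data.shape, float(np.abs(blob.data).mean()))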