
I'm trying to create a model in Keras to make numerical predictions from pictures. My model has a DenseNet121 convolutional base with a couple of additional layers on top. All layers except the last two are set to layer.trainable = False. My loss is mean squared error, since it's a regression task. During training I get loss: ~3, while evaluation on the very same batch of data gives loss: ~30:

model.fit(x=dat[0],y=dat[1],batch_size=32)

Epoch 1/1 32/32 [==============================] - 0s 11ms/step - loss: 2.5571

model.evaluate(x=dat[0],y=dat[1])

32/32 [==============================] - 2s 59ms/step 29.276123046875

I feed exactly the same 32 pictures during training and evaluation. I also calculated the loss using predicted values from y_pred=model.predict(dat[0]) and then computed the mean squared error with numpy. The result was the same as what I got from evaluation (i.e. 29.276123...).

There was a suggestion that this behavior might be due to BatchNormalization layers in the convolutional base (discussion on GitHub). Of course, all BatchNormalization layers in my model have been set to layer.trainable=False as well. Has anybody encountered this problem and figured out a solution?

Andrey Kite Gorin
  • Does your model include `Dropout` or `BatchNormalization` layers? If it has Dropout layers, then that's the likely cause of the difference. – today Jul 01 '18 at 12:28
  • Yes, it does. I have one trainable dropout layer in my model. But dropout layers usually have the opposite effect, making the loss on evaluation less than the loss during training. It also usually does not create such a big difference, a whole order of magnitude. – Andrey Kite Gorin Jul 01 '18 at 12:33
  • Not necessarily! Although in the dropout layer some of the neurons are dropped, bear in mind that the output is scaled back according to the dropout rate. At inference time (i.e. test time) dropout is removed entirely, and considering that you have trained your model for just one epoch, the behavior you saw may happen. Experiment yourself: just set the `trainable` parameter of the `Dropout` layer(s) to `False` and see whether this happens or not. – today Jul 01 '18 at 13:18

2 Answers


Looks like I found the solution. As I suggested, the problem is with the BatchNormalization layers. They do three things:

  1. subtract mean and normalize by std
  2. collect statistics on mean and std using running average
  3. learn two additional trainable parameters per node (scale and shift).

When one sets trainable to False, these two parameters freeze, and the layer also stops collecting statistics on the mean and std. But it looks like the layer still performs normalization during training using the statistics of the current training batch. Most likely it's a bug in Keras, or maybe they did it on purpose for some reason. As a result, the forward-pass computations during training differ from those at prediction time, even though the trainable attribute is set to False.
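To make the difference concrete, here is a small numpy sketch (not Keras's actual implementation; `gamma`, `beta` and the moving statistics are illustrative) contrasting the two forward passes:

```python
import numpy as np

# Sketch: how a BatchNormalization layer transforms a batch
# (1) in training mode, using the current batch's mean/variance, versus
# (2) in inference mode, using frozen (pretrained) moving statistics.
# gamma/beta are the two learned parameters per node mentioned above.

def bn_train_mode(x, gamma, beta, eps=1e-3):
    # Normalize with statistics computed from the batch itself
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def bn_inference_mode(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    # Normalize with the stored moving statistics
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=2.0, size=(32, 4))  # batch far from pretrained stats
gamma, beta = np.ones(4), np.zeros(4)
moving_mean, moving_var = np.zeros(4), np.ones(4)  # "pretrained" statistics

train_out = bn_train_mode(x, gamma, beta)
infer_out = bn_inference_mode(x, gamma, beta, moving_mean, moving_var)

# The two forward passes disagree whenever the batch statistics differ from
# the moving statistics -- the same mismatch that inflates the evaluation
# loss relative to the training loss.
print(np.abs(train_out - infer_out).mean())
```

The bigger the gap between your data's statistics and the pretrained moving statistics, the bigger the disagreement between the two modes.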

There are two possible solutions I can think of:

  1. Set all BatchNormalization layers to trainable. In this case these layers will collect statistics from your dataset instead of using the pretrained ones (which can be significantly different!), so the BatchNorm layers will be adjusted to your custom dataset during training.
  2. Split the model in two parts: model = model_base + model_top. Then use model_base to extract features with model_base.predict(), feed these features into model_top, and train only model_top.

I've just tried the first solution and it looks like it's working:

model.fit(x=dat[0],y=dat[1],batch_size=32)

Epoch 1/1
32/32 [==============================] - 1s 28ms/step - loss: 3.1053

model.evaluate(x=dat[0],y=dat[1])

32/32 [==============================] - 0s 10ms/step
2.487905502319336

This was after some training; one needs to wait until enough statistics on the mean and std have been collected.

I haven't tried the second solution yet, but I'm pretty sure it's going to work, since forward propagation during training and prediction will be the same.
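A minimal sketch of how that split could look, using a tiny Dense stand-in for the real DenseNet121 base (the names `model_base` and `model_top` follow the answer; the layer sizes and data shapes are illustrative):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for the frozen convolutional base; in the real setting this would
# be e.g. keras.applications.DenseNet121(include_top=False, pooling='avg')
# with pretrained weights.
model_base = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(8,)),
])
model_base.trainable = False

x = np.random.rand(32, 8).astype('float32')
y = np.random.rand(32, 1).astype('float32')

# Extract features once, in inference mode, so any BatchNorm/Dropout in the
# base behaves exactly as it will at prediction time.
features = model_base.predict(x, verbose=0)

# Train only the top on the precomputed features.
model_top = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(16,)),
    layers.Dense(1),
])
model_top.compile(loss='mse', optimizer='adam')
model_top.fit(features, y, batch_size=32, epochs=1, verbose=0)

# Training and evaluation now see the identical forward pass through the base.
top_loss = model_top.evaluate(features, y, verbose=0)
print(top_loss)
```

A side benefit of this split is speed: the expensive base runs over the dataset only once, instead of once per epoch.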

Update. I found a great blog post where this issue is discussed in full detail. Check it out here

Andrey Kite Gorin
  • Actually, I don't know what I am missing here: Is `3.1053` the loss of last epoch during training? If this is the case, it should be the same as the evaluation loss on the same data, right? But it is not: `3.1053 != 2.4879`. Why? – today Jul 01 '18 at 16:05
  • Yes, one is after one step of training (1 epoch with batch_size=data_set_size) and the other is the evaluation. They are different for two reasons: 1- the one you mentioned earlier in your comment about the dropout layer; 2- not enough statistics have been collected in the BatchNormalization layers that are now set to trainable. But now it is logical that the loss with dropout is higher than without. I'm training the model further and will see the dynamics of this difference. – Andrey Kite Gorin Jul 01 '18 at 16:17
  • Just to clarify, the training step you see here I performed just for testing purposes. The real training is done separately on a big dataset with more than 300,000 samples. – Andrey Kite Gorin Jul 01 '18 at 16:23
  • It does not matter. If it says that the loss on this batch is equal to say 3 (we have one single batch in this case), then if we immediately evaluate the model on that batch the loss should be exactly 3, right? Or maybe I am missing something here... – today Jul 01 '18 at 16:27
  • During training dropout is applied, but during evaluation it is not. Also, BatchNorm layers compute differently in the evaluation forward propagation and during training. So such a small difference in losses is not that strange. What was strange was the order-of-magnitude difference that I had in the beginning. – Andrey Kite Gorin Jul 01 '18 at 16:34
  • I tried setting the batch normalization layers to trainable without success as explained in [the update section of my question](https://stackoverflow.com/questions/55569181/why-is-accuracy-from-fit-generator-different-to-that-from-evaluate-generator-in). Did I do something wrong? Also could you provide en example for your second suggestion? (_2. Split the model in two parts_) – Sophie Crommelinck Apr 11 '19 at 12:15
  • I've read your issue and I think it's not related to this one. The difference in accuracy you observe is small; it could be due to dropout layers that are switched off during evaluation, or some other reason. – Andrey Kite Gorin Apr 11 '19 at 12:47

But dropout layers usually create opposite effect making loss on evaluation less than loss during training.

Not necessarily! Although in the dropout layer some of the neurons are dropped, bear in mind that the output is scaled back according to the dropout rate. At inference time (i.e. test time) dropout is removed entirely, and considering that you have trained your model for just one epoch, the behavior you saw may happen. Don't forget that since you are training the model for just one epoch, only a portion of the neurons were dropped in the dropout layer, but all of them are present at inference time.

If you continue training the model for more epochs, you can expect the training loss and the test loss (on the same data) to become more or less the same.

Experiment yourself: just set the trainable parameter of the Dropout layer(s) to False and see whether this happens or not.
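The scaling mentioned above can be sketched in a few lines of numpy (this is "inverted dropout", the scheme Keras implements; the array size and rate here are illustrative):

```python
import numpy as np

# Inverted dropout: during training the surviving activations are scaled up
# by 1/(1 - rate), so the *expected* output matches inference -- but any
# single batch can come out larger or smaller than the inference output.
rng = np.random.RandomState(0)
x = np.ones(1000)
rate = 0.5

mask = rng.binomial(1, 1 - rate, size=x.shape)
train_out = x * mask / (1 - rate)  # training: drop neurons, rescale the rest
infer_out = x                      # inference: dropout is a no-op

# The means agree only in expectation; individual batches fluctuate.
print(train_out.mean(), infer_out.mean())
```

This is why a single training batch's loss can land on either side of the evaluation loss, even with identical weights.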


One may be confused (as I was) by seeing that, after one epoch of training, the training loss is not equal to the evaluation loss on the same batch of data. And this is not specific to models with Dropout or BatchNormalization layers. Consider this example:

from keras import layers, models
import numpy as np

model = models.Sequential()
model.add(layers.Dense(1000, activation='relu', input_dim=100))
model.add(layers.Dense(1))

model.compile(loss='mse', optimizer='adam')
x = np.random.rand(32, 100)
y = np.random.rand(32, 1)

print("Training:")
model.fit(x, y, batch_size=32, epochs=1)

print("\nEvaluation:")
loss = model.evaluate(x, y)
print(loss)

The output:

Training:
Epoch 1/1
32/32 [==============================] - 0s 7ms/step - loss: 0.1520

Evaluation:
32/32 [==============================] - 0s 2ms/step
0.7577340602874756

So why are the losses different if they have been computed over the same data, i.e. 0.1520 != 0.7577?

If you are asking this, it's because you, like me, have not paid enough attention: 0.1520 is the loss before updating the parameters of the model (i.e. before the backward pass, or backpropagation), and 0.7577 is the loss after the model's weights have been updated. Even though the data used is the same, the state of the model when computing those loss values is not the same. (Another question: so why has the loss increased after backpropagation? Simply because you have trained for just one epoch, so the weight updates are not stable yet.)
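The same effect can be reproduced without any Keras layers at all, with a single gradient step on a toy least-squares problem (all the numbers here are illustrative):

```python
import numpy as np

# One gradient step on a 1-D least-squares fit. The "training loss" a
# framework logs is computed *before* the weight update; re-evaluating
# afterwards gives a different number even on the same data.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = 0.0
lr = 0.01

loss_before = ((w * x - y) ** 2).mean()   # what fit() would log for the batch
grad = (2 * (w * x - y) * x).mean()       # d(loss)/dw
w -= lr * grad                            # the backward pass / weight update
loss_after = ((w * x - y) ** 2).mean()    # what evaluate() would then report

print(loss_before, loss_after)
```

The two numbers are computed with different weights, so they have no reason to coincide.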

To confirm this, you can also use the same data batch as the validation data:

model.fit(x, y, batch_size=32, epochs=1, validation_data=(x,y))

If you run the code above with this modified line, you will get an output like this (obviously the exact values may differ for you):

Training:
Train on 32 samples, validate on 32 samples
Epoch 1/1
32/32 [==============================] - 0s 15ms/step - loss: 0.1273 - val_loss: 0.5344

Evaluation:
32/32 [==============================] - 0s 89us/step
0.5344240665435791

You see that the validation loss and the evaluation loss are exactly the same: this is because validation is performed at the end of the epoch (i.e. when the model weights have already been updated).

today
  • I will try this out. But I did not train my model for just 1 epoch. What I wrote in my question was just a test phase; I did it after training the model for some time. During training the loss went down from ~1000 to ~1 and was pretty stable, and only after that did I start to evaluate. That line with model.fit() was just for testing purposes. – Andrey Kite Gorin Jul 01 '18 at 14:00
  • @AndreyKiteGorin read the note I added to my post. I couldn't confirm my claims... – today Jul 01 '18 at 14:01
  • Yes, I understand. I will try it and see what happens. Thank you for looking into the problem. – Andrey Kite Gorin Jul 01 '18 at 14:04
  • That is all correct, but it's not related to my case, since I do this single training step after having already trained the model for a long time and seen the loss become stable during training. The problem was that this stable loss, which I see during training, is very different from the one I get on evaluation. Sorry if my description of the problem was confusing. – Andrey Kite Gorin Jul 01 '18 at 17:25