My model is a CNN with multiple batch normalization (BN) and dropout (DO) layers. Originally, I accidentally put model.train() outside the training loop, like the following:

model.train()
for e in range(num_epochs):
    # train model
    model.eval()
    # eval model

For the record, the code above trained well and performed decently on the validation set:

[CV:02][E:001][I:320/320] avg. Loss: 0.460897, avg. Acc: 0.742746, test. acc: 0.708046(max: 0.708046)
[CV:02][E:002][I:320/320] avg. Loss: 0.389883, avg. Acc: 0.798791, test. acc: 0.823563(max: 0.823563)
[CV:02][E:003][I:320/320] avg. Loss: 0.319034, avg. Acc: 0.825559, test. acc: 0.834914(max: 0.834914)
[CV:02][E:004][I:320/320] avg. Loss: 0.301322, avg. Acc: 0.834254, test. acc: 0.834052(max: 0.834914)
[CV:02][E:005][I:320/320] avg. Loss: 0.292184, avg. Acc: 0.839575, test. acc: 0.835201(max: 0.835201)
[CV:02][E:006][I:320/320] avg. Loss: 0.285467, avg. Acc: 0.842266, test. acc: 0.837931(max: 0.837931)
[CV:02][E:007][I:320/320] avg. Loss: 0.279607, avg. Acc: 0.844917, test. acc: 0.829885(max: 0.837931)
[CV:02][E:008][I:320/320] avg. Loss: 0.275252, avg. Acc: 0.846443, test. acc: 0.827874(max: 0.837931)
[CV:02][E:009][I:320/320] avg. Loss: 0.270719, avg. Acc: 0.848150, test. acc: 0.822989(max: 0.837931)

While reviewing the code, however, I realized I had made a mistake: with the code above, the BN and DO layers are stuck in eval mode for every training epoch after the first, because model.eval() is set at the end of each epoch and model.train() is never called again.
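
For reference, here is a minimal sketch (a toy module, not my actual network) of what these mode switches toggle in PyTorch: model.train() and model.eval() set the training flag on every submodule, which changes how the BN and DO layers behave.

import torch.nn as nn

# Toy module just to illustrate the mode switch.
net = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.Dropout(0.5))

net.train()  # BN normalizes with per-batch stats and updates its running stats; Dropout is active
print(net[1].training, net[2].training)  # True True

net.eval()   # BN normalizes with its stored running stats; Dropout is a no-op
print(net[1].training, net[2].training)  # False False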

So I moved the model.train() call inside the loop:

for e in range(num_epochs):
    model.train()
    # train model
    model.eval()
    # eval model
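
(For reference, the fully written-out version of this loop is roughly the sketch below; train_loader, val_loader, optimizer, and criterion are placeholders for my actual objects.)

import torch

for e in range(num_epochs):
    model.train()                      # BN uses batch stats and updates running stats; DO active
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()                       # BN uses running stats; DO off
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.size(0)
    print(f"epoch {e + 1}: val. acc: {correct / total:.6f}")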

At this point, the model was learning relatively poorly (it seemed to overfit, as you can see in the following output). It had higher training accuracy but significantly lower accuracy on the validation set (which started to seem weird, considering the usual regularizing effect of BN and DO):

[CV:02][E:001][I:320/320] avg. Loss: 0.416946, avg. Acc: 0.750477, test. acc: 0.689080(max: 0.689080)
[CV:02][E:002][I:320/320] avg. Loss: 0.329121, avg. Acc: 0.798992, test. acc: 0.690948(max: 0.690948)
[CV:02][E:003][I:320/320] avg. Loss: 0.305688, avg. Acc: 0.829053, test. acc: 0.719540(max: 0.719540)
[CV:02][E:004][I:320/320] avg. Loss: 0.290048, avg. Acc: 0.840539, test. acc: 0.741954(max: 0.741954)
[CV:02][E:005][I:320/320] avg. Loss: 0.279873, avg. Acc: 0.848872, test. acc: 0.745833(max: 0.745833)
[CV:02][E:006][I:320/320] avg. Loss: 0.270934, avg. Acc: 0.854274, test. acc: 0.742960(max: 0.745833)
[CV:02][E:007][I:320/320] avg. Loss: 0.263515, avg. Acc: 0.856945, test. acc: 0.741667(max: 0.745833)
[CV:02][E:008][I:320/320] avg. Loss: 0.256854, avg. Acc: 0.858672, test. acc: 0.734483(max: 0.745833)
[CV:02][E:009][I:320/320] avg. Loss: 0.252013, avg. Acc: 0.861363, test. acc: 0.723707(max: 0.745833)
[CV:02][E:010][I:320/320] avg. Loss: 0.245525, avg. Acc: 0.865519, test. acc: 0.711494(max: 0.745833)

So I thought to myself, "I guess the BN and DO layers are having a negative effect on my model," and removed them. However, the model didn't perform well with the BN and DO layers removed either (in fact, it didn't seem to be learning anything):

[CV:02][E:001][I:320/320] avg. Loss: 0.552687, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:002][I:320/320] avg. Loss: 0.506234, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:003][I:320/320] avg. Loss: 0.503373, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:004][I:320/320] avg. Loss: 0.502966, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:005][I:320/320] avg. Loss: 0.502870, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:006][I:320/320] avg. Loss: 0.502832, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:007][I:320/320] avg. Loss: 0.502800, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:008][I:320/320] avg. Loss: 0.502765, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)

I was very confused at this point, so I went further and carried out another experiment: I put the BN and DO layers back into the model and tested the following:

for e in range(num_epochs):
    model.eval()
    # train model
    # eval model

It worked poorly:

[CV:02][E:001][I:320/320] avg. Loss: 0.562196, avg. Acc: 0.744774, test. acc: 0.689080(max: 0.689080)
[CV:02][E:002][I:320/320] avg. Loss: 0.506071, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:003][I:320/320] avg. Loss: 0.503234, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:004][I:320/320] avg. Loss: 0.502916, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:005][I:320/320] avg. Loss: 0.502859, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
[CV:02][E:006][I:320/320] avg. Loss: 0.502838, avg. Acc: 0.749071, test. acc: 0.689080(max: 0.689080)
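
(For reference, the sketch below uses a throwaway BatchNorm layer to show what eval mode does to the running statistics: they stay at their defaults of mean 0 and variance 1 and are never updated, whereas a forward pass in train mode moves them toward the batch statistics.)

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)                  # throwaway layer; defaults: running_mean=0, running_var=1
x = torch.randn(32, 4) * 5 + 3          # data whose statistics are far from (0, 1)

bn.eval()
_ = bn(x)
print(bn.running_mean, bn.running_var)  # unchanged: zeros and ones

bn.train()
_ = bn(x)
print(bn.running_mean, bn.running_var)  # updated toward the batch mean/variance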

I repeated the above experiments multiple times, and the results were not far from the outputs posted above. (The data I'm working with is fairly simple.)

To summarize, the model works best in a very specific setting (restated as a sketch after this list):

  1. Batch normalization and dropout layers added to the model. (This part is expected to help.)
  2. Train with model.train() on for the first epoch only. (Weird, in combination with 3.)
  3. Train with model.eval() on for all remaining epochs. (Weird as well.)
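
(Concretely, the schedule that happens to work is equivalent to the sketch below: only the first epoch trains in train mode, the BN running statistics accumulated during that epoch are frozen afterwards, and dropout is off for all later epochs.)

for e in range(num_epochs):
    if e == 0:
        model.train()   # epoch 1 only: per-batch BN stats, running stats updated, dropout on
    # train model         (epochs 2+ train with the model still in eval mode)
    model.eval()        # frozen running stats from epoch 1; dropout off
    # eval model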

To be honest, I would never have set up the training procedure this way on purpose (I don't think anyone would), but it works well for some reason. Has anybody experienced anything similar, or could you explain why the model behaves this way? It would be much appreciated!

Thanks in advance!!

subbie
  • is your data [normalized](https://stackoverflow.com/a/57252898/1714410)? – Shai Oct 22 '19 at 11:34
  • @Shai yes, it's 0-1 normalized. Besides, even if it weren't, model.eval() mode should behave the same way as the one with BN layers and DO layers removed, but it doesn't – subbie Oct 22 '19 at 14:13
  • `eval()` does not remove BN layers, but rather fixes the normalization to the values obtained during training. – Shai Oct 22 '19 at 14:17
  • Can you check which layer (specific BN, or specific DO) is responsible for this behavior? Can you repeat these experiments adding one "suspicious" layer at a time? – Shai Oct 22 '19 at 14:19
  • During `train()` the batch norm layers use only the current batch's statistics to normalize the data. It could be that the batch size is too small, so the normalization isn't effective. As @shai pointed out, the batch norm isn't removed during `eval` mode; simply the running statistics are used instead of batch statistics, and the running statistics aren't updated. – jodag Oct 22 '19 at 22:04
  • @jodag (1) If the normalization is not effective, removing the batch norm layers shouldn't affect much, but it does as you can see in the 2nd experiment. (2) If using stats of THIS batch and NOT updating running stats is preferred in terms of training, setting eval() before training should work well, but it doesn't either. (3) It seems like using model.train() ONLY for the first iteration performs well and that's where I got lost. – subbie Oct 22 '19 at 23:03
  • @Shai Thanks, I only meant to say ineffective dropout in eval mode (not batchnorm). I'll try to add those layers one by one and post updates! – subbie Oct 22 '19 at 23:07
  • @subbie I never said batch norm wasn't effective, I'm saying the result of your first experiment doesn't imply that. Running batch norm in `eval` mode without any training is going to use the default values for the running mean and running variance (not sure what these are by default). If you instead run one epoch of training, the batch norm will use the running statistics learned in epoch 1 for the remainder of training and will no longer update running stats after epoch 1. – jodag Oct 22 '19 at 23:20
  • @jodag I apologize for the unclear statement. I understand that you were not saying that the normalization was ineffective. After reading your last comment, I understand the difference between running eval() mode from the start and running one epoch of train() mode and eval() thereafter. Thanks! But I still don't understand how updating the running mean and variance every epoch is worse than updating them once and using them for the rest of training. Do you have any insights? – subbie Oct 23 '19 at 01:36
