I have a deep neural network made of a combination of modules, such as an encoder, a decoder, etc. Before training, I load part of its parameters from a pretrained model, but only for a subset of the modules. For instance, I could load a pretrained encoder. Then I want to freeze the parameters of the pretrained modules so that they are not trained along with the rest. In PyTorch:
for param in submodel.parameters():
    param.requires_grad = False
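For reference, this is roughly how I load and freeze the pretrained part (the file name "encoder.pth" and the attribute model.encoder are just placeholders for my actual setup):

import torch

# load pretrained weights into just one submodule of the full model
state_dict = torch.load("encoder.pth", map_location="cpu")  # placeholder checkpoint
model.encoder.load_state_dict(state_dict)

# freeze that submodule so the optimizer never updates its parameters
for param in model.encoder.parameters():
    param.requires_grad = False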
Now, should I keep applying dropout to these frozen modules during training, or should I deactivate it (see example below)? Why?
class MyModel(nn.Module):
    ...
    def forward(self, x):
        if self.freeze_submodule:
            self.submodule.eval()   # disable dropout when the submodule is frozen
        x = self._forward(x)
        if self.freeze_submodule:
            self.submodule.train()  # restore training mode afterwards
        return x
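To illustrate why I'm asking: as far as I can tell, setting requires_grad=False by itself does not change dropout behavior; only train()/eval() does. A quick sanity check (the submodule here is just a toy example):

import torch
import torch.nn as nn

submodule = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
for param in submodule.parameters():
    param.requires_grad = False  # frozen, but the module is still in train() mode

x = torch.ones(1, 8)
print(submodule(x))  # dropout still randomly zeroes activations
submodule.eval()
print(submodule(x))  # dropout disabled, output is deterministic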