
I am trying to extract BERT embeddings and reproduce this code using TensorFlow instead of PyTorch. I know `tf.stop_gradient()` is the equivalent of `torch.no_grad()`, but what about `model.eval()`, or the combination of both?

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers. 
with torch.no_grad():

    outputs = model(tokens_tensor, segments_tensors)

    # Evaluating the model will return a different number of objects based on 
    # how it's configured in the `from_pretrained` call earlier. In this case, 
    # because we set `output_hidden_states = True`, the third item will be the 
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]

1 Answer


TL;DR: `eval` and `no_grad` are two completely different things, but they are often used together, primarily for performing fast inference in evaluation/testing loops.

The `nn.Module.eval` function is called on a PyTorch module and switches it (and its submodules) between two behaviours depending on the stage: training or evaluation. Only a handful of layer types are actually affected by this: layers such as dropout and normalization layers behave differently depending on whether they are in training or evaluation mode. You can read more about it on this thread.
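For example, here is a minimal sketch (not part of the original answer) showing how a dropout layer behaves in each mode:

import torch
import torch.nn as nn

# Hypothetical standalone example: dropout changes behaviour with the mode.
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()    # training mode: about half the values are zeroed,
print(drop(x))  # the survivors are scaled by 1 / (1 - p)

drop.eval()     # evaluation mode: dropout is a no-op,
print(drop(x))  # the input passes through unchanged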

The `torch.no_grad` utility, on the other hand, is a context manager: it changes the way the code contained inside its scope runs. When applied, `no_grad` prevents gradient computation. In practice, this means no layer activations are cached in memory for backpropagation. It is most commonly used in evaluation and testing loops, where no backpropagation is expected after an inference. However, it can also be used during training, for example when running inference on a frozen component through which gradients are not required to flow.
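A minimal sketch (again, not part of the original answer) of what `no_grad` changes:

import torch

# Hypothetical standalone example: inside torch.no_grad(), operations on
# tensors that require gradients build no graph, so nothing can be
# backpropagated through them.
w = torch.randn(3, 3, requires_grad=True)
x = torch.randn(3)

y = w @ x
print(y.requires_grad)   # True  -> the activation is tracked for backprop

with torch.no_grad():
    z = w @ x
print(z.requires_grad)   # False -> no graph is built, nothing is cached for backprop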

Ivan
  • Can we say that `eval` enables `no_grad` by nature? – The Exile Jun 23 '22 at 11:36
  • No, `eval` and `no_grad` are two independent things. You can still compute gradients when evaluation mode is on! Sometimes you want a particular component to be in eval mode **even though you're in training mode**; this is often the case when freezing components... – Ivan Jun 23 '22 at 13:35
  • I see, but doesn't "freezing some components" mean computing no gradients for them? I mean, let's say we have a pre-trained feature extractor model for an image and we put a classifier layer on top of it. If we just want to fine-tune the classifier, then we should use `torch.no_grad` for the feature extractor, right? – The Exile Jun 23 '22 at 14:26
  • Yes, this would be the case: you would freeze your network with `requires_grad_(False)` or a `torch.no_grad()` context. *Additionally*, you might also want to switch your feature extractor to eval mode (this would make normalization layers use their fixed running stats and remove the stochasticity of dropout layers). Do you see the difference? There are scenarios where you need `torch.no_grad` alone, and others where you need both (a sketch of this setup follows below). – Ivan Jun 23 '22 at 14:45
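For completeness, a rough sketch of the setup discussed in these comments; the `feature_extractor` and `classifier` modules below are placeholders, not taken from the original exchange:

import torch
import torch.nn as nn

# Placeholder modules standing in for a pre-trained backbone and a new head.
feature_extractor = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
classifier = nn.Linear(8 * 30 * 30, 10)

# Freeze the extractor: no gradients are computed for its parameters...
feature_extractor.requires_grad_(False)
# ...and switch it to eval mode: BatchNorm uses its running stats, dropout is disabled.
feature_extractor.eval()

optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2)

images = torch.randn(4, 3, 32, 32)   # dummy batch
labels = torch.randint(0, 10, (4,))

features = feature_extractor(images)                 # no graph is kept for the frozen part
logits = classifier(features.flatten(start_dim=1))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                      # gradients only reach the classifier
optimizer.step()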