I'm following a PyTorch tutorial which uses the BERT NLP model (feature extractor) from the Huggingface Transformers library. There are two pieces of interrelated code for gradient updates that I don't understand.
(1) torch.no_grad()
The tutorial has a class where the forward() function creates a torch.no_grad() block around the call to the BERT feature extractor, like this:
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

class BERTGRUSentiment(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert

    def forward(self, text):
        # the BERT forward pass runs without autograd tracking
        with torch.no_grad():
            embedded = self.bert(text)[0]
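For context, my (possibly incomplete) understanding of torch.no_grad() is that it stops autograd from recording the operations inside the block, so no computation graph or intermediate activations are kept for them. A minimal sketch of that behaviour:

import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = x * 2           # not recorded by autograd; no graph is built
print(y.requires_grad)  # False

z = x * 2               # recorded as usual
print(z.requires_grad)  # True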
(2) param.requires_grad = False
There is another portion in the same tutorial where the BERT parameters are frozen.
for name, param in model.named_parameters():
    if name.startswith('bert'):
        param.requires_grad = False
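As I understand it (please correct me if this is wrong), setting requires_grad = False means autograd never computes gradients for those parameters, so they cannot be updated even if they are handed to an optimizer. A small sketch of that effect:

import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
layer.weight.requires_grad = False      # freeze the weight; the bias stays trainable

loss = layer(torch.randn(1, 4)).sum()
loss.backward()

print(layer.weight.grad)  # None -> no gradient for the frozen parameter
print(layer.bias.grad)    # a tensor -> the bias still receives a gradient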
When would I need (1) and/or (2)?
- If I want to train with a frozen BERT, would I need to enable both?
- If I want to train and let BERT be updated (fine-tuned), would I need to disable both?
Additionally, I ran all four combinations and found:
   with torch.no_grad()   requires_grad = False   Trainable params   Result
   --------------------   ---------------------   ----------------   --------------------
a. Yes                    Yes                     3M                 Trained successfully
b. Yes                    No                      112M               Trained successfully
c. No                     Yes                     3M                 Trained successfully
d. No                     No                      112M               CUDA out of memory
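(For reference, the "Trainable params" column was counted by summing only the parameters with requires_grad=True, roughly like this; the model variable is the one from the snippets above:)

def count_trainable_parameters(model):
    # count only parameters that autograd will update
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. print(f'{count_trainable_parameters(model):,} trainable parameters')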
Can someone please explain what's going on? Why am I getting CUDA out of memory for (d) but not (b)? Both have 112M trainable parameters.