331

Why does zero_grad() need to be called during training?

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.
Mateen Ulhaq
user1424739

7 Answers

467

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before computing new gradients and updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behavior is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.

Because of this, when you start your training loop, you should ideally zero out the gradients so that the parameter update is done correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly-computed gradient. It would therefore point in some direction other than the intended direction towards the minimum (or maximum, in the case of maximization objectives).
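
A quick way to see this accumulation behavior is to call .backward() twice without zeroing in between (a minimal sketch):

import torch

x = torch.ones(3, requires_grad=True)

(2 * x).sum().backward()
print(x.grad)  # tensor([2., 2., 2.])

# A second backward pass *adds* to the existing gradients instead of replacing them.
(2 * x).sum().backward()
print(x.grad)  # tensor([4., 4., 4.])

x.grad.zero_()  # reset before the next backward pass
print(x.grad)  # tensor([0., 0., 0.])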

Here is a simple training-loop example:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all parameters
    # tracked by this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()  # reduce to a scalar before backward()
    loss.backward()
    optimizer.step()

Alternatively, if you're doing vanilla gradient descent, then:

learning_rate = 0.01

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of W and b
    # (.grad is None until the first backward pass)
    if W.grad is not None:
        W.grad.zero_()
    if b.grad is not None:
        b.grad.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()  # reduce to a scalar before backward()
    loss.backward()

    # update the parameters without recording the update in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad

Note:

  • The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
  • As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True), instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully (see the snippet below).
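
A minimal sketch of what set_to_none does, continuing the W, b example above:

optimizer.zero_grad(set_to_none=True)
# The gradients are now None rather than zero-filled tensors; fresh gradient
# tensors will be allocated on the next loss.backward() call.
print(W.grad, b.grad)  # None None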
kmario23
  • 4
    thank you very much, this is really helpful! Do you happen to know whether TensorFlow has the same behaviour? – layser Oct 06 '19 at 09:33
  • 1
    Just to be sure.. if you don't do this, then you will run into an exploding gradient problem, right? – zwep Dec 13 '19 at 10:59
  • 26
    @zwep If we accumulate gradients, it doesn't mean their magnitude increases: an example would be if the sign of the gradient keeps flipping. So it wouldn't guarantee you'd run into the exploding gradient problem. Besides, exploding gradients exist even if you zero correctly. – Tom Roth Apr 14 '20 at 05:18
  • When you run the vanilla gradient descent do you not get a "leaf Variable that requires grad has been used in an in-place operation" error when you try to update the weights? – MUAS Jul 09 '20 at 18:17
  • In other words , its done to set the variable delta_w and delta_b back to zero. – StanGeo Apr 15 '21 at 10:33
  • 1
    A follow-up question on this: so you're saying we shouldn't call optimizer.zero_grad() when training RNN models such as LSTM, for example? – Loqz May 20 '21 at 09:40
  • 1
    Why `optimizer.zero_grad()` is before `output = linear_model(sample, W, b)` ? – mrgloom Jun 03 '21 at 15:52
  • Thanks for the explanation. I have an additional question: What happens if I have two networks net_A, and net_B which are interconnected? If I set `net_B.parameters()[i].requires_grad = False`, and then compute the gradient w.r.t `net_A`, Would the gradients of `net_A.parameters()` be affected by nonsense values stored in `net_B.parameters()`? – C-3PO Jun 06 '21 at 22:33
  • 1
    Can someone answer @Loqz's question? I'm wondering about that too. Do you need to call `zero_grad()` when training an **RNN**? – Alaa M. Jul 30 '21 at 18:51
  • do you do this for validation as well or only train? I also have loss.backward in validation step. I understand we shouldn't do this in test phase since it is only one epoch. – Mona Jalal Dec 08 '21 at 00:17
  • Is this the same gradient as in `gradient decent` please thanks. – Alain Michael Janith Schroter Apr 01 '23 at 09:55
  • @AlaaM. `.zero_grad` needs to be called at some point to [throw away outdated information](https://stackoverflow.com/a/76645793/365102). – Mateen Ulhaq Jul 09 '23 at 04:20
23

Although the idea can be derived from the chosen answer, I feel it is worth stating explicitly.

Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives you more freedom in how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one training batch does not fit on the GPU.

In this example, there are two arguments, named train_batch_size and gradient_accumulation_steps.

  • train_batch_size is the batch size for the forward pass, followed by loss.backward(). This is limited by the GPU memory.

  • gradient_accumulation_steps determines the effective training batch size: the losses (and hence gradients) from multiple forward passes are accumulated before a single optimizer step. This is NOT limited by the GPU memory.

From this example, you can see how optimizer.zero_grad() may follow optimizer.step() but NOT every loss.backward(): loss.backward() is invoked in every single iteration (line 216), while optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated training batches equals gradient_accumulation_steps (line 227, inside the if block at line 219). A rough sketch of this pattern is shown below.
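
A minimal sketch of the pattern (the names model, loss_fn, train_loader and accumulation_steps are illustrative, not taken from the linked example):

# Gradient accumulation: simulate a large effective batch on limited GPU memory.
accumulation_steps = 4  # effective batch size = train_batch_size * accumulation_steps

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()  # gradients keep accumulating in .grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradients
        optimizer.zero_grad()  # start accumulating for the next effective batch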

Also, someone asked about the equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose.

Michael Mior
jerryIsHere
  • This relates to training large models with limited GPU memory. Your ideas are expanded on in this nice post: https://towardsdatascience.com/i-am-so-done-with-cuda-out-of-memory-c62f42947dca – Under-qualified NASA Intern Oct 25 '21 at 22:39
3

zero_grad() restarts the loop without the gradients from the last step, if you are using a gradient method for decreasing the error (or loss).

If you do not call zero_grad(), the accumulated gradients can push the update in the wrong direction, so the loss may increase instead of decreasing as required.

For example:

If you use zero_grad() you will get the following output:

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

If you do not use zero_grad() you will get the following output:

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
Divya Bansal
Youssri Abo Elseod
  • 7
    This is confusing to say the least. What looping gets restarted? Loss increase/decrease is affected indirectly, it can increase when you do `.zero_grad()` and it can decrease when you don't. Where are the outputs you're showing coming from? – dedObed Feb 16 '21 at 09:20
  • Dear dedObed (this example is for when you remove zero_grad from otherwise correct code). We are talking about the .zero_grad() function; this function only restarts the loop without the last result. If the loss keeps increasing, you should review your input (write your problem in a new topic and give me the link). – Youssri Abo Elseod Feb 16 '21 at 09:32
  • 7
    I (think I) do understand PyTorch well enough. I'm just pointing out what I perceive as flaws in your answer -- it's not clear, it draws quick conclusions, and it shows outputs of who-knows-what. – dedObed Feb 16 '21 at 09:40
3

Why gradients?

The gradients suggest to the optimizer what direction to step in. Every time you process a batch of inputs with .backward(), you accumulate "suggestions" of where to step. Notice that a suggestion is much weaker than a decision. When you call optimizer.step(), the optimizer uses these suggestions to make actual decisions of where to actually step. These decisions may be influenced by the learning rate, past steps (e.g. momentum), and past weights (e.g. SWA). The optimizer reads the suggestions and then steps in a direction that it hopes will minimize future losses.

loss.backward()        # Compute gradients.
optimizer.step()       # Tell the optimizer the gradients, then step.
optimizer.zero_grad()  # Zero the gradients to start fresh next time.

Why zero the gradients?

Once you've completed a step, you don't really need to keep track of your previous suggestion (i.e. gradients) of where to step. By zeroing the gradients, you are throwing away this information. Some optimizers already keep track of this information automatically and internally.

With the next batch of inputs, you begin from a clean slate to suggest where to step next. This suggestion is pure and not influenced by the past. You then feed this "pure" information to the optimizer, which then decides exactly where to step.

Of course, you can decide to hold onto previous gradients, but that information is somewhat outdated since you're in an entirely new spot on the loss surface. Who is to say that the best direction to go next is still the same as the previous? It might be completely different! That's why most popular optimization algorithms throw most of that outdated information away (by zeroing the gradients).



Another alternative: Deleting gradients completely (instead of zeroing)

Instead of zeroing the gradients, you can also delete them entirely. The PyTorch performance tuning guide suggests:

# INSTEAD OF:
model.zero_grad()
# or
optimizer.zero_grad()
# CONSIDER:
for param in model.parameters():
    param.grad = None

...but one of the developers mentions this in a comment from 5 years ago:

The main difference is that the Tensor containing the gradients will not be reallocated at every backward pass. Since memory allocation is quite expensive (especially on GPU), this is much more efficient.

There are other subtle differences between the two, like some optimizers that behave differently if a gradient is 0 or None. I am sure there are other places that behave like that.

...On the other hand, in-place operations are usually not considered necessary and may even be suboptimal in some cases, so I guess YMMV w.r.t. the performance of either method.

Mateen Ulhaq
1

In simple terms, we need zero_grad()

because when we start a training loop, we do not want past gradients or past results to interfere with the current results. PyTorch accumulates the gradients during backpropagation, so if gradients from the past mix in, they may give us wrong results. Therefore we set the gradients to zero every time we go through the loop. Here is an example:


# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
    model_1.train()                  # put the model in training mode

    y_pred = model_1(X_train)        # forward pass

    loss = loss_fn(y_pred, y_train)  # compute the loss

    optimizer.zero_grad()            # clear gradients from the previous iteration

    loss.backward()                  # backpropagate: compute fresh gradients

    optimizer.step()                 # update the parameters

In this for loop, if we do not zero the gradients every time, the values from past iterations get added to the new gradients and change the result. So we use zero_grad() to avoid facing wrongly accumulated results.

Oren
0

You don't have to call zero_grad(); alternatively, you can decay the gradients, for example:

optimizer = some_pytorch_optimizer
# decay the grads :
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
                
            '''
            p.grad = p.grad / 2

This way, the learning is much more continuous.

DataYoda
0

During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs). When we start backpropagation, we want to update the weights in order to minimize the loss of our cost function. So we clear off the previous gradients in order to obtain better weights in the next update. We keep doing this during training, and we do not perform it during testing, because at test time we only use the weights obtained during training, which best fit our data, and no gradients are computed. Hope this makes it clearer!
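
A minimal sketch of this training-vs-testing difference (the names model, loss_fn, optimizer, train_loader and test_loader are illustrative):

model.train()                      # training: gradients are computed and must be zeroed
for x, y in train_loader:
    optimizer.zero_grad()          # clear gradients left over from the previous batch
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

model.eval()                       # testing: no backward pass, so there is nothing to zero
with torch.no_grad():
    for x, y in test_loader:
        preds = model(x)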