103

What is the correct way to perform gradient clipping in PyTorch?

I have an exploding gradients problem.

– Gulzar
  • https://discuss.pytorch.org/t/proper-way-to-do-gradient-clipping/191 – p13rr0m Feb 15 '19 at 20:23
  • @pierrom Thanks. I found that thread myself. Thought that asking here would save everyone who comes after me and googles for a quick answer the hassle of reading through all the discussion (which I haven't finished yet myself), and just getting a quick answer, stackoverflow style. Going to forums to find answers reminds me of 1990. If no one else posts the answer before me, then I will once I find it. – Gulzar Feb 15 '19 at 20:26

4 Answers

175

A more complete example from here:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()

# clip the total gradient norm to args.clip before the optimizer update
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()
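
As the comments below discuss, args.clip here is just the clipping threshold read from the script's command-line arguments. A minimal sketch of how it might be defined (the argument name and default value are illustrative, not from the original answer):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--clip", type=float, default=0.25,
                    help="max norm used for gradient clipping")
args = parser.parse_args()

# later, in the training loop, between loss.backward() and optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)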
– Rahul
  • Why is this more complete? I see the higher vote count, but don't really understand why this is better. Can you explain please? – Gulzar Oct 28 '20 at 11:26
  • This simply follows a popular pattern, where one can insert torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) between the loss.backward() and optimizer.step() – Rahul Oct 29 '20 at 15:33
  • What is args.clip? – Farhang Amaji Dec 03 '21 at 11:45
  • Does it matter if you call `opt.zero_grad()` before the forward pass or not? My guess is that the sooner it's zeroed out, perhaps the sooner MEM freeing happens? – Charlie Parker Jan 21 '22 at 20:02
  • @FarhangAmaji the `max_norm` (clipping threshold) value from the `args` (perhaps from the `argparse` module) – vdi Jan 28 '22 at 06:45
  • For "args.clip" you can use 0.01; e.g., torch.nn.utils.clip_grad_norm_(model.parameters(), 0.01) – russian_spy May 27 '22 at 20:49
104

clip_grad_norm (which is actually deprecated in favor of clip_grad_norm_, following the more consistent syntax of a trailing _ when in-place modification is performed) clips the norm of the overall gradient, computed as if the gradients of all parameters passed to the function were concatenated into a single vector, as can be seen from the documentation:

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
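
A quick sketch of this behavior (tensors and values are arbitrary): the function returns the total norm computed over all gradients as one vector, and rescales every gradient by the same factor:

import torch

w1 = torch.nn.Parameter(torch.randn(3))
w2 = torch.nn.Parameter(torch.randn(5))
(w1.sum() ** 2 + w2.sum() ** 2).backward()

# returns the total norm of all gradients, viewed as one concatenated vector
total_norm = torch.nn.utils.clip_grad_norm_([w1, w2], max_norm=1.0)
print(total_norm)                              # norm before clipping
print(torch.cat([w1.grad, w2.grad]).norm())    # at most ~1.0 after clipping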

From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in place:

torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)

Another option is to register a backward hook. This takes the current gradient as an input and may return a tensor which will be used in place of the previous gradient, i.e. modifying it. This hook is called each time after a gradient has been computed, i.e. there's no need for manual clipping once the hook has been registered:

# clamp each gradient element to [-clip_value, clip_value] as soon as it is computed
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
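
As the comments below point out, this hook clamps individual gradient elements (value clipping) rather than a norm. A hedged sketch of a per-parameter variant that clips each parameter's own gradient norm instead (note this is still not the global norm that clip_grad_norm_ computes; max_norm is an illustrative threshold):

max_norm = 1.0

def norm_clip_hook(grad):
    # rescale this parameter's gradient if its own L2 norm exceeds max_norm
    norm = grad.norm()
    return grad * (max_norm / norm) if norm > max_norm else grad

for p in model.parameters():
    p.register_hook(norm_clip_hook)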
– a_guest
  • It is worth mentioning here that these two approaches are NOT equivalent. The latter approach with registering a hook is definitely what most people want. The difference between these two approaches is that the latter approach clips gradients DURING backpropagation and the first approach clips gradients AFTER the entire backpropagation has taken place. – c0mr4t Feb 02 '22 at 23:30
  • And why do we want to clip the gradients DURING backpropagation and not AFTER it? Trying to understand why the latter is more desirable than the first. – NikSp Jan 12 '23 at 17:02
  • @NikSp If you clip *during* backpropagation then the clipped gradients propagate to the upstream layers. Otherwise, the raw gradients propagate upstream and this might saturate the gradients for those upstream layers (if clipping would be performed *after* backpropagation). If the gradients of all layers saturate at the threshold (clip) value this might prevent convergence. – a_guest Jan 31 '23 at 20:48
  • Could you expand on how to make sure the latter does L2 norm clipping? It currently looks like it is simply clipping the absolute value of individual elements. Also, does `register_hook` work only on gradients? Because I would have expected something like `param.grad`. TIA. – sachinruk Apr 05 '23 at 11:59
  • While registering a hook is a fine option, it doesn't seem like the hook in the answer is applying a norm clipping. It's clipping the individual elements rather than the norm of the elements of the gradient. – Shiania White Jul 27 '23 at 04:35
14

Reading through the forum discussion gave this:

clipping_value = 1  # arbitrary value of your choosing
# note the trailing underscore; clip_grad_norm without it is deprecated
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)

I'm sure there is more depth to it than this code snippet alone.

– Gulzar
10

And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping, as AMP scales the gradients:

optimizer.zero_grad()
loss = model(data, targets)
scaler.scale(loss).backward()

# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)

# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)

# Updates the scale for next iteration.
scaler.update()

Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping
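
For completeness, a hedged sketch of the setup the linked example assumes around this snippet (the scaler is created once outside the loop and the forward pass runs under autocast; model, optimizer, loader and the max_norm value here are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()  # created once, before the training loop

for data, targets in loader:
    optimizer.zero_grad()

    # forward pass in mixed precision
    with torch.autocast(device_type="cuda"):
        loss = model(data, targets)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping, as above
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()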

– hkchengrex