109

How do I add L1/L2 regularization in PyTorch without manually computing it?

Mateen Ulhaq
Wasi Ahmad

8 Answers

92

Use weight_decay > 0 for L2 regularization:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
Mateen Ulhaq
devil in the detail
  • 3
    With the SGD optimizer, L2 regularization can be obtained via `weight_decay`. For the Adam optimizer, however, `weight_decay` and L2 regularization are different. More can be read here: https://openreview.net/pdf?id=rk6qdGgCZ – Ashish Dec 04 '21 at 05:46
  • 1
    @Ashish your comment is correct that `weight_decay` and L2 regularization is different but in the case of PyTorch's implementation of Adam, they actually implement L2 regularization instead of true weight decay. Note that the weight decay term is applied to the gradient before the optimizer step [here](https://github.com/pytorch/pytorch/blob/40d1f77384672337bd7e734e32cb5fad298959bd/torch/optim/_functional.py#L94) – Eric Wiener Jan 21 '22 at 16:18
  • 1
    How about L1 regularization? – Ynjxsjmh Apr 10 '23 at 07:35
80

See the documentation. Add a weight_decay parameter to the optimizer for L2 regularization.

Mateen Ulhaq
Kashyap
  • 10
    Adagrad is an optimization technique, I am talking about regularization. Can you give me a concrete example with L1 and L2 loss? – Wasi Ahmad Mar 10 '17 at 20:34
  • 19
    Ya, the L2 regularisation is mysteriously added in the Optimization functions because loss functions are used during Optimization. You can find the discussion here https://discuss.pytorch.org/t/simple-l2-regularization/139/3 – Kashyap Mar 11 '17 at 03:42
  • 5
    I have some branches using L2 loss, so this is not useful. (I have different loss functions) – dashesy May 25 '18 at 20:59
  • 6
    What if I want use L1 or some other loss for regularization? – mrgloom Mar 03 '21 at 21:36
  • 1
    @mrgloom you can implement that yourself. It is not included with the optimizers. – Eric Wiener Jan 21 '22 at 16:19
45

The previous answers, while technically correct, are inefficient performance-wise and not very modular (it is hard to apply them on a per-layer basis, as provided by, say, Keras layers).

PyTorch L2 implementation

Why does PyTorch implement L2 inside torch.optim.Optimizer instances?

Let's take a look at the torch.optim.SGD source code (currently written as a functional optimization procedure), especially this part:

for i, param in enumerate(params):
    d_p = d_p_list[i]
    # L2 weight decay specified HERE!
    if weight_decay != 0:
        d_p = d_p.add(param, alpha=weight_decay)
  • d_p (the derivative of the parameter, i.e. its gradient) is modified and re-assigned for faster computation (no temporary variables are saved)
  • It has O(N) complexity without any complicated math like pow
  • It does not extend the autograd graph needlessly

Compare that to adding the penalty to the loss, which costs O(n) squaring (**2) operations plus an addition, all of which also take part in backpropagation.

Math

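Let's look at the L2 equation with regularization factor alpha (the same could be done for L1, of course):

$$\mathcal{L}_{\text{total}}(w) = \mathcal{L}(w) + \frac{\alpha}{2} \sum_i w_i^2$$

If we take the derivative of any loss with L2 regularization w.r.t. the parameters w (the regularization term is independent of the loss), we get:

$$\frac{\partial \mathcal{L}_{\text{total}}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial w_i} + \alpha w_i$$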

So it is simply an addition of alpha * weight for gradient of every weight! And this is exactly what PyTorch does above!
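The equivalence can be checked numerically. The following is a minimal sketch, assuming plain SGD without momentum and a penalty written as 0.5 * alpha * sum(w ** 2); the toy loss `data_loss` and the values of `alpha` and `lr` are illustrative assumptions, not part of the original answer:

```python
import torch

torch.manual_seed(0)
alpha, lr = 0.1, 0.1  # illustrative values (assumptions)

# Two identical copies of the same parameter vector
w_a = torch.randn(5, requires_grad=True)
w_b = w_a.detach().clone().requires_grad_(True)
x = torch.randn(5)


def data_loss(w):
    # Hypothetical toy loss, used only for this comparison
    return ((w * x).sum() - 1.0) ** 2


# (a) L2 penalty added to the loss and differentiated by autograd
opt_a = torch.optim.SGD([w_a], lr=lr)
(data_loss(w_a) + 0.5 * alpha * w_a.pow(2).sum()).backward()
opt_a.step()

# (b) no penalty in the loss; the optimizer adds alpha * w to the gradient
opt_b = torch.optim.SGD([w_b], lr=lr, weight_decay=alpha)
data_loss(w_b).backward()
opt_b.step()

same_update = torch.allclose(w_a, w_b, atol=1e-6)
print(same_update)
```

Both paths should produce the same parameters after one step, up to floating-point rounding; with momentum or adaptive optimizers this comparison is not expected to hold in general.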

L1 Regularization layer

Using this (and some PyTorch magic), we can come up with a quite generic L1 regularization layer, but let's first look at the derivative of L1 (sgn is the signum function, returning 1 for positive input, -1 for negative input, and 0 for 0):

$$\frac{\partial}{\partial w_i} \Big( \alpha \sum_j |w_j| \Big) = \alpha \, \operatorname{sgn}(w_i)$$

The full code, with a WeightDecay interface, is located in the torchlayers third-party library, which provides things like regularizing only weights, only biases, or specifically named parameters (disclaimer: I'm the author). The essence of the idea is outlined below (see comments):

import torch


class L1(torch.nn.Module):
    def __init__(self, module, weight_decay):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay

        # Register a full backward hook on the wrapped module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)

    # The hook ignores the incoming gradient values; *_ is a placeholder
    def _weight_decay_hook(self, *_):
        for param in self.module.parameters():
            # Optional, safer guard when accumulating gradients: apply only
            # if there is no gradient yet or it was zeroed out
            # (usually by optimizer.zero_grad()):
            # if param.grad is None or torch.all(param.grad == 0.0):

            # Set the gradient to the L1 regularization term
            param.grad = self.regularize(param)

    def regularize(self, parameter):
        # Derivative of the L1 penalty: weight_decay * sgn(w)
        return self.weight_decay * torch.sign(parameter.data)

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)

Read more about hooks in this answer or the respective PyTorch docs if needed.

Usage is also pretty simple (it should work with gradient accumulation and any PyTorch layer):

# weight_decay is a required argument; 1e-4 is only an illustrative value
layer = L1(torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3), weight_decay=1e-4)

Side note

As a side note, L1 regularization is not implemented in PyTorch's optimizers, as it does not actually induce sparsity in the sense of weights becoming exactly zero (lost citation; it was some GitHub issue on the PyTorch repo, I think. If anyone has it, please edit).

More often, weight values are thresholded (simply set to zero) once they fall below some small predefined magnitude (say 0.001).
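The thresholding step could look roughly like the sketch below. The helper name `threshold_weights_` and the default `eps=1e-3` are hypothetical choices for illustration; they are not part of PyTorch or torchlayers:

```python
import torch


def threshold_weights_(module, eps=1e-3):
    """Set every parameter entry with |value| < eps to zero, in place.

    Hypothetical helper; returns the number of entries below the threshold.
    """
    zeroed = 0
    with torch.no_grad():  # do not record the edit in the autograd graph
        for param in module.parameters():
            mask = param.abs() < eps
            zeroed += int(mask.sum())
            param.masked_fill_(mask, 0.0)
    return zeroed


# Small demonstration with hand-set values
layer = torch.nn.Linear(4, 2)
with torch.no_grad():
    layer.weight.fill_(0.5)
    layer.weight[0, 0] = 5e-4  # below the threshold
    layer.bias.fill_(1.0)

n_zeroed = threshold_weights_(layer)
print(n_zeroed, layer.weight[0, 0].item())
```

A call like this would typically run after an optimizer step or at the end of training; how often to apply it is a design choice left open here.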

Szymon Maszke
  • Would you like to make a new tag for `torchlayers` and release it with `L1` and `L2` because they are still missing in version 0.1.1 released more that 1 year ago? – Maxim Egorushkin May 25 '21 at 13:42
  • @MaximEgorushkin could you try the nightly release? It should be there although not thoroughly tested as of yet, new release is planned in the upcoming 2 months (together with other libraries) – Szymon Maszke May 25 '21 at 13:44
  • 1
    Nightly has `L1` and `L2`, thank you. There is a warning though `~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py:785: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.` – Maxim Egorushkin May 25 '21 at 13:51
  • @MaximEgorushkin See [this PR](https://github.com/szymonmaszke/torchlayers/pull/12), you can update yours version accordingly (if tests pass it will be released today @ 00:00 GMT). – Szymon Maszke May 25 '21 at 14:00
  • This looks great - I would like to ask: how come you assign the output of regularize(param) to the gradient instead of adding regularize(param) to the gradient? Does the necessary addition happen automatically? – stochasticmrfox Dec 15 '22 at 04:20
  • @stochasticmrfox If we were to add we might accumulate it during gradient accumulation (during each backward). What we want is to keep either `-1` or `1` until the optimizer uses this feedback. Otherwise it could grow to, say -16 for 16 steps of gradient accumulation. – Szymon Maszke Apr 19 '23 at 21:44
  • https://github.com/torch/optim/pull/41#issuecomment-73935805, but it does not exactly support your assertion. – Kevin Yin Apr 20 '23 at 03:10
28

For L2 regularization,

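l2_lambda = 0.01
l2_reg = torch.tensor(0.)

for param in model.parameters():
    # Sum of squares, i.e. the squared 2-norm
    # (torch.norm alone returns the unsquared norm and is deprecated)
    l2_reg = l2_reg + param.pow(2).sum()

loss = loss + l2_lambda * l2_reg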


iacob
Sherif Ali
  • 2
    Shouldn't one need to exclude non-trainable parameters? – Girishkumar Aug 30 '19 at 10:54
  • 6
    `torch.norm` is taking 2-norm here, not the square of the 2-norm. So I think the norm should be squared to get a correct regularization. – John Liu Sep 03 '20 at 04:36
  • 3
    without requires_grad and use += would cause error. This works for me: l2_reg = torch.tensor(0., requires_grad=True) l2_reg = l2_reg + torch.norm(param) – cswu Oct 06 '20 at 12:54
  • Warning: `torch.norm` is [deprecated](https://pytorch.org/docs/stable/generated/torch.norm.html). – iacob Apr 01 '22 at 10:57
22

L2 regularization out-of-the-box

Yes, PyTorch optimizers have a parameter called weight_decay, which corresponds to the L2 regularization factor:

sgd = torch.optim.SGD(model.parameters(), weight_decay=weight_decay)

L1 regularization implementation

There is no analogous argument for L1; however, it is straightforward to implement manually:

loss = loss_fn(outputs, labels)
l1_lambda = 0.001
# Sum of absolute values; torch.linalg.norm(p, 1) would give the matrix 1-norm for 2-D tensors
l1_norm = sum(p.abs().sum() for p in model.parameters())

loss = loss + l1_lambda * l1_norm

The equivalent manual implementation of L2 would be:

l2_reg = sum(p.pow(2).sum() for p in model.parameters())

Source: Deep Learning with PyTorch (8.5.2)
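One caveat is worth checking when the L1 penalty is computed with a norm function: for a 2-D tensor, torch.linalg.norm with ord=1 returns the induced matrix 1-norm (the maximum absolute column sum), not the sum of absolute values. A small sketch, using a hand-picked tensor as an illustrative assumption:

```python
import torch

p = torch.tensor([[1.0, -2.0],
                  [3.0, -4.0]])

# Elementwise L1 penalty: 1 + 2 + 3 + 4
elementwise_l1 = p.abs().sum()

# Matrix 1-norm: max over columns of the absolute column sums, max(1 + 3, 2 + 4)
matrix_1_norm = torch.linalg.norm(p, 1)

# vector_norm flattens the tensor first, so it matches the elementwise sum
vector_l1 = torch.linalg.vector_norm(p, ord=1)

print(elementwise_l1.item(), matrix_1_norm.item(), vector_l1.item())
```

For regularization, the elementwise sum (or torch.linalg.vector_norm with ord=1) is usually what is intended, and it also works for parameters with more than two dimensions, such as convolution kernels.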

iacob
18

For L1 regularization that includes only the weights:

l1_reg = torch.tensor(0., requires_grad=True)

for name, param in model.named_parameters():
    if 'weight' in name:
        # Sum of absolute values (torch.linalg.norm(param, 1) is the matrix 1-norm for 2-D tensors)
        l1_reg = l1_reg + param.abs().sum()

total_loss = total_loss + 10e-4 * l1_reg
David
oukohou
  • Warning: `torch.norm` is [deprecated](https://pytorch.org/docs/stable/generated/torch.norm.html). – iacob Apr 01 '22 at 10:51
6

Interestingly, torch.norm is slower on the CPU and faster on the GPU than the direct approach.

import torch
x = torch.randn(1024,100)
y = torch.randn(1024,100)

%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)

Out:

1000 loops, best of 3: 910 µs per loop
1000 loops, best of 3: 1.76 ms per loop

On the other hand:

import torch
x = torch.randn(1024,100).cuda()
y = torch.randn(1024,100).cuda()

%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)

Out:

10000 loops, best of 3: 50 µs per loop
10000 loops, best of 3: 26 µs per loop
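CUDA kernels are launched asynchronously, so GPU timings are only meaningful if the device is synchronized before the timer is started and before it is stopped. The sketch below shows one way to do that; the helper `timed` and the repeat count are illustrative assumptions, and the code falls back to the CPU when no GPU is available:

```python
import time

import torch


def timed(fn, device, repeats=100):
    """Average wall-clock seconds per call of fn (hypothetical helper).

    On CUDA devices, synchronizes before starting and before stopping the timer.
    """
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / repeats


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 100, device=device)
y = torch.randn(1024, 100, device=device)

t_direct = timed(lambda: torch.sqrt((x - y).pow(2).sum(1)), device)
t_norm = timed(lambda: torch.linalg.vector_norm(x - y, ord=2, dim=1), device)
print(f"direct: {t_direct * 1e6:.1f} µs, vector_norm: {t_norm * 1e6:.1f} µs")
```

The sketch uses torch.linalg.vector_norm rather than torch.norm because the latter is deprecated; the numbers it prints depend on hardware and PyTorch version.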
prosti
  • 3
    Confirmed that on my end as well. torch.norm is about 60% slower in this example. – Muppet Jun 02 '19 at 20:05
  • This answer is incorrect, GPU calculations are nonblocking, this means that timeit will not work correctly, because calculations are still in progress on GPU even after CPU (where the timeit happens) takes control. To get the correct timing, you must synchronize before stopping timer. – seermer Jan 31 '23 at 02:44
  • If you time it correctly, you will see torch.norm is almost twice as fast as sqrt approach (by using torch.cuda.synchronize before stop timer) – seermer Jan 31 '23 at 02:46
1

To extend the good answers: as already noted, an L2 penalty added to the loss is equivalent to weight decay if and only if you use plain SGD without momentum. With other optimizers, e.g. Adam, the two are not exactly the same. The AdamW paper [1] pointed out that weight decay is actually more stable. That is why you should use weight decay, which is an option of the optimizer, and consider using AdamW instead of Adam.

Also note, you probably don't want weight decay on all parameters (model.parameters()), but only on a subset. See here for examples:
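One common way to restrict weight decay to a subset is to use optimizer parameter groups. The following is a minimal sketch; the toy model, the rule "exempt every parameter with at most one dimension" (biases and normalization scales), and the hyperparameter values are assumptions for illustration, not a recommendation from the paper:

```python
import torch

# Hypothetical toy model, used only to have some parameters to group
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.LayerNorm(16),
    torch.nn.Linear(16, 1),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Assumption: 1-D parameters (biases, LayerNorm scale/shift) are not decayed
    if param.ndim <= 1:
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},    # illustrative value
        {"params": no_decay, "weight_decay": 0.0},  # excluded from decay
    ],
    lr=1e-3,
)

print(len(decay), len(no_decay))
```

Per-group options override the optimizer-level defaults, so the same pattern works with other optimizers that accept weight_decay.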

[1] Decoupled Weight Decay Regularization (AdamW), 2017

Albert