101

I have noticed that weight_regularizer is no longer available in Keras and that, in its place, there are activity and kernel regularizers. I would like to know:

  • What are the main differences between kernel and activity regularizers?
  • Could I use activity_regularizer in place of weight_regularizer?
Simone

3 Answers

96

The activity regularizer works as a function of the output of the net, and is mostly used to regularize hidden units, while weight_regularizer, as the name says, works on the weights (e.g. making them decay). Basically, you can express the regularization loss as a function of the output (activity_regularizer) or of the weights (weight_regularizer).

The new kernel_regularizer replaces weight_regularizer - although it's not very clear from the documentation.

From the definition of kernel_regularizer:

kernel_regularizer: Regularizer function applied to the kernel weights matrix (see regularizer).

And activity_regularizer:

activity_regularizer: Regularizer function applied to the output of the layer (its "activation"). (see regularizer).
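Concretely, the two regularizers compute their penalty from different tensors. Here is a minimal NumPy sketch of what each one adds to the loss; the shapes and the 1e-4 factor are illustrative, not taken from the Keras docs:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # the layer's kernel (weights matrix)
x = rng.normal(size=(5, 3))   # a batch of 5 inputs
a = x @ W                     # the layer's output (its "activation", linear here)

factor = 1e-4
kernel_penalty = factor * np.sum(W ** 2)        # kernel_regularizer: a function of the weights
activity_penalty = factor * np.sum(np.abs(a))   # activity_regularizer: a function of the outputs
```

Note that the kernel penalty is independent of the data, while the activity penalty changes with every batch, since it depends on the layer's outputs.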

Important Edit: Note that there is a bug in activity_regularizer that was only fixed in version 2.1.4 of Keras (at least with the TensorFlow backend). In older versions, the activity regularizer function is applied to the input of the layer, instead of being applied to the output (the actual activations of the layer, as intended). So beware: if you are using an older version of Keras (before 2.1.4), activity regularization will probably not work as intended.

You can see the commit on GitHub: François Chollet provided a fix to the activity regularizer, which was then included in Keras 2.1.4.

Michele Tonutti
  • Are you completely sure that `kernel_regularizer` replaces `weight_regularizer`? – Simone Jun 12 '17 at 11:07
  • I find many examples using kernel_regularizer, but not activity_regularizer. Can you comment on the use cases for activity_regularizer? – Milad M Apr 25 '18 at 07:06
  • Why would you want to regularize the output of hidden layers? Is it for the same reason we normalize our inputs to the range (-1, 1) or (0, 1), that is, to keep the inputs to subsequent layers smaller to aid the SGD process? – Nagabhushan Baddi Oct 14 '18 at 13:06
  • @NagabhushanBaddi see this answer: https://datascience.stackexchange.com/a/15195/32811 – Michele Tonutti Oct 15 '18 at 08:12
  • Where did you find the documentation on either kernel_regularizer or activity_regularizer? The supposed "documentation" at keras.io says nothing whatsoever about what they are, and the help() within Python is useless, making vague references to being a subclass or instance of other things. – Finncent Price Dec 05 '18 at 09:54
  • @FinncentPrice I can only assume it used to be there and now it isn't anymore – Michele Tonutti Dec 05 '18 at 15:36
77

This answer is a bit late, but may be useful for future readers. As they say, necessity is the mother of invention: I only understood this distinction when I needed it.
The answer above doesn't really state the difference, because both regularizers end up affecting the weights. So what is the difference between penalizing the weights themselves and penalizing the output of the layer?

Here is the answer: I encountered a case where the weights of the net were small and well-behaved, ranging between -0.3 and +0.3. There was nothing wrong with them, so I really couldn't penalize them; a kernel regularizer was useless. However, the output of the layer was HUGE, in the hundreds.

Keep in mind that the input to the layer was also small, always less than one. But those small values interacted with the weights in such a way that they produced massive outputs. That is when I realized that what I needed was an activity regularizer, rather than a kernel regularizer. With it, I penalize the layer for those large outputs; I don't care whether the weights themselves are small, I just want to deter the layer from reaching such a state, because it saturates my sigmoid activation and causes tons of other trouble, like vanishing gradients and stagnation.
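This situation is easy to reproduce: with a large enough fan-in, a unit whose weights are all small and whose inputs are all below one can still produce a pre-activation far above 100. A NumPy sketch, where the sizes and ranges are made up to mirror the case above:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=10_000)   # inputs: always less than one
w = rng.uniform(0.1, 0.3, size=10_000)   # weights: small and "nice"

pre_activation = x @ w   # a single unit's output, before the sigmoid
# Despite small weights and small inputs, pre_activation is far above 100,
# so sigmoid(pre_activation) is fully saturated and its gradient vanishes.
```

A kernel penalty on `w` would barely register here, while an activity penalty on `pre_activation` would push the training directly away from the saturated regime.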

Alex Deft
2

The answers above are helpful. I want to add another use case that I have encountered and used activity regularization for.

The use-case is in the context of autoencoders, where you want to limit the size of the encodings to a small number (say 10 units, for example).

One way of forcing an encoding size of 10 units is to simply make an autoencoder whose coding layer is limited to 10 units. This is a hard limit, but has some drawbacks.

An alternative way of getting small-sized encodings is to impose a 'soft limit' using activity regularisation. You create a large coding layer (e.g. 100 units), but you penalize the net whenever it activates more than 10 units from the 100 available units. You don't care about the weights/kernel of the encoding layer (hence no kernel regularisation); you instead penalise the output/activity of that layer such that you push it towards more sparse outputs/encodings. This results in a "sparse autoencoder" as you've encouraged the net to only use 10% of its available coding size on average.

Keras example of this use case is here: https://github.com/ageron/handson-ml3/blob/main/17_autoencoders_gans_and_diffusion_models.ipynb and the book goes into more detail.

A sketch of my training loop in PyTorch (note the coding-layer activations are computed once and reused for both the reconstruction and the sparsity penalty):

autoencoder = torch.nn.Sequential(encoder, decoder)

for epoch in range(n_epochs):
    for X_in in train_data:
        codings = encoder(X_in)     # activations of the coding layer
        X_recon = decoder(codings)

        # For a basic autoencoder, the loss is simply MSE:
        mse_loss = torch.nn.MSELoss()(X_recon, X_in)

        # For a sparse autoencoder using activity regularisation,
        # add an L1 loss term penalising too many active outputs:
        activity_regularisation_l1_loss = codings.abs().sum()

        total_loss = mse_loss + 1e-3 * activity_regularisation_l1_loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
  
some3128