Helpful answer above. I want to add another use-case I have encountered and used this technique for.
The use-case is in the context of autoencoders, where you want to limit the size of the encodings (say, to 10 units).
One way of forcing an encoding size of 10 units is simply to build an autoencoder whose coding layer has exactly 10 units. This is a hard limit, but it has drawbacks: the code size is baked into the architecture, and every input must be squeezed through the same tight bottleneck.
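As a concrete sketch of the hard-limit option (the 784-dim inputs and the specific layers are my own illustrative assumptions, not from any particular source):

```python
import torch

# hard limit: the architecture itself caps the encoding at 10 units
encoder = torch.nn.Sequential(torch.nn.Linear(784, 10), torch.nn.ReLU())
decoder = torch.nn.Linear(10, 784)

x = torch.rand(32, 784)           # a dummy batch of 784-dim inputs
codings = encoder(x)
assert codings.shape == (32, 10)  # every input passes through 10 units
```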
An alternative way of getting small encodings is to impose a 'soft limit' using activity regularisation. You create a large coding layer (e.g. 100 units), but you penalise the net whenever it activates more than ~10 of the 100 available units. You don't care about the weights/kernel of the coding layer (hence no kernel regularisation); instead you penalise the output/activity of that layer, pushing it towards sparser outputs/encodings. The result is a "sparse autoencoder": the net is encouraged to use only about 10% of its available coding capacity on average.
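To see why an L1 penalty on the activations acts as a soft limit: the gradient of |x| is sign(x), so every active unit gets the same constant pull towards zero until the reconstruction loss pushes back, leaving only the most useful units active. A minimal sketch (the batch size and 100-unit coding size are illustrative):

```python
import torch

# hypothetical batch of 32 strictly positive codings from a 100-unit layer
codings = (torch.rand(32, 100) + 0.1).requires_grad_(True)

# activity regularisation: an L1 penalty on the layer's *outputs*
l1_penalty = codings.abs().sum()
l1_penalty.backward()

# d|x|/dx = sign(x): every positive activation gets the same
# constant pull towards zero, regardless of its magnitude
assert bool((codings.grad == 1).all())
```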
A Keras example of this use-case is in Aurélien Géron's notebook: https://github.com/ageron/handson-ml3/blob/main/17_autoencoders_gans_and_diffusion_models.ipynb and the accompanying book goes into more detail.
A sketch of my training loop in PyTorch (computing the codings once and decoding them avoids a second forward pass through the encoder):

```python
autoencoder = torch.nn.Sequential(encoder, decoder)

for epoch in range(n_epochs):
    for X_in in train_data:
        codings = encoder(X_in)    # activations of the coding layer
        X_recon = decoder(codings)

        # For a basic autoencoder, the loss is simply MSE:
        mse_loss = torch.nn.functional.mse_loss(X_recon, X_in)

        # For a sparse autoencoder using activity regularisation,
        # add an L1 loss term penalising too many active outputs:
        activity_regularisation_l1_loss = codings.abs().sum()
        total_loss = mse_loss + 1e-3 * activity_regularisation_l1_loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
```
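For completeness, the sketch above can be made self-contained and runnable; the layer sizes, toy data, Adam optimiser and hyperparameters below are illustrative choices of mine, not part of the original recipe:

```python
import torch

torch.manual_seed(0)

# illustrative: 64-dim inputs, a deliberately large 100-unit coding layer
encoder = torch.nn.Sequential(torch.nn.Linear(64, 100), torch.nn.Sigmoid())
decoder = torch.nn.Linear(100, 64)
autoencoder = torch.nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-2)

train_data = [torch.rand(32, 64) for _ in range(8)]  # toy batches

for epoch in range(25):
    for X_in in train_data:
        codings = encoder(X_in)
        X_recon = decoder(codings)
        mse_loss = torch.nn.functional.mse_loss(X_recon, X_in)
        total_loss = mse_loss + 1e-3 * codings.abs().sum()
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

# with the penalty on, coding units are typically driven towards zero
sparsity = (encoder(train_data[0]) < 0.1).float().mean().item()
print(f"fraction of near-zero codings: {sparsity:.2f}")
```

How sparse the codings end up depends on the regularisation weight (1e-3 here): raise it and more units are silenced at the cost of reconstruction quality.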