Lack of Sparse Solution with L1 Regularization in Pytorch

Question

I am trying to apply L1 regularization to the first layer of a simple neural network (1 hidden layer). I looked at some other posts on StackOverflow that apply L1 regularization in PyTorch to figure out how it should be done (references: Adding L1/L2 regularization in PyTorch?, In Pytorch, how to add L1 regularizer to activations?). No matter how much I increase lambda (the L1 regularization strength parameter), I do not get true zeros in the first weight matrix. Why would this be? (Code is below.)

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class Network(nn.Module):
    def __init__(self,nf,nh,nc):
        super(Network,self).__init__()
        self.lin1=nn.Linear(nf,nh)
        self.lin2=nn.Linear(nh,nc)

    def forward(self,x):
        l1out=F.relu(self.lin1(x))
        out=F.softmax(self.lin2(l1out),dim=-1)  # explicit dim; the implicit choice is deprecated
        return out, l1out

def l1loss(layer):
    return torch.norm(layer.weight.data, p=1)

nf=10
nc=2
nh=6
learningrate=0.02
lmbda=10.
batchsize=50

net=Network(nf,nh,nc)

crit=nn.MSELoss()
optimizer=torch.optim.Adagrad(net.parameters(),lr=learningrate)


xtr=torch.Tensor(xtr)
ytr=torch.Tensor(ytr)
#ytr=torch.LongTensor(ytr)
xte=torch.Tensor(xte)
yte=torch.LongTensor(yte)
#cyte=torch.Tensor(yte)

it=200
for epoch in range(it):
    per=torch.randperm(len(xtr))
    for i in range(0,len(xtr),batchsize):
        ind=per[i:i+batchsize]
        bx,by=xtr[ind],ytr[ind]            
        optimizer.zero_grad()
        output, l1out=net(bx)
#        l1reg=l1loss(net.lin1)    
        loss=crit(output,by)+lmbda*l1loss(net.lin1)
        loss.backward()
        optimizer.step()
    print('Epoch [%i/%i], Loss: %.4f' %(epoch+1,it, np.float32(loss.data.numpy())))

corr=0
tot=0
for x,y in list(zip(xte,yte)):
    output,_=net(x)
    _,pred=torch.max(output,-1)
    tot+=1 #y.size(0)
    corr+=(pred==y).sum()
print(corr)

Note: The data has 10 features, 2 classes, and 800 training samples. Only the first 2 features are relevant (by design), so one would assume true zeros should be easy enough to learn.

score 7 · Accepted Answer · answered Apr 27 '18 at 03:24

Your usage of layer.weight.data removes the parameter (which is a PyTorch variable) from its automatic differentiation context, so the L1 term is treated as a constant when the gradients are computed. The penalty still adds to the loss value, but its gradient with respect to the weights is zero, so it never pushes the weights toward zero.
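
The detachment can be checked directly. The following is a minimal sketch, not the asker's code: `layer`, `penalty`, and `data_loss` are illustrative names, and the squared-output loss is only a stand-in for the real training loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 6)  # same shape as lin1 in the question

# The parameter itself is tracked by autograd.
print(layer.weight.requires_grad)

# .data gives a tensor that shares storage but is cut off from the graph,
# so the norm computed from it is a plain constant.
penalty = torch.norm(layer.weight.data, p=1)
print(penalty.requires_grad)

# Stand-in for the data term of the loss (assumption for illustration).
data_loss = layer(torch.randn(4, 10)).pow(2).mean()

# Gradient w.r.t. the weights with and without the detached penalty.
grad_without = torch.autograd.grad(data_loss, layer.weight, retain_graph=True)[0]
grad_with = torch.autograd.grad(data_loss + 10.0 * penalty, layer.weight)[0]

# The two gradients agree: the penalty contributes nothing to the update.
print(torch.allclose(grad_without, grad_with))
```

Because the two gradients agree, raising lambda only changes the printed loss value and leaves the weight updates as they were.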

If you remove the .data, the norm is computed on the PyTorch variable itself and the gradients should be correct.
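
A minimal sketch of the corrected helper, assuming the same `l1loss` signature as in the question. The standalone `lin1` layer and the gradient check are added here only for illustration.

```python
import torch
import torch.nn as nn

def l1loss(layer):
    # Use the parameter directly (no .data) so the penalty stays in the graph.
    return torch.norm(layer.weight, p=1)

torch.manual_seed(0)
lin1 = nn.Linear(10, 6)  # stand-in for net.lin1

reg = l1loss(lin1)
reg.backward()

# For nonzero weights, the gradient of sum(|w|) is sign(w).
grad_matches_sign = torch.allclose(lin1.weight.grad, torch.sign(lin1.weight.detach()))
print(reg.requires_grad, grad_matches_sign)
```

With this version, `lmbda*l1loss(net.lin1)` in the training loop contributes `lmbda*sign(w)` to the gradient of each first-layer weight.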

For more information on PyTorch's automatic differentiation mechanics, see this docs article or this tutorial.

Pim

I was able to get zeros by correcting that. Not in the places I would think though. Oh well. Seems like I just am not going to get results similar to what I would get with l1 regularized logistic regression. – cyradil Apr 27 '18 at 13:32
