
I have a matrix where each column has mean 0 and std 1

In [67]: x_val.std(axis=0).min()
Out[67]: 0.99999999999999922

In [71]: x_val.std(axis=0).max()
Out[71]: 1.0000000000000007

In [72]: x_val.mean(axis=0).max()
Out[72]: 1.1990408665951691e-16

In [73]: x_val.mean(axis=0).min()
Out[73]: -9.7144514654701197e-17

The number of non-zero coefficients changes if I use the normalize option:

In [74]: l = Lasso(alpha=alpha_perc70).fit(x_val, y_val)

In [81]: sum(l.coef_!=0)
Out[81]: 47

In [84]: l2 = Lasso(alpha=alpha_perc70, normalize=True).fit(x_val, y_val)

In [93]: sum(l2.coef_!=0)
Out[93]: 3

It seems to me that normalize just sets the variance of each column to 1. It is strange that the results change so much, given that my data already has variance 1.

So what does normalize=True actually do?

Donbeo
  • Comparing floats using `==` isn't a good idea. It's better to check the absolute difference from 0 against a very small number, like 1e-10. Can you perform the same experiment with `abs`? – Artem Sobolev Jun 07 '14 at 20:46
  • The problem is real, see my example below. I am not completely sure where exactly it comes from, though. This is intriguing - especially because I just found out that there must be some extra rescaling going on somewhere, since `alpha_max = np.abs(X.T.dot(y)).max()` is not a tight lower bound on the zero solution penalty set. – eickenberg Jun 07 '14 at 20:49
  • @eickenberg Can you confirm that lasso is working correctly if I set normalize=False? – Donbeo Jun 08 '14 at 02:52
  • I updated my answer - it's a scaling problem. I will elaborate it a little more later. – eickenberg Jun 08 '14 at 07:01
  • Yes, lasso works just fine. Stick to `normalize=False` and use the `StandardScaler` in a `Pipeline` as described below. Note that the maximal penalty formula also contains a scaling term (described in note below). – eickenberg Jun 08 '14 at 11:00

2 Answers


This is due to an (or a potential [1]) inconsistency in the concept of scaling in sklearn.linear_model.base.center_data: if normalize=True, then it divides by the L2 norm of each column of the design matrix, not by the standard deviation. For what it's worth, the normalize=True keyword will be deprecated from sklearn version 0.17.
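To see the difference, compare the column L2 norms with the column standard deviations on already standardized data (a minimal sketch; for a zero-mean column the L2 norm equals std * sqrt(n_samples), not 1):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 5)
X = (X - X.mean(0)) / X.std(0)        # columns now have mean 0, std 1

print(X.std(axis=0))                  # all 1
print(np.sqrt((X ** 2).sum(axis=0)))  # all sqrt(20), not 1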

Solution: Do not use normalize=True. Instead, build a sklearn.pipeline.Pipeline and prepend a sklearn.preprocessing.StandardScaler to your Lasso object. That way you don't even need to perform your initial scaling.
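A sketch of such a pipeline, reusing the x_val, y_val and alpha_perc70 names from the question:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('lasso', Lasso(alpha=alpha_perc70))])
pipeline.fit(x_val, y_val)
print(sum(pipeline.named_steps['lasso'].coef_ != 0))

The scaler standardizes the data inside the pipeline, so the penalty acts on coefficients in standardized units regardless of the input scale.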

Note that the data loss term in the sklearn implementation of Lasso is scaled by n_samples. Thus the minimal penalty yielding a zero solution is alpha_max = np.abs(X.T.dot(y)).max() / n_samples (for normalize=False).
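A quick check of this threshold, assuming the X and y constructed in the reproduction snippet further down:

from sklearn.linear_model import Lasso

n_samples = X.shape[0]
alpha_max = np.abs(X.T.dot(y)).max() / n_samples
print(Lasso(alpha=alpha_max).fit(X, y).coef_)  # all coefficients are (numerically) zero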

[1] I say potential inconsistency, because normalize is associated to the word norm and thus at least linguistically consistent :)

[Stop reading here if you don't want the details]

Here is some copy-and-pasteable code reproducing the problem:

import numpy as np
rng = np.random.RandomState(42)

n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)  # standardize columns to mean 0, std 1

beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.  # sparse ground truth

y = X.dot(beta)

print(X.std(0))   # all exactly 1
print(X.mean(0))  # all numerically 0

from sklearn.linear_model import Lasso

lasso1 = Lasso(alpha=.1)
print(lasso1.fit(X, y).coef_)

lasso2 = Lasso(alpha=.1, normalize=True)
print(lasso2.fit(X, y).coef_)  # far more zeros than lasso1

To understand what is going on, observe that

lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)

is equal to

lasso2.fit(X, y).coef_

Hence, scaling the design matrix and appropriately rescaling the coefficients by np.sqrt(n_samples) converts one model into the other. The same can be achieved by acting on the penalty: a lasso estimator with normalize=True whose penalty is scaled down by np.sqrt(n_samples) behaves like a lasso estimator with normalize=False (on your type of data, i.e. already standardized to std=1).

lasso3 = Lasso(alpha=.1 / np.sqrt(n_samples), normalize=True)
print(lasso3.fit(X, y).coef_)  # yields the same coefficients as lasso1.fit(X, y).coef_
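
Both equivalences can be checked numerically (up to the solver's tolerance):

coef_scaled = lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)
print(np.allclose(coef_scaled, lasso2.fit(X, y).coef_, atol=1e-5))              # True
print(np.allclose(lasso3.fit(X, y).coef_, lasso1.fit(X, y).coef_, atol=1e-5))   # True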
eickenberg

I think the top answer is wrong...

In Lasso, if you set normalize=True, every column is divided by its L2 norm (i.e., its standard deviation times sqrt(n), for a zero-mean column) before the lasso regression is fit. The magnitude of the design matrix is thus reduced, so the "expected" coefficients have to be enlarged to compensate. The larger the coefficients, the stronger the L1 penalty becomes. So the objective has to pay more attention to the L1 penalty and drives more features to 0. You will see more sparse coefficients (β = 0) as a result, as the sketch below illustrates.
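
A minimal sketch of this effect; the data here is made up for illustration:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
X = (X - X.mean(0)) / X.std(0)    # standardized columns: mean 0, std 1
y = X.dot(rng.randn(10))

print(np.linalg.norm(X, axis=0))  # column norms: sd * sqrt(n) = sqrt(50), not 1

print(sum(Lasso(alpha=.1).fit(X, y).coef_ != 0))
print(sum(Lasso(alpha=.1, normalize=True).fit(X, y).coef_ != 0))  # typically fewer non-zeros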