If the normalize parameter is set to True in any of the linear models in sklearn.linear_model, is the normalization also applied during the score step?

For example:

from sklearn import linear_model
from sklearn.datasets import load_boston

a = load_boston()

l = linear_model.ElasticNet(normalize=False)
l.fit(a["data"][:400], a["target"][:400])            # train on the first 400 rows
print(l.score(a["data"][400:], a["target"][400:]))   # score on the remaining rows
# 0.24192774524694727

l = linear_model.ElasticNet(normalize=True)
l.fit(a["data"][:400], a["target"][:400])
print(l.score(a["data"][400:], a["target"][400:]))
# -2.6177006348389167

In this case we see a degradation in predictive power when we set normalize=True, and I can't tell whether this is simply an artifact of the score function not applying the normalization, or whether the normalized values genuinely caused the model's performance to drop.

mgoldwasser
  • IIRC this option is deprecated and normalization should be done with the tools in `sklearn.preprocessing`, e.g. `sklearn.preprocessing.StandardScaler` or `sklearn.preprocessing.Normalizer` – eickenberg Oct 24 '15 at 10:12
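
For reference, a minimal sketch of what this comment recommends: scale the features with a StandardScaler inside a Pipeline so that fit, predict and score all see identically transformed data. This assumes a scikit-learn version that still ships load_boston, and it mirrors the 400-row split from the question:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.datasets import load_boston

a = load_boston()

# The scaler is fitted on the training rows only; score() on the test rows
# reuses that same scaling before the data reaches the ElasticNet.
model = make_pipeline(StandardScaler(), linear_model.ElasticNet())
model.fit(a["data"][:400], a["target"][:400])
print(model.score(a["data"][400:], a["target"][400:]))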

2 Answers


The normalization is indeed applied to both the fit data and the predict data. The reason you see such different results is that the scales of the columns in the Boston house price dataset vary widely:

>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> boston.data.std(0)
array([  8.58828355e+00,   2.32993957e+01,   6.85357058e+00,
         2.53742935e-01,   1.15763115e-01,   7.01922514e-01,
         2.81210326e+01,   2.10362836e+00,   8.69865112e+00,
         1.68370495e+02,   2.16280519e+00,   9.12046075e+01,
         7.13400164e+00])

This means that the regularization terms in the ElasticNet have a very different effect on normalized vs unnormalized data, and this is why the results differ. You can confirm this by setting the regularization strength (alpha) to a very small number, e.g. 1E-8. In this case, regularization has very little effect and the normalization no longer affects prediction results.
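
For instance, a quick sketch of that check (assuming a scikit-learn version that still accepts the normalize keyword and still ships load_boston):

from sklearn import linear_model
from sklearn.datasets import load_boston

a = load_boston()

# With alpha this small the penalty term is negligible, so the scores with
# and without normalization should come out nearly identical.
for norm in (False, True):
    l = linear_model.ElasticNet(alpha=1e-8, normalize=norm)
    l.fit(a["data"][:400], a["target"][:400])
    print(norm, l.score(a["data"][400:], a["target"][400:]))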

jakevdp

@jakevdp already answered this question correctly, but for those interested, here's proof that the normalization is indeed being applied:

from sklearn.preprocessing import Normalizer
from sklearn import linear_model
from sklearn.datasets import load_boston

a = load_boston()

# Normalizer is stateless, so fit_transform on the training rows and
# transform on the test rows apply the same row-wise scaling.
n = Normalizer()

a["data"][:400] = n.fit_transform(a["data"][:400])
a["data"][400:] = n.transform(a["data"][400:])

l = linear_model.ElasticNet(normalize=False)
l.fit(a["data"][:400], a["target"][:400])
print(l.score(a["data"][400:], a["target"][400:]))
# -2.61770063484

From the example in my original question, you can see that the model fitted to pre-normalized data has the same score as the model with normalize=True (both score -2.61770063484).

mgoldwasser