
This is a scikit-learn error that I get when I do

my_estimator = LassoLarsCV(fit_intercept=False, normalize=False, positive=True, max_n_alphas=1e5)

Note that if I decrease max_n_alphas from 1e5 down to 1e4, I no longer get this error.

Does anyone have an idea what's going on?

The error happens when I call

my_estimator.fit(x, y)

I have 40k data points in 40 dimensions.

The full stack trace looks like this:

  File "/usr/lib64/python2.7/site-packages/sklearn/linear_model/least_angle.py", line 1113, in fit
    axis=0)(all_alphas)
  File "/usr/lib64/python2.7/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
    y = self._evaluate(x)
  File "/usr/lib64/python2.7/site-packages/scipy/interpolate/interpolate.py", line 498, in _evaluate
    out_of_bounds = self._check_bounds(x_new)
  File "/usr/lib64/python2.7/site-packages/scipy/interpolate/interpolate.py", line 525, in _check_bounds
    raise ValueError("A value in x_new is below the interpolation "
ValueError: A value in x_new is below the interpolation range.

  • When I run `from sklearn.linear_model import LassoLarsCV` followed by your line of code, I get no error. Please provide enough code to reproduce the error you are getting, as well as the full traceback message. – Tadhg McDonald-Jensen Mar 30 '16 at 22:31
  • The error does not occur on that line, but when I call .fit(). Unfortunately, it is hard to reproduce here; my data set has 40k points. – Baron Yugovich Mar 31 '16 at 15:38
  • The interpolators in scipy often require that the `x` values are monotonically increasing. Is `x` monotonically increasing for your dataset? If not, try sorting the dataset with `x` as the key and try again. If it works, let me know and I'll add a proper answer for the bounty :) – J Richard Snape Apr 03 '16 at 23:50
  • Hmm, looking into this: whilst that might be the case at the point where the code fails, it doesn't really make sense from where you call `fit`, as I'm guessing `x` is a 40000 x 40 matrix? – J Richard Snape Apr 04 '16 at 00:07
  • @BaronYugovich: Could you please upload your data somewhere? – Alex I Apr 04 '16 at 10:22
  • If there wasn't a bounty, I'd vote to close as lacking a [mcve]. – ivan_pozdeev Apr 04 '16 at 11:09
  • Well, apologies for the "ridiculous suggestion", but you'll note that the bit actually throwing the error is `interpolate.py` in the `scipy` package, which does have those requirements (see the sketch just after this thread). However, I'm not really minded to track it further if you won't put up data to reproduce and think it's a good idea to suggest that people offering free help are being ridiculous. – J Richard Snape Apr 04 '16 at 23:32
  • In addition: to ping people, you need to omit the space from their user name, and your assertion that the problem is not data-related does not seem to be backed by any evidence. I agree the `1e4` vs `1e5` difference is interesting, but we need a dataset to replicate it and track it down; it doesn't happen with all data (as the existing answer shows). – J Richard Snape Apr 04 '16 at 23:35
  • Same here: using LassoLarsCV gives me the same error. My data set is smaller, but it's the same issue. Did you find a solution to your problem? Is it a problem with the scipy library? [link](https://github.com/scipy/scipy/issues/2283) – Pablo Aug 22 '17 at 19:52
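
For reference, the exception in the traceback comes straight from scipy: interp1d refuses to evaluate at points outside the range of the x values it was built from. Below is a minimal sketch of that behaviour using only scipy's public interp1d API; it illustrates the error message, not the asker's code.

import numpy
from scipy import interpolate

# build a 1-D interpolator over x values in [1.0, 5.0]
x_known = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_known = x_known ** 2
f = interpolate.interp1d(x_known, y_known, axis=0)

print f(2.5)   # inside the interpolation range: fine
f(0.5)         # below x_known.min(): raises
               # ValueError: A value in x_new is below the interpolation range.

Judging from the traceback (the call `axis=0)(all_alphas)` in least_angle.py), LassoLarsCV interpolates each cross-validation fold's residual path onto a shared grid of alphas, so an error like this presumably means that the shared grid extends below the alpha range of at least one fold.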

1 Answer


There must be something particular to your data. LassoLarsCV() seems to be working correctly with this synthetic example of fairly well-behaved data:

import numpy
import sklearn.linear_model

# create 40000 x 40 sample data from linear model with a bit of noise
npoints = 40000
ndims = 40
numpy.random.seed(1)
X = numpy.random.random((npoints, ndims))
w = numpy.random.random(ndims)
y = X.dot(w) + numpy.random.random(npoints) * 0.1

clf = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False, max_n_alphas=1e6)
clf.fit(X, y)

# coefficients are almost exactly recovered, this prints 0.00377
print max(abs( clf.coef_ - w ))

# the number of alphas actually used is 41, i.e. ndims + 1
print clf.alphas_.shape

This is with sklearn 0.16; I don't have the positive=True option there.

I'm not sure why you would want to use a very large max_n_alphas anyway. While I don't know why 1e+4 works and 1e+5 doesn't in your case, I suspect that the paths you get from max_n_alphas=ndims+1 and max_n_alphas=1e+4 (or whatever) would be identical for well-behaved data, and that the optimal alpha estimated by cross-validation in clf.alpha_ would be identical as well. Check out the Lasso path using LARS example for what alpha is trying to do.
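
A quick way to check that suspicion on the synthetic data, reusing X, y and ndims from the example above (a sketch on well-behaved data; it says nothing about the asker's data set):

# compare the cross-validated alpha_ across several max_n_alphas settings
for n in (ndims + 1, int(1e4), int(1e6)):
    clf_n = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False, max_n_alphas=n)
    clf_n.fit(X, y)
    print n, clf_n.alpha_, clf_n.alphas_.shape

# for well-behaved data, all three settings should report the same alpha_
# and a path of the same size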

Also, from the LassoLars documentation:

alphas_ array, shape (n_alphas + 1,)

Maximum of covariances (in absolute value) at each iteration. n_alphas is either max_iter, n_features, or the number of nodes in the path with correlation greater than alpha, whichever is smaller.

So it makes sense that we end up with alphas_ of size ndims+1 (i.e. n_features+1) above.

P.S. Tested with sklearn 0.17.1 and positive=True as well, and also with a mix of positive and negative coefficients; same result: alphas_ is ndims+1 or less.
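
That check was along these lines (a sketch; positive=True requires sklearn >= 0.17, and it reuses X, y, w from the example above, whose true coefficients are all non-negative):

# same synthetic data, coefficients constrained to be non-negative
clf_pos = sklearn.linear_model.LassoLarsCV(fit_intercept=False, normalize=False, positive=True, max_n_alphas=1e6)
clf_pos.fit(X, y)

print max(abs(clf_pos.coef_ - w))  # coefficients still recovered closely
print clf_pos.alphas_.shape        # again ndims+1 or fewer alphas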

  • It has nothing to do with the data: on the same data set, decreasing max_n_alphas as described above makes the problem disappear. The error happens when generating the alphas, not when dealing with the data set. – Baron Yugovich Apr 04 '16 at 13:39
  • @BaronYugovich As the code shows, with a different data set of the same dimensions and a huge max_n_alphas there is no problem. Why do you think the problem is not data-related? Please post a complete runnable example that reproduces your problem. Thanks :) – Alex I Apr 04 '16 at 18:13
  • Makes sense. Out of curiosity, with your experiment on random data, what do you get with orthogonal matching pursuit? http://stackoverflow.com/questions/36287045/orthogonal-matching-pursuit-regression-am-i-using-it-wrong?noredirect=1#comment60438035_36287045 – Baron Yugovich Apr 05 '16 at 16:32
  • @BaronYugovich Does this address your question? I believe what you have found is indeed a sklearn bug, but it is very hard to reproduce without your data. Most importantly, it makes no difference to the results you get: use any max_n_alphas > 40 and you'll get the same results, as long as it doesn't crash. If you are satisfied, please remember to award the bounty (and accept the answer). – Alex I Apr 08 '16 at 06:46