
I ran into what appears to be a bug in sklearn.linear_model.RandomizedLogisticRegression. It took me a long time to track down, so I'm posting it here in case others hit the same problem.

On perfectly well-formed data, RandomizedLogisticRegression.fit raises "ValueError: The number of classes has to be greater than one."

It turns out that this happens when the input data has fewer than 10 training instances. Fitting on the first 10 rows succeeds, while fitting on the first 9 fails:

>>> sklearn.__version__
'0.15-git'

>>> randomized_logistic.fit(X[0:10, :], y[0:10])
RandomizedLogisticRegression(C=1, fit_intercept=True,
               memory=Memory(cachedir=None), n_jobs=1, n_resampling=200,
               normalize=True, pre_dispatch='3*n_jobs', random_state=None,
               sample_fraction=0.75, scaling=0.5, selection_threshold=0.25,
               tol=0.001, verbose=False)

>>> randomized_logistic.fit(X[0:9, :], y[0:9])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/linear_model/randomized_l1.py", line 109, in fit
    sample_fraction=self.sample_fraction, **params)
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/memory.py", line 281, in __call__
    return self.func(*args, **kwargs)
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/linear_model/randomized_l1.py", line 51, in _resample_model
    for _ in range(n_resampling)):
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 644, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 391, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/linear_model/randomized_l1.py", line 355, in _randomized_logistic
    clf.fit(X, y)
  File "/Users/isaac/Library/Python/2.7/lib/python/site-packages/sklearn/svm/base.py", line 676, in fit
    raise ValueError("The number of classes has to be greater than"
ValueError: The number of classes has to be greater than one.

>>> X
array([[1, 1, 1],
       [2, 1, 0],
       [3, 1, 1],
       [1, 2, 0],
       [2, 2, 1],
       [3, 2, 0],
       [1, 3, 1],
       [2, 3, 0],
       [3, 3, 1],
       [1, 4, 0],
       [2, 4, 1],
       [3, 4, 6]])

>>> y
array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])
  • What happens if you reduce the n_resampling parameter? My guess is that you are unlucky enough to get all classes equal somehow... though I don't know exactly how randomized models are made – Kyle Kastner Jul 22 '14 at 07:44
  • Update: it seems each randomization fits on a random subset of the overall data, so with a very large number of randomizations you may get "unlucky". Try reducing the number of resamplings and setting a random_state so that the execution is deterministic – Kyle Kastner Jul 22 '14 at 09:04
  • Yeah, that is probably the reason. I don't know whether it subsamples using a `StratifiedShuffleSplit` internally, but even if it does, I don't think it is safe against failing to create stratified samples. Two things: 1) this bug report would be better seen and handled if sent to the scikit-learn mailing list, and 2) it is debatable whether there is a real use for randomized l1 methods on 9 samples ... – eickenberg Jul 22 '14 at 20:14
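
To put a rough number on the "unlucky subsample" explanation from the comments, here is a small back-of-the-envelope sketch. It assumes that each resampling keeps every row independently with probability sample_fraction. That is a guess at how the subsampling behaves, not something confirmed from the scikit-learn source, and the helper names p_degenerate_draw and p_any_failure are made up for illustration. Under that assumption it computes the chance that at least one of the n_resampling subsamples contains fewer than two classes, which is the situation that triggers the ValueError.

```python
from collections import Counter


def p_degenerate_draw(y, sample_fraction=0.75):
    """Probability that one random subsample holds fewer than two classes.

    ASSUMPTION (not taken from the scikit-learn source): every row is kept
    independently with probability `sample_fraction`.
    """
    q = 1.0 - sample_fraction      # chance that a given row is dropped
    counts = Counter(y)            # number of rows per class label
    n = len(y)
    p_empty = q ** n               # every single row was dropped
    # Exactly one class survives: all rows of the other classes are dropped
    # and at least one row of the surviving class is kept.
    p_single = sum(q ** (n - c) * (1.0 - q ** c) for c in counts.values())
    return p_empty + p_single


def p_any_failure(y, sample_fraction=0.75, n_resampling=200):
    """Probability that at least one of `n_resampling` independent draws
    is degenerate (fewer than two classes)."""
    p = p_degenerate_draw(y, sample_fraction)
    return 1.0 - (1.0 - p) ** n_resampling


# Labels from the question.
y = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
for n_rows in (9, 10, 12):
    print(n_rows, p_any_failure(y[:n_rows]))
```

With the defaults from the question (sample_fraction=0.75, n_resampling=200), this model gives a failure chance on the order of 13% for the first 9 rows and about 7% for the first 10. If the assumption holds, the 9-versus-10 boundary is a matter of luck, not a hard threshold, and lowering n_resampling or fixing random_state, as suggested above, changes how often the error shows up.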
