
I tried to use GradientBoostingClassifier in scikit-learn, and it works fine with its default parameters. However, when I tried to replace the BaseEstimator (the init estimator) with a different classifier, it did not work and gave me the following error:

return y - np.nan_to_num(np.exp(pred[:, k] -
IndexError: too many indices

Do you have any solution for this problem?

The error can be reproduced with the following snippet:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

mnist = datasets.fetch_mldata('MNIST original')
X, y = shuffle(mnist.data, mnist.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.01)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

### works fine when init is None
clf_init = None
print 'Train with clf_init = None'
clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
                             n_estimators=5, subsample=0.3,
                             min_samples_split=2,
                             min_samples_leaf=1,
                             max_depth=3,
                             init=clf_init,
                             random_state=None,
                             max_features=None,
                             verbose=2)
clf.fit(X_train, y_train)
print 'Train with clf_init = None is done :-)'

print 'Train LogisticRegression()'
clf_init = LogisticRegression()
clf_init.fit(X_train, y_train)
print 'Train LogisticRegression() is done'

print 'Train with clf_init = LogisticRegression()'
clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
                             n_estimators=5, subsample=0.3,
                             min_samples_split=2,
                             min_samples_leaf=1,
                             max_depth=3,
                             init=clf_init,
                             random_state=None,
                             max_features=None,
                             verbose=2)
clf.fit(X_train, y_train) # <------ ERROR!!!!
print 'Train with clf_init = LogisticRegression() is done'

Here is the complete traceback of the error:

Traceback (most recent call last):
File "/home/mohsena/Dropbox/programing/gbm/gb_with_init.py", line 56, in <module>
   clf.fit(X_train, y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/gradient_boosting.py", line 862, in fit
   return super(GradientBoostingClassifier, self).fit(X, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/gradient_boosting.py", line 614, in fit random_state)
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/gradient_boosting.py", line 475, in _fit_stage
   residual = loss.negative_gradient(y, y_pred, k=k)
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/gradient_boosting.py", line 404, in negative_gradient
   return y - np.nan_to_num(np.exp(pred[:, k] -
   IndexError: too many indices
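
As far as I can tell from the traceback, the failure boils down to an array-shape mismatch: negative_gradient indexes the init estimator's predictions as pred[:, k], i.e. it expects a 2-D (n_samples, n_classes) array, whereas a classifier's predict returns a 1-D array. A minimal sketch of my own that reproduces the same IndexError:

import numpy as np

pred = np.zeros(5)   # 1-D, like the output of a classifier's predict
try:
    pred[:, 0]       # gradient boosting indexes predictions as pred[:, k]
except IndexError as e:
    print(e)         # "too many indices" on a 1-D array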
– iampat

4 Answers


An improved version of iampat's answer, combined with a slight modification of the scikit-learn developers' suggestion, should do the trick:

import numpy

class init:
    """Adaptor so that a classifier can serve as the init estimator."""
    def __init__(self, est):
        self.est = est
    def predict(self, X):
        # Return a 2-D column so that the pred[:, k] indexing works
        return self.est.predict_proba(X)[:, 1][:, numpy.newaxis]
    def fit(self, X, y):
        self.est.fit(X, y)
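
A minimal usage sketch of my own (assuming a binary target, since the adaptor only forwards the positive-class probability column; note that GradientBoostingClassifier calls the adaptor's fit itself, so pre-fitting is not needed):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=13)
clf = GradientBoostingClassifier(n_estimators=5, init=init(LogisticRegression()))
clf.fit(X, y)           # fits the wrapped LogisticRegression internally, then boosts
print(clf.score(X, y))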
– Santosh

As suggested by the scikit-learn developers, the problem can be solved by using an adaptor class like this:

class init:
    def __init__(self, est):
        self.est = est
    def predict(self, X):
        # Forward the positive-class probability as the initial prediction
        return self.est.predict_proba(X)[:, 1]
    def fit(self, X, y):
        self.est.fit(X, y)
– iampat
• Hi, I am facing a very similar error with GBC and LR: y_pred[:, k] += learning_rate * tree.predict(X).ravel() raises IndexError: too many indices. I tried to use your adaptor idea, but to no avail; the error remains. Do you have any ideas how to resolve this? – abalogh Oct 22 '13 at 10:04

Here is a complete and, in my opinion, simpler version of iampat's code snippet.

import numpy
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

class RandomForestClassifier_compatibility(RandomForestClassifier):
    def predict(self, X):
        # Return the positive-class probability as a 2-D column
        return self.predict_proba(X)[:, 1][:, numpy.newaxis]

base_estimator = RandomForestClassifier_compatibility()
classifier = GradientBoostingClassifier(init=base_estimator)
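
A minimal usage sketch (my own addition, assuming a binary target; GradientBoostingClassifier fits the init forest internally before boosting):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)
classifier.fit(X, y)
print(classifier.score(X, y))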
– Framester

Gradient Boosting generally requires the base learner to be an algorithm that performs numeric prediction, not classification. I assume that is your issue.
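
For what it's worth, a quick way to see this in scikit-learn (a sketch of my own, not from the original answer): even when used for classification, gradient boosting fits regression trees at every stage, which is why the init estimator must also produce numeric predictions.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = GradientBoostingClassifier(n_estimators=3).fit(X, y)
# Each boosting stage holds DecisionTreeRegressor objects, even for classification
print(type(clf.estimators_[0, 0]))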

– Raff.Edward
• Thanks for your comment. If you take a look at sklearn/ensemble/gradient_boosting.py, you can see that it has support for classification problems (look for residual = loss.negative_gradient(y, y_pred, k=k)). – iampat Jul 03 '13 at 18:10
• Gradient Boosting is inherently a regression algorithm. It can be adapted to classification with a proper loss function. I don't have time to read all of their code, but that doesn't mean it is not using a regressor to perform classification. The Least Squares loss function uses that same method. I would try another non-linear regressor first, to see if the problem is indeed that you should be using a regression algorithm. I've never seen anyone use Gradient Boosting with a non-regression-capable algorithm. – Raff.Edward Jul 03 '13 at 21:01