
I'm playing with a one-vs-all Logistic Regression classifier using Scikit-Learn (sklearn). I have a large dataset that is too slow to train in one go; I would also like to study the learning curve as training proceeds.

I would like to use batch gradient descent to train my classifier in batches of, say, 500 samples. Is there some way of using sklearn to do this, or should I abandon sklearn and "roll my own"?

This is what I have so far:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# xs are subsets of my training data, ys are ground truth for same; I have more 
# data available for further training and cross-validation:
xs.shape, ys.shape
# => ((500, 784), (500,))
lr = OneVsRestClassifier(LogisticRegression())
lr.fit(xs, ys)
lr.predict(xs[0:1, :])
# => array([ 1.])
ys[0]
# => 1.0

I.e. it correctly identifies a training sample (yes, I realize it would be better to evaluate it with new data -- this is just a quick smoke-test).

Re: batch gradient descent: I haven't gotten as far as creating learning curves, but can one simply run fit repeatedly on subsequent subsets of the training data? Or is there some other function to train in batches? The documentation and Google are fairly silent on the matter. Thanks!

JohnJ

1 Answer


What you want is not batch gradient descent, but stochastic gradient descent; batch learning means learning on the entire training set in one go, while what you describe is properly called minibatch learning. That's implemented in sklearn.linear_model.SGDClassifier, which fits a logistic regression model if you give it the option loss="log".

With SGDClassifier, like with LogisticRegression, there's no need to wrap the estimator in a OneVsRestClassifier -- both do one-vs-all training out of the box.

from sklearn.linear_model import SGDClassifier

# you'll have to set a few other options to get good estimates,
# in particular the number of iterations (n_iter), but this should get you going
lr = SGDClassifier(loss="log")

Then, to train on minibatches, use the partial_fit method instead of fit. The first time around, you have to feed it a list of classes because not all classes may be present in each minibatch:

import numpy as np
classes = np.unique(["ham", "spam", "eggs"])

# minibatches is assumed to be an iterable yielding (xs, ys) pairs
for xs, ys in minibatches:
    lr.partial_fit(xs, ys, classes=classes)

(Here, I'm passing classes for each minibatch, which isn't necessary but doesn't hurt either and makes the code shorter.)
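Since the question also asks about studying the learning curve as training proceeds, here is a minimal sketch of one way to do that with partial_fit: train on one minibatch at a time and record the score on a held-out set after each step. The names xs_train, ys_train, xs_val and ys_val are placeholders I'm assuming for illustration:

import numpy as np
from sklearn.linear_model import SGDClassifier

lr = SGDClassifier(loss="log")
classes = np.unique(ys_train)  # full set of labels, known up front

batch_size = 500
learning_curve = []
for start in range(0, xs_train.shape[0], batch_size):
    xb = xs_train[start:start + batch_size]
    yb = ys_train[start:start + batch_size]
    lr.partial_fit(xb, yb, classes=classes)
    # accuracy on held-out data after each minibatch
    learning_curve.append(lr.score(xs_val, ys_val))

Plotting learning_curve against the number of minibatches seen gives the curve the question asks about.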

Fred Foo
  • Thanks for the detailed answer, which I will try ASAP. Regarding definitions, I was following the terminology from Andrew Ng's Coursera class; in his nomenclature, stochastic gradient descent involved changing the gradient with every training sample, and what you call minibatch he called batch gradient descent. But naming notwithstanding, I can see that this is what I was asking for -- thanks a lot! – JohnJ Feb 23 '13 at 14:22
  • 2
    @JohnJ: actually SGD can be used in batch, minibatch or online (one sample at a time) mode. The terminology I used here is that of prof. Hinton's Coursera NN/ML class, which I've found so far to be consistent with most of the literature. – Fred Foo Feb 23 '13 at 15:14
  • Wonderful, thanks. Your answer worked well for me so far as well. Do you recommend that class? Thanks again. – JohnJ Feb 23 '13 at 15:43
  • 2
    @JohnJ: yes, it's quite good. It continues where Ng left off and if you like dry wit, you're in for a treat. – Fred Foo Feb 23 '13 at 17:13
  • @larsmans, is there any way to pass variable features to the partial_fit function? – Jatin Bansal Jun 11 '15 at 04:48
  • @FredFoo do you have any resources that confirm `SGDClassifier` supports multi-label classification? The docs and http://stackoverflow.com/questions/20335853/scikit-multilabel-classification-valueerror-bad-input-shape say that it's the opposite – stpk Aug 31 '16 at 12:11
  • does SGDClassifier's fit method do batch gradient descent or online (one sample at a time) gradient descent? – Nagabhushan Baddi Sep 27 '18 at 07:33
  • @NagabhushanBaddi according to my knowledge, the fit method uses one sample at a time for training, while the partial_fit method uses batch GD for training. Correct me if I'm wrong. – Kirushikesh Jan 17 '22 at 10:04