
I'm trying to use SGD to classify a large dataset. As the data is too large to fit into memory, I'd like to use the partial_fit method to train the classifier. I have selected a sample of the dataset (100,000 rows) that fits into memory to test fit vs. partial_fit:

import numpy
from sklearn.linear_model import SGDClassifier

def batches(l, n):
    # yield successive chunks of size n from the sequence l
    for i in range(0, len(l), n):
        yield l[i:i+n]

# single call to fit: sklearn runs its default number of epochs internally
clf1 = SGDClassifier(shuffle=True, loss='log')
clf1.fit(X, Y)

# incremental training: one partial_fit call per 10,000-row batch
clf2 = SGDClassifier(shuffle=True, loss='log')
n_iter = 60
for n in range(n_iter):
    for batch in batches(range(len(X)), 10000):
        clf2.partial_fit(X[batch[0]:batch[-1]+1], Y[batch[0]:batch[-1]+1], classes=numpy.unique(Y))

I then test both classifiers with an identical test set. In the first case I get an accuracy of 100%. As I understand it, SGD by default passes 5 times over the training data (n_iter = 5).

In the second case, I have to pass 60 times over the data to reach the same accuracy.

Why this difference (5 vs. 60)? Or am I doing something wrong?

David M.
  • Give `verbose=1` to the SGD constructor, that may give you a hint. – Fred Foo Jul 08 '14 at 13:23
  • First case (fit) ends with "-- Epoch 5 Norm: 29.25, NNZs: 300, Bias: -1.674706, T: 459595, Avg. loss: 0.076786". Second case (partial_fit) after 10 passes ends with "-- Epoch 1 Norm: 22.99, NNZs: 300, Bias: -1.999685, T: 1918, Avg. loss: 0.089302". What should I be looking for? thx – David M. Jul 08 '14 at 14:46
  • The average loss. Check if it drops faster in the batch case. – Fred Foo Jul 08 '14 at 14:58
  • In the first case it drops from 0.087027 to 0.076786 in 15 passes (5 epochs; 3 passes/epoch). In the second case it's difficult to tell because it seems to me that the avg loss figures relate to each individual batch; hence great variations in the numbers (e.g. the last 10 figures are 0.000748; 0.258055; 0.001160; 0.267540; 0.036631; 0.291704; 0.197599; 0.012074; 0.109227; 0.089302). – David M. Jul 08 '14 at 15:28
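For reference, the epoch lines quoted in the comments above come from passing verbose=1 to the SGDClassifier constructor; a minimal sketch (the output goes to stdout):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(shuffle=True, loss='log', verbose=1)  # print per-epoch norm, bias and avg. loss
clf.fit(X, Y)  # prints lines like "-- Epoch 5 ... Avg. loss: 0.076786"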

1 Answer


I have finally found the answer. You need to shuffle the training data between passes: setting shuffle=True when instantiating the model does NOT shuffle the data when using partial_fit (it only applies to fit). Note: it would have been helpful to find this information on the sklearn.linear_model.SGDClassifier documentation page.

The amended code reads as follows:

import numpy
import random
from sklearn.linear_model import SGDClassifier

clf2 = SGDClassifier(loss='log')  # shuffle=True is useless here
shuffledRange = list(range(len(X)))
n_iter = 5
for n in range(n_iter):
    random.shuffle(shuffledRange)  # reshuffle the row order before every pass
    shuffledX = [X[i] for i in shuffledRange]
    shuffledY = [Y[i] for i in shuffledRange]
    for batch in batches(range(len(shuffledX)), 10000):  # batches() as defined in the question
        clf2.partial_fit(shuffledX[batch[0]:batch[-1]+1], shuffledY[batch[0]:batch[-1]+1], classes=numpy.unique(Y))
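
If X and Y are numpy arrays, the same per-pass reshuffle can be written more compactly with sklearn.utils.shuffle (a minimal sketch of the same idea, assuming X and Y support numpy-style row indexing):

import numpy
from sklearn.utils import shuffle
from sklearn.linear_model import SGDClassifier

clf3 = SGDClassifier(loss='log')
classes = numpy.unique(Y)
batch_size = 10000
n_iter = 5
for n in range(n_iter):
    Xs, Ys = shuffle(X, Y, random_state=n)  # shuffled copies; rows of X and Y stay aligned
    for start in range(0, len(Xs), batch_size):
        clf3.partial_fit(Xs[start:start + batch_size], Ys[start:start + batch_size], classes=classes)
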
David M.
  • Shuffling the whole dataset would not be possible as the data does not fit in memory (if it did, we could simply use fit). Does shuffling the data inside the batches yield better results? – Fabio Picchi Mar 28 '18 at 13:54
  • If your data is stored in a way that supports indexing, such as filenames pointing to files or indexes into an on-disk array, you can keep the data indexes separate from the data and shuffle the indexes between epochs (see the sketch after these comments). – skeller88 Jan 24 '20 at 23:48
  • It's mentioned in the current version of the user guide (https://scikit-learn.org/stable/modules/sgd.html): `shuffle after each iteration`. I remember the shuffling being mentioned in an Andrew Ng YouTube lecture too. – Thariq Nugrohotomo Nov 02 '21 at 06:40
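
A minimal sketch of that index-shuffling idea, assuming the rows live in on-disk numpy arrays that can be opened with numpy.memmap (the filenames, shapes and dtypes below are hypothetical placeholders):

import numpy
import random
from sklearn.linear_model import SGDClassifier

# hypothetical on-disk arrays; only the rows selected by index are read into memory
X_disk = numpy.memmap('X.dat', dtype='float64', mode='r', shape=(10000000, 300))
Y_disk = numpy.memmap('Y.dat', dtype='int64', mode='r', shape=(10000000,))

clf = SGDClassifier(loss='log')
classes = numpy.unique(Y_disk)
indexes = list(range(len(Y_disk)))  # the index list fits in memory even when the data does not
batch_size = 10000

for epoch in range(5):
    random.shuffle(indexes)  # shuffle only the indexes between epochs
    for start in range(0, len(indexes), batch_size):
        idx = indexes[start:start + batch_size]
        clf.partial_fit(X_disk[idx], Y_disk[idx], classes=classes)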