
I am using scikit-learn's svm library for classifying images. I was wondering: when I fit new training data, does the classifier learn incrementally, or does it erase the previous model and re-fit from scratch on the new data? For example, if I fit 100 images to the classifier, can I then fit another 100 images on top of that, or will the SVM discard the work it performed on the original 100 images? This is difficult for me to explain, so I'll provide an example:

In order to fit a SVM classifier to 200 images can I do this:

clf=SVC(kernel='linear')
clf.fit(test.data[0:100], test.target[0:100])
clf.fit(test.data[100:200], test.target[100:200])

Or must I do this:

clf=SVC(kernel='linear')
clf.fit(test.data[:200], test.target[:200])

I am asking only because I run into memory errors when trying to use .fit(X, y) with too many images at once. So is it possible to use fit sequentially and "increment" my classifier upwards, so that it is technically trained on 10000 images but only 100 at a time?

If this is possible, please confirm and explain. And if it's not possible, please explain why.

Troll_Hunter
  • Maybe try dimensionality reduction and/or feature selection if you're running into memory errors – Ryan Aug 12 '15 at 18:02
  • I already used dimensionality reduction but haven't tried feature selection. If I reduce the images any more I might lose necessary pixel data. – Troll_Hunter Aug 12 '15 at 18:05

1 Answer


http://scikit-learn.org/stable/developers/index.html#estimated-attributes

The last-mentioned attributes are expected to be overridden when you call fit a second time without taking any previous value into account: fit should be idempotent.

https://en.wikipedia.org/wiki/Idempotent

So yes, a second call to fit will erase the old model and compute a new one. You can check this yourself if you read the Python source, for example in sklearn/svm/classes.py.
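To see this behavior directly, here is a small sketch with toy data (the arrays X1, y1, X2, y2 are made up for illustration): a classifier fit twice ends up identical to a fresh classifier fit only on the second batch.

```python
# Sketch: a second call to fit() discards the first model entirely.
# Assumes scikit-learn is installed; the toy data below is made up.
import numpy as np
from sklearn.svm import SVC

X1 = np.array([[0.0], [1.0], [2.0], [3.0]])
y1 = np.array([0, 0, 1, 1])
X2 = np.array([[10.0], [11.0], [12.0], [13.0]])
y2 = np.array([1, 1, 0, 0])  # labeled the opposite way

clf = SVC(kernel='linear')
clf.fit(X1, y1)
clf.fit(X2, y2)  # overwrites everything learned from X1, y1

# A classifier fit only on the second batch behaves identically:
fresh = SVC(kernel='linear').fit(X2, y2)
print(np.array_equal(clf.support_vectors_, fresh.support_vectors_))
```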

I think you need minibatch training, but I don't see a partial_fit implementation for SVM. That may be because the scikit-learn team recommends SGDClassifier and SGDRegressor for datasets with more than 100k samples; see http://scikit-learn.org/stable/tutorial/machine_learning_map/. Try to use them with minibatches as described here.
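As a rough sketch of what that minibatch loop could look like (the synthetic data and the batch size of 100 are just assumptions for illustration), SGDClassifier with hinge loss trains a linear-SVM-like model incrementally via partial_fit:

```python
# Sketch: minibatch training with SGDClassifier.partial_fit,
# an alternative when the full dataset does not fit in memory.
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data: 1000 samples, 20 features.
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = (X[:, 0] > 0.5).astype(int)

clf = SGDClassifier(loss='hinge', random_state=0)  # hinge loss ~ linear SVM
classes = np.array([0, 1])  # must be supplied on the first partial_fit call

batch_size = 100
for start in range(0, len(X), batch_size):
    # Each call updates the existing model instead of replacing it.
    clf.partial_fit(X[start:start + batch_size],
                    y[start:start + batch_size],
                    classes=classes)

print(clf.score(X, y))  # accuracy on the training data
```

Unlike SVC.fit, each partial_fit call here refines the same model, so you can stream batches of 100 images indefinitely.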

Ibraim Ganiev