Is it possible to train a sklearn model (eg SVM) incrementally?

Question

I'm trying to perform sentiment analysis over the twitter dataset "Sentiment140" which consists of 1.6 million labelled tweets . I'm constructing my feature vector using Bag Of Words ( Unigram ) model , so each tweet is represented by about 20000 features . Now to train my sklearn model (SVM,Logistic Regression,Naive Bayes) using this dataset , i have to load the entire 1.6m x 20000 feature vectors into one variable and then feed it to the model . Even on my server machine which has a total of 115GB of memory , it causes the process to be killed .

So i wanted to know if i can train the model instance by instance , rather than loading the entire dataset into one variable ?

If sklearn does not have this flexibility , then is there any other libraries that you could recommend (which support sequential learning) ?

If the data does not fit in the memory you may think about decreasing dimensionality via PCA or word2vec — Sergey Bushmanov, Feb 17 '19 at 09:50

desertnaut · Accepted Answer · 2019-02-16T12:57:19.477

It is not really necessary (let alone efficient) to go to the other extreme and train instance by instance; what you are looking for is actually called incremental or online learning, and it is available in scikit-learn's SGDClassifier for linear SVM and logistic regression, which indeed contains a partial_fit method.

Here is a quick example with dummy data:

import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)

clf.partial_fit(X, Y, classes=np.unique(Y))

X_new = np.array([[-1, -1], [2, 0], [0, 1], [1, 1]])
Y_new = np.array([1, 1, 2, 1])
clf.partial_fit(X_new, Y_new)

The default values for the loss and penalty arguments ('hinge' and 'l2' respectively) are these of a LinearSVC, so the above code essentially fits incrementally a linear SVM classifier with L2 regularization; these settings can of course be changed - check the docs for more details.

It is necessary to include the classes argument in the first call, which should contain all the existing classes in your problem (even though some of them might not be present in some of the partial fits); it can be omitted in subsequent calls of partial_fit - again, see the linked documentation for more details.

Is the classes (classes=np.unique(Y)) only for this particular slice of data ? or does it have to be the whole set of classes stored somehwere? — Arun George, Nov 01 '19 at 08:37
@desertnaut: Can you explain what is meant by **instance by instance training** — aspiring1, Jan 05 '20 at 16:36
@aspiring1 it simply means one sample (instance) at the time — desertnaut, Jan 06 '20 at 23:09

Is it possible to train a sklearn model (eg SVM) incrementally?

1 Answers1

Linked