
I passed two streams of data to an SGDClassifier as shown in the code below. The first partial_fit call takes the first stream of data (x1, y1); the second partial_fit call takes the second stream (x2, y2).

The code below gives me an error at the second partial_fit step saying that all class labels must be included up front. The error goes away when I include all of my x2, y2 data in x1, y1 (so the class labels are now seen before the second partial_fit call).

However, I cannot supply the x2, y2 data in advance. And if I had to give all my data before the first partial_fit(), why would I need a second partial_fit() at all? In fact, if I knew all the data beforehand, I wouldn't need partial_fit(); I could just call fit().

from sklearn import linear_model
import numpy as np

def train_new_data():

    sgd_clf = linear_model.SGDClassifier()

    x1 = [[8, 9], [20, 22]]
    y1 = [5, 6]

    # Classes seen so far: only 5 and 6.
    classes = np.unique(y1)

    sgd_clf.partial_fit(x1, y1, classes=classes)

    x2 = [10, 12]
    y2 = 8  # a label never declared in classes

    sgd_clf.partial_fit([x2], [y2], classes=classes)  # Error here!

    return sgd_clf

if __name__ == "__main__":

    print(train_new_data().predict([[20, 22]]))

Q1: Is my understanding of partial_fit() for sklearn classifiers wrong, i.e. that it can take data on the fly as described here: Incremental Learning?

Q2: I want to retrain/update a model with new data; I don't want to train from scratch. Will partial_fit help me with this?

Q3: I am not tied to SGDClassifier; I can use any algorithm that supports online/batch learning. This is my main intention: I have a model trained on thousands of images, and I don't want to retrain it from scratch just because I have one or two new image samples. Nor am I interested in creating a new model for each new entry and then combining them all, since searching across all those models would hurt my prediction performance. I just want to add the new data instances to the trained model with the help of partial_fit. Is this feasible?

Q4: If I cannot achieve Q2 with scikit-learn classifiers, please point me to how I can achieve this.

Any suggestions, ideas, or references are much appreciated.

user1

1 Answer


You need to know beforehand which classes (labels) you are going to need. After the first call to partial_fit, the algorithm assumes you will not introduce any new classes later.

In your example, you are adding a new class (y2 = 8) that has never been seen before and was not declared in your initial call to partial_fit (which only contained the labels 5 and 6). You need to add it to the classes argument on the first call.

I would also recommend you number your classes starting from 0 just for consistency's sake.
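
A minimal sketch of the fix, assuming the full label set (5, 6, and the later 8) is known up front; classes only needs to be passed on the first call:

from sklearn import linear_model
import numpy as np

sgd_clf = linear_model.SGDClassifier()

# Declare every class the model will ever see, up front.
all_classes = np.array([5, 6, 8])

# First batch: classes is honored only on this first call.
sgd_clf.partial_fit([[8, 9], [20, 22]], [5, 6], classes=all_classes)

# Second batch: the new label 8 is fine now, since it was declared above.
sgd_clf.partial_fit([[10, 12]], [8])

print(sgd_clf.predict([[20, 22]]))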

Raff.Edward
  • Why is it like this only with classifiers? The same code with SGDRegressor does not get this error. I feel like I've hit a dead end. I can't know the data beforehand, and regressors don't serve my purpose of predicting labels. I can't use any algorithm that doesn't support partial_fit. I will just try random forest with warm_start and end my search (see the sketch after these comments). – user1 Feb 23 '18 at 23:00
  • My last hope is to try KerasClassifier. Do you know if KerasClassifier also needs the class labels up front like the scikit classifiers? – user1 Feb 23 '18 at 23:05
  • @krishnadamarla Sure, KerasClassifier needs them too. Please do some very basic research into **classification**. The term classification also distinguishes this task from **regression** (*SGDRegressor does not get this error*). – sascha Feb 24 '18 at 03:44
  • @sascha OK, thank you. I just got distracted by the fact that sklearn says some six of its classifiers can perform [incremental learning](http://scikit-learn.org/stable/modules/scaling_strategies.html), whereas in reality they actually can't. – user1 Feb 24 '18 at 12:00
  • @krishnadamarla The scikit-learn models *can* do incremental learning. Your problem (new classes with new data) is referred to as *few-shot* learning. These terms have generally accepted meanings within the ML community, and knowing and using the correct terms will help you find what you want. Granted, few-shot learning is a harder problem, and there is less ready-to-go code for it. – Raff.Edward Feb 26 '18 at 05:07
  • @Raff.Edward, is few-shot learning the same as online learning (learning instance by instance)? – user1 Mar 06 '18 at 10:43
  • Few-shot learning is not the same as online learning, but they are related. Few-shot learning usually involves online learning. Online learning generally does not involve new classes, just new data. Few-shot learning involves new classes and new data. – Raff.Edward Mar 06 '18 at 18:48
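
For reference, here is a minimal sketch of the warm_start idea mentioned in the comments, with hypothetical stand-ins (X_old, y_old, X_new, y_new) for the original data and the new samples. Note that warm_start grows extra trees on the new batch rather than updating the existing ones, and it does not handle previously unseen class labels either:

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Hypothetical stand-ins for the original training set and the new samples.
rng = np.random.RandomState(0)
X_old, y_old = rng.rand(100, 2), rng.randint(0, 2, 100)
X_new = rng.rand(4, 2)
y_new = np.array([0, 1, 0, 1])  # same label set as y_old

rf = RandomForestClassifier(n_estimators=100, warm_start=True)
rf.fit(X_old, y_old)

# Grow 10 additional trees fitted only on the new batch;
# the 100 existing trees are kept as-is.
rf.n_estimators += 10
rf.fit(X_new, y_new)

print(rf.predict(X_new[:2]))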