Why is the following partial fit not working properly?

Question

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

Hello, I have the following list of comments:

comments = ['I am very agry','this is not interesting','I am very happy']

These are the corresponding labels:

sents = ['angry','indiferent','happy']

I am using TF-IDF to vectorize these comments as follows:

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing

I am using a LabelEncoder to encode the labels:

le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

Here I am using a passive-aggressive classifier to fit the model:

clf2 = PassiveAggressiveClassifier()


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)

with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

Here I am trying to test partial_fit with three new comments and their corresponding labels:

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)

print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)

The problem is that I am not getting the right results after the partial fit:

print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

However, I am getting this output:

[2 2 2]

I would really appreciate help in finding out why the model is not being updated. Since I am testing it with the same examples it was just trained on, the desired output should be:

[1,0,2]

I would also appreciate help in adjusting the hyperparameters, if that is what is needed to get the desired output.

This is the complete code, showing the partial fit:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random


comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
    pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
    pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)

clf2 = PassiveAggressiveClassifier()

clf2.fit(tfidf, labels)


with open('passive.pickle','wb') as idxf:
    pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)

with open('passive.pickle', 'rb') as infile:
    clf2 = pickle.load(infile)



with open('tfidf_vectorizer.pickle', 'rb') as infile:
    tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
    tfidf = pickle.load(infile)

new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]

vec_new_comments = tfidf_vectorizer.transform(new_comments)

clf2.partial_fit(vec_new_comments, new_labels)



print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))

However, I got:

AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]

How are you fitting `clf2`? Please post the whole code as one code snippet. Right now it's very annoying to copy and paste again and again. — Vivek Kumar, Apr 15 '17 at 03:45
@VivekKumar I have updated the question, I added the complete code to reproduce my issue, thanks for the support — neo33, Apr 15 '17 at 04:18

score 1 · Accepted Answer · edited May 23 '17 at 11:46

Well, there are multiple problems with your code. I will go from the most obvious ones to the more complex ones:

1. You are pickling clf2 before it has learnt anything (i.e. you pickle it as soon as it is defined, which serves no purpose). If you are only testing, that is fine. Otherwise it should be pickled after the fit() or equivalent calls.
2. You are calling clf2.fit() before clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because on your subsequent call to partial_fit() you are giving the same labels. But it is still not good practice.

See this for more details

In a partial_fit() scenario, never call fit(). Always call partial_fit(), both with your starting data and with newly arriving data. But make sure that you supply all the labels you want the model to learn in the first call to partial_fit(), through the classes parameter.
3. Now the last part, about your tfidf_vectorizer. You call fit_transform() (which is essentially fit() and then transform() combined) on tfidf_vectorizer with the comments array. That means that on subsequent calls to transform() (as you did in transform(new_comments)), it will not learn new words from new_comments, but will only use the words it saw during the call to fit() (the words present in comments).

Same goes for LabelEncoder and sents.
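
To make the point about the fixed vocabulary and the fixed label set concrete, here is a minimal sketch using the data from the question. The variable names vectorizer, nonzero_per_row and the label 'sad' are only illustrative and do not appear in the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# The vocabulary is decided here, from `comments` only
vectorizer = TfidfVectorizer(analyzer='word')
vectorizer.fit(comments)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
# ['agry', 'am', 'happy', 'interesting', 'is', 'not', 'this', 'very']

# transform() silently drops every word that is not in the vocabulary
new_tfidf = vectorizer.transform(new_comments)
nonzero_per_row = new_tfidf.getnnz(axis=1).tolist()
print(nonzero_per_row)
# [0, 0, 3]

# LabelEncoder behaves the same way: its classes are fixed by fit()
le = LabelEncoder()
le.fit(['angry', 'indiferent', 'happy'])
print(le.classes_.tolist())
# ['angry', 'happy', 'indiferent']

try:
    le.transform(['sad'])  # 'sad' is a made-up label that was never fitted
except ValueError as exc:
    print('unseen label rejected:', exc)
```

The first two new comments become all-zero vectors, so the classifier receives identical (empty) inputs for two different target labels and cannot tell them apart, whatever the hyperparameters are.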

This again is not preferable in an online learning scenario. You should fit on all the available data at once. But since you are trying to use partial_fit(), we assume that you have a very large dataset which may not fit in memory at once. So you would want to apply some sort of partial_fit to TfidfVectorizer as well. But TfidfVectorizer doesn't support partial_fit(). In fact, it's not made for large data. So you need to change your approach. See the following questions for more details:
- Updating the feature names into scikit TFIdfVectorizer
- How can i reduce memory usage of Scikit-Learn Vectorizers?
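
One option that is often used for this kind of streaming setup is scikit-learn's HashingVectorizer. This is my own suggestion of a possible change of approach, not something taken from the code in the question. It is stateless (no fit() step and no stored vocabulary), so unseen words still get a feature index, but it does no IDF weighting, so it is not a drop-in replacement for TfidfVectorizer. The value n_features=2**18 below is an arbitrary choice for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

comments = ['I am very agry', 'this is not interesting', 'I am very happy']
new_comments = ['I love the life', 'I hate you', 'this is not important']

# No fit() needed: each word is mapped to a column by a hash function
hasher = HashingVectorizer(analyzer='word', n_features=2**18)

old_vec = hasher.transform(comments)
new_vec = hasher.transform(new_comments)  # can be called batch by batch

print(old_vec.shape, new_vec.shape)

# Words never seen before still produce non-zero features
unseen_nnz = new_vec.getnnz(axis=1).tolist()
print(unseen_nnz)
```

Because the number of columns is fixed up front, every batch has the same shape and can be passed straight to partial_fit().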

All that aside, if you change just the tfidf part so that the vectorizer is fitted on the whole data (comments and new_comments at once), you will get your desired results.

See the code changes below (I have reorganized it a bit and renamed vec_new_comments to new_tfidf, so please go through it carefully):

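from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

comments = ['I am very agry','this is not interesting','I am very happy']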
sents = ['angry','indiferent','happy']

new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']

tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()

# The lines below are important

# Fit tfidf_vectorizer on the whole data so it knows every word
tfidf_vectorizer.fit(comments + new_comments)

# Same for `sents`, but since the labels don't change, it doesn't matter
# which one you use, because the result will be the same
# le.fit(sents)
le.fit(sents + new_sents)

Below is the less preferred code (the approach you are using, which I discussed in point 2). The results are still good as long as you make the above changes.

tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)

clf2 = PassiveAggressiveClassifier()
clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]

new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)

clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]     As you wanted

Correct approach, or the way partial_fit() is intended to be used:

# Declare all labels that you want the model to learn
# Using the classes learnt by the LabelEncoder for this
# In every call to `partial_fit()`, all labels must come from this array only

all_classes = le.transform(le.classes_)

# Notice the `classes` parameter here
# It must be present on the first call
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]

# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]
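
The rule about the classes parameter can be checked in isolation with a small sketch. The tiny dense arrays and the names X_first, X_later and first_call_error are made up for illustration and stand in for the TF-IDF matrices above.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy data: the first batch contains only classes 0 and 1
X_first = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_first = np.array([0, 1])
# Class 2 only shows up in a later batch
X_later = np.array([[0.0, 0.0, 1.0]])
y_later = np.array([2])
all_classes = np.array([0, 1, 2])

# Without `classes`, the very first partial_fit() call is rejected
clf = PassiveAggressiveClassifier()
try:
    clf.partial_fit(X_first, y_first)
    first_call_error = None
except ValueError as exc:
    first_call_error = str(exc)
print(first_call_error)

# With `classes` on the first call, later batches may contain
# labels that the first batch did not
clf = PassiveAggressiveClassifier()
clf.partial_fit(X_first, y_first, classes=all_classes)
clf.partial_fit(X_later, y_later)
print(clf.classes_.tolist())
# [0, 1, 2]
```

Declaring every class up front is what lets the model reserve a weight vector for labels it has not seen yet.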

Thanks a lot for the support, I finally overcame this situation — neo33, Apr 17 '17 at 17:06
