
I am trying out this code

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

train_data = ["football is the sport","gravity is the movie", "education is imporatant"]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')

print "Applying first train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()

print "\n\nApplying second train data"
train_data = ["cricket", "Transformers is a film","AIMS is a college"]
X_train = vectorizer.transform(train_data)
print vectorizer.get_feature_names()

print "\n\nApplying fit transform onto second train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()

The output for this one is

Applying first train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']


Applying second train data
[u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']


Applying fit transform onto second train data
[u'aims', u'college', u'cricket', u'film', u'transformers']

I passed the first set of data to the vectorizer with fit_transform, which gave me the feature names [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport']. I then applied a second training set to the same vectorizer with transform, and the feature names stayed the same because I did not call fit or fit_transform. I want to know how to update the features of a vectorizer without overwriting the previous ones. If I call fit_transform again, the previous features are overwritten. What I want is an updated feature list, something like [u'education', u'football', u'gravity', u'imporatant', u'movie', u'sport', u'aims', u'college', u'cricket', u'film', u'transformers']. How can I get that?

mbatchkarov
Gunjan

2 Answers


In sklearn terminology, this is called a partial fit and you can't do it with a TfidfVectorizer. There are two ways around this:

  • Concatenate the two training sets and re-vectorize.
  • Use a HashingVectorizer, which supports partial fitting. However, it has no get_feature_names method because it hashes features, so the original terms are not kept. Its advantage is that it is much more memory efficient.

Example of the first approach:

from sklearn.feature_extraction.text import TfidfVectorizer

train_data1 = ["football is the sport", "gravity is the movie", "education is important"]
vectorizer = TfidfVectorizer(stop_words='english')

# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions call get_feature_names_out() instead.
print("Applying first train data")
X_train = vectorizer.fit_transform(train_data1)  # learns the vocabulary
print(vectorizer.get_feature_names())

print("\n\nApplying second train data")
train_data2 = ["cricket", "Transformers is a film", "AIMS is a college"]
X_train = vectorizer.transform(train_data2)  # vocabulary is left unchanged
print(vectorizer.get_feature_names())

print("\n\nApplying fit transform onto second train data")
# Refit on both sets together so the vocabulary covers all documents
X_train = vectorizer.fit_transform(train_data1 + train_data2)
print(vectorizer.get_feature_names())

Output:

Applying first train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']

Applying second train data
['education', 'football', 'gravity', 'important', 'movie', 'sport']

Applying fit transform onto second train data
['aims', 'college', 'cricket', 'education', 'film', 'football', 'gravity', 'important', 'movie', 'sport', 'transformers']
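
The second approach from the list above could look roughly like the sketch below. It assumes the same two batches of documents as in the first example; the n_features value is an arbitrary choice for illustration. Because HashingVectorizer keeps no vocabulary, each batch can be transformed on its own and the output width stays fixed.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# n_features is an illustrative assumption; the library default is much larger.
vectorizer = HashingVectorizer(n_features=2**10, stop_words='english')

batch1 = ["football is the sport", "gravity is the movie", "education is important"]
batch2 = ["cricket", "Transformers is a film", "AIMS is a college"]

# No fit step is needed: terms are mapped to columns by a hash function,
# so batches seen later land in the same fixed-width feature space.
X1 = vectorizer.transform(batch1)
X2 = vectorizer.transform(batch2)

print(X1.shape, X2.shape)
```

Both matrices have the same number of columns, so they can be passed to a model that is trained incrementally, at the cost of not being able to map a column back to a term.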
mbatchkarov
  • Both of your approaches work well. I found HashingVectorizer useful for my purpose. Thanks for the answer :) – Gunjan Aug 06 '14 at 12:40

I found this question while googling for the same issue that the OP raised. As mbatchkarov said, scikit-learn's TfidfVectorizer doesn't natively support partial fitting.

HashingVectorizer is usually a great alternative, but it really depends on your use-case. Specifically, if you care very much about representing infrequent terms precisely, then collisions will hurt performance.
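
To make the collision concern concrete, here is a small sketch with a deliberately tiny feature space (n_features=4 is an assumption chosen only to force collisions; real settings use far more columns). Five distinct terms cannot fit into four columns, so at least two of them must share one.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# norm=None and alternate_sign=False keep raw, unsigned counts so the
# column each term lands in is easy to read off.
vectorizer = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)

terms = ["cricket", "film", "college", "aims", "transformers"]
X = vectorizer.transform(terms)  # one single-term document per row

# Column index of the single nonzero entry in each row
columns = [int(c) for c in X.nonzero()[1]]
print(dict(zip(terms, columns)))
print("distinct columns used:", len(set(columns)), "for", len(terms), "terms")
```

Terms that share a column are indistinguishable to any downstream model, which is why rare but important terms can suffer when the feature space is too small.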

So I went ahead and wrote my own implementation of "partial_fit" for both TfidfVectorizer and CountVectorizer (see here). Hope it's useful for other people reaching this post. Note that this kind of partial fitting does change the dimension of the output vector given by the vectorizer since the whole point is to update the vocabulary (so take this into account when using in a pipeline).
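
The linked implementation is not reproduced here. As a rough sketch of the general idea only, and not the author's code, one way to grow a vocabulary without disturbing existing column indices is to append unseen terms and build a new CountVectorizer with the vocabulary parameter. The helper name extend_vocabulary is hypothetical. For TfidfVectorizer the IDF weights would also have to be recomputed, which this sketch does not attempt.

```python
from sklearn.feature_extraction.text import CountVectorizer

def extend_vocabulary(old_vocab, new_docs, **vectorizer_kwargs):
    """Hypothetical helper: keep old term indices, append unseen terms."""
    new_terms = CountVectorizer(**vectorizer_kwargs).fit(new_docs).vocabulary_
    merged = {term: int(index) for term, index in old_vocab.items()}
    for term in sorted(new_terms):
        if term not in merged:
            merged[term] = len(merged)  # next free column index
    return merged

batch1 = ["football is the sport", "gravity is the movie", "education is important"]
batch2 = ["cricket", "Transformers is a film", "AIMS is a college"]

first = CountVectorizer(stop_words='english').fit(batch1)
vocab = extend_vocabulary(first.vocabulary_, batch2, stop_words='english')

# A vectorizer built with a fixed vocabulary can transform without fitting.
updated = CountVectorizer(stop_words='english', vocabulary=vocab)
X = updated.transform(batch1 + batch2)

print(sorted(vocab, key=vocab.get))
print(X.shape)
```

The output matrix is wider than before, which matches the caveat above: anything downstream that was fitted on the old dimensionality has to be refitted or otherwise adapted.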

Ido S
  • Cool implementation. You should try submitting it as a PR to sklearn to see if that's something that might be useful upstream. – Cerin Apr 26 '20 at 03:24