
I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.

from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', 
                 stop_words='english')
self.vect.fit_transform(self.vocabulary)

...

doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)

The problem is that this returns a matrix with n rows where n is the size of my doc string. I want it to return just a single vector representing the tf-idf for the entire string. How can I make this see the string as a single document, rather than each character being a document? Also, I am very new to text mining so if I am doing something wrong conceptually, that would be great to know. Any help is appreciated.

Srikar Appalaraju
Sterling

1 Answer


If you want to compute tf-idf only for a given vocabulary, pass it through the vocabulary argument of the TfidfVectorizer constructor:

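from sklearn.feature_extraction.text import TfidfVectorizer

# Fix the vocabulary up front; fitting then learns idf weights for these terms only.
vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
           stop_words='english', vocabulary=vocabulary)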

Then, to fit the vectorizer, i.e. learn the document frequencies (idf weights) from a given corpus, which is an iterable of documents, use fit:

vect.fit(corpus)
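
To make the effect of fit concrete when the vocabulary is fixed, here is a minimal sketch. The vocabulary and corpus are made up for illustration and are not from the question. It shows that the vocabulary stays exactly as passed in, while fit learns one idf weight per term from the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vocabulary and toy corpus, for illustration only.
vocabulary = ["list", "words", "want", "look", "documents"]
corpus = [
    "I want a list of words",
    "look for words in the documents",
    "some documents list what I want",
]

vect = TfidfVectorizer(vocabulary=vocabulary)
vect.fit(corpus)

# The term-to-column mapping is the one supplied; fit did not extend it.
print(vect.vocabulary_)

# One idf weight per vocabulary term, learned from the corpus.
# "look" occurs in fewer documents than "words", so its idf is larger.
print(vect.idf_)
```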

The fit_transform method is shorthand for

vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus) 
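
As a quick check of that equivalence, the sketch below compares the two routes on a made-up corpus (the documents are an assumption for illustration) using default TfidfVectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# One step: fit and transform together.
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps on a fresh vectorizer: fit, then transform the same corpus.
vect = TfidfVectorizer()
vect.fit(corpus)
two_step = vect.transform(corpus)

# Both routes should produce the same document-term matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```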

Finally, the transform method accepts a corpus (an iterable of documents). To transform a single document, wrap it in a list; a bare string is itself an iterable of characters, so each character is treated as a separate document:

doc_tfidf = vect.transform([doc])
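
Putting the pieces together, a minimal end-to-end sketch might look like the following. The vocabulary and training corpus are assumptions made up for illustration; only the test string comes from the question. As far as I know, recent scikit-learn releases raise a ValueError when given a bare string instead of splitting it into characters, so wrapping the document in a list is needed either way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vocabulary and training corpus, for illustration only.
vocabulary = ["list", "words", "want", "look", "documents"]
corpus = [
    "I want a list of words",
    "look for words in the documents",
    "some documents list what I want",
]

vect = TfidfVectorizer(sublinear_tf=True, vocabulary=vocabulary)
vect.fit(corpus)

doc = "some string I want to get tf-idf vector for"

# Wrap the single document in a list so it counts as one document.
doc_tfidf = vect.transform([doc])

# One row for the document, one column per vocabulary term.
print(doc_tfidf.shape)

# Dense view of the single tf-idf vector.
print(doc_tfidf.toarray()[0])
```

The result is a sparse matrix with a single row; call toarray() on it (or index row 0) when a plain vector is needed.
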
Nickil Maveli
alko
  • 4
    So what is the difference between fit_transform and transform? I've read the documentation, but I don't understand clearly. We use fit_transform to count the occurrences of each term in a list of documents? Then transform...takes those counts and calculates the tf-idf vector for a list of documents? – Sterling Nov 21 '13 at 21:57
  • 6
    @Sterling you use `fit` or `fit_transform` (see update) to train tfidf transformation, and `transform` to apply without counts update – alko Nov 21 '13 at 22:33
  • 1
    when vocabulary param in TfidfVectorizer is an input variable and not inferred from corpus, what is the effect of fitting on a corpus? is it necessary? – Moniba Aug 22 '19 at 20:24