4

I want to get the count of a word in a given sentence using only the tf*idf matrix of a set of sentences. I use TfidfVectorizer from sklearn.feature_extraction.text.

Example:

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
tfidf_matrix = vect.fit_transform(sentences).toarray()

I want to be able to calculate the number of times the term "sun" occurs in the first sentence (which is 2) using only tfidf_matrix[0] and probably vect.idf_. I know there are countless ways to get term frequencies and word counts, but I have a special case where I only have the tf*idf matrix. I already tried dividing the tf*idf value of the word "sun" in the first sentence by its idf value to get tf, then multiplying tf by the total number of words in the sentence to get the word count. Unfortunately, I get wrong values.
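Concretely, my attempt looks something like this (a minimal sketch; I use `vect.vocabulary_` only to look up the column index of "sun"):

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ("The sun is shiny i like the sun", "I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english", lowercase=False)
tfidf_matrix = vect.fit_transform(sentences).toarray()

# divide the tf*idf value by its idf to get tf, then scale by sentence length
sun_ix = vect.vocabulary_["sun"]
tf_sun = tfidf_matrix[0, sun_ix] / vect.idf_[sun_ix]
print(tf_sun * len(sentences[0].split()))  # ~5.08, not the expected 2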

user2552861
  • Are you able to fit another tfidf matrix? There is an option `use_idf` that you can set to `False`. – rabbit Aug 27 '15 at 15:24
  • Actually, I can't. But, let's presume that I can. Setting `use_idf` to `False` will allow me to have the term frequencies (which I already can get by dividing the tf*idf value by the idf value). How would I calculate the word count from the term frequency value? – user2552861 Aug 27 '15 at 15:46
  • Sorry, I realize now that I misunderstood the prompt. Are you also normalizing the term vectors? – rabbit Aug 27 '15 at 15:52
  • No problem. Yes, the default norm is set to "l2", so the term vectors are normalized. – user2552861 Aug 27 '15 at 15:58

1 Answer

4

The intuitive thing to do is exactly what you tried: multiply each tf value by the number of words in the sentence you're examining. However, I think the key observation here is that each row has been normalized by its Euclidean length. So multiplying each row by the number of words in that sentence at best approximates the denormalized row, which is why you get strange values. AFAIK, you can't denormalize the tf*idf matrix without knowing the norms of the original rows ahead of time, primarily because infinitely many vectors map to any one normalized vector. So without the norms, you can't recover the correct magnitude of the original vector. See this answer for more details about what I mean.
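To see what I mean about the lost magnitude, here is a small illustration (just plain numpy, unrelated to the matrix above): two count vectors that differ only by a scale factor collapse to the same L2-normalized row, so the normalized matrix alone can't tell them apart.

import numpy as np

a = np.array([1.0, 2.0, 2.0])  # e.g. raw counts from a short sentence
b = np.array([2.0, 4.0, 4.0])  # same proportions, twice the magnitude

# L2 normalization maps both to the same unit vector
print(np.allclose(a / np.linalg.norm(a), b / np.linalg.norm(b)))  # True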

That said, I think there's a workaround in your case. We can at least recover the normalized ratios of the term counts in each sentence, i.e., that "sun" appears twice as often as "shiny". I found that normalizing each row so that its tf values sum to 1, then multiplying those values by the length of the stopword-filtered sentence, seems to recover the original word counts.

To demonstrate:

sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
mat = vect.fit_transform(sentences).toarray()
q = mat / vect.idf_
sums = np.ones((q.shape[0], 1))
lens = np.ones((q.shape[0], 1))
for ix in xrange(q.shape[0]):
    sums[ix] = np.sum(q[ix,:])
    lens[ix] = len([x for x in sentences[ix].split() if unicode(x) in vect.get_feature_names()]) #have to filter out stopwords
sum_to_1 = q / sums
tf = sum_to_1 * lens
print tf

yields:

[[1. 0. 1. 1. 2.]
 [0. 1. 0. 0. 1.]]

I tried this with a few more complicated sentences and it seems to work alright. Let me know if I missed anything.
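
If you're able to refit on the raw sentences, a quick sanity check (outside the constraints of the question, since it doesn't use only the tf*idf matrix) is to compare against a plain CountVectorizer with the same settings; both vectorizers sort the vocabulary the same way, so the columns line up:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words="english", lowercase=False)
true_counts = count_vect.fit_transform(sentences).toarray()
print(np.allclose(tf, true_counts))  # True: the recovered matrix matches the raw counts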

rabbit