
I have two different text which I want to compare using tfidf vectorization. What I am doing is:

  1. tokenizing each document
  2. vectorizing using `TfidfVectorizer.fit_transform(tokens_list)`

Now the vectors that I get after step 2 are of different shapes. But as per the concept, both vectors should have the same shape; only then can they be compared.
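The mismatch the two steps above produce can be reproduced with a minimal sketch (the two sentences here are placeholders, not the actual documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc1 = ["the quick brown fox jumps over the lazy dog"]
doc2 = ["a completely different sentence with other words entirely"]

# Fitting a separate vectorizer on each document builds a separate
# vocabulary for each, so the resulting matrices have different widths.
v1 = TfidfVectorizer().fit_transform(doc1)
v2 = TfidfVectorizer().fit_transform(doc2)

print(v1.shape)  # (1, 8) -- columns = unique tokens of doc1
print(v2.shape)  # (1, 7) -- columns = unique tokens of doc2
```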

What am I doing wrong? Please help.

Thanks in advance.

akshit bhatia
    Someone can correct me if I'm mistaken, but generally I think you should not be doing a `fit_transform` on two different bags of words. You should be doing a `fit_transform` on one set, then using the already fitted vectorizer to just do a `transform` on the second set for comparison to the first – G. Anderson Dec 12 '18 at 17:31
  • Possible duplicate of [Similarity between two text documents](https://stackoverflow.com/questions/8897593/similarity-between-two-text-documents) – G. Anderson Dec 12 '18 at 17:44
  • Makes sense... I would try again using transform on the second text instead of fit_transform. It's true that I should use the vocabulary of the first document on the second document to check the similarity. Don't know why I didn't think of this before. Thanks – akshit bhatia Dec 12 '18 at 19:09
  • Maybe this will help https://colab.research.google.com/drive/1lxRclJablHF-veuRzWBgJ9gaqMNo6fPa – alvas Dec 13 '18 at 09:34

2 Answers


As G. Anderson already pointed out, and to help future readers: when we call the fit function of TfidfVectorizer on document D1, the bag of words for D1 is constructed.

The transform() function then computes the tf-idf score of each word in that bag of words.

Now our aim is to compare document D2 with D1, that is, to see how many words of D1 match up with D2. That is why we call fit_transform() on D1 and then only transform() on D2: transform() applies D1's bag of words to D2 and scores D2's tokens against it. This gives a relative comparison of D2 against D1.
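A minimal sketch of this fit-then-transform pattern (D1 and D2 here are placeholder sentences, not from the original question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

D1 = "the cat sat on the mat"
D2 = "the dog sat on the log"

vectorizer = TfidfVectorizer()
v1 = vectorizer.fit_transform([D1])  # learn the vocabulary from D1
v2 = vectorizer.transform([D2])      # reuse D1's vocabulary on D2

print(v1.shape == v2.shape)          # True: both matrices share D1's columns
print(cosine_similarity(v1, v2))     # similarity of D2 to D1
```

Because both vectors now live in the same space, any standard similarity measure (here cosine similarity) can compare them directly.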

akshit bhatia

I'm one of those later people :)

So my understanding of TF-IDF is that the IDF is computed from the frequency of the word (or n-gram) across both documents? So comparing which terms match in each doesn't really capture how common a word is in both documents for weeding out common words? Is there a way to do that with n-grams without the index error below?

ValueError: Shape of passed values is (26736, 1), indices imply (60916, 1)

# Applying TFIDF to vectors
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

# instantiate TfidfVectorizers
ngram_vectorizer1 = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, 1st vector
ngram_vectorizer2 = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, 2nd
ngram_vectorizert = TfidfVectorizer(ngram_range=(2, 2))  # bigrams, total

# fit models
ngram_vector1 = ngram_vectorizer1.fit_transform(text)
ngram_vector2 = ngram_vectorizer2.fit_transform(text2)
ngram_vectort = ngram_vectorizert.fit_transform(total)
ngramfeatures1 = ngram_vectorizer1.get_feature_names()  # save feature names
ngramfeatures2 = ngram_vectorizer2.get_feature_names()  # save feature names
ngramfeaturest = ngram_vectorizert.get_feature_names()
print("\n\nngramfeatures1 : \n", ngramfeatures1)
print("\n\nngramfeatures2 : \n", ngramfeatures2)
print("\n\nngram_vector1 : \n", ngram_vector1.toarray())
print("\n\nngram_vector2 : \n", ngram_vector2.toarray())


# Compute the IDF values
first_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
second_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
total_tfidf_transformer_ngram = TfidfTransformer(smooth_idf=True, use_idf=True)
first_tfidf_transformer_ngram.fit(ngram_vector1)
second_tfidf_transformer_ngram.fit(ngram_vector2)
total_tfidf_transformer_ngram.fit(ngram_vectort)


# print 1st idf values
ngram_first_idf = pd.DataFrame(first_tfidf_transformer_ngram.idf_,
                               index=ngram_vectorizer1.get_feature_names(),
                               columns=["idf_weights"])

# sort ascending
ngram_first_idf.sort_values(by=['idf_weights'])  # this one should really be using something from the "total" calculations if I'm understanding it correctly?

Kim Ellis
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-ask). – Community Sep 16 '21 at 22:36
  • I am not sure exactly what you want to do; maybe someone else can answer better. I can answer this much: TF is the ratio of a term's count in a document to the total number of words in that document. IDF is (the log of) the ratio of the total number of documents to the number of documents containing that term. So TF tells how important a word is to that document, and IDF tells how rare that word is across all documents... so when you want to weed common words out, you compute IDF and remove terms below a threshold value. I like this idea. (read next comment, as I can't write more in this one) – akshit bhatia Sep 19 '21 at 06:37
  • But understand this: when you do fit_transform on text1, you are defining a vocabulary from text1. You would then want to reuse this vocabulary to calculate tf-idf on text2. What you are doing currently is building a different vocabulary for each of text1 and text2, so idf cannot be computed meaningfully because each vectorizer only ever sees one document. Call transform on text2 after applying fit_transform to text1, and you will get a better answer. – akshit bhatia Sep 19 '21 at 06:44