I have been trying to compute the similarity of thousands of text documents against a single query, and the document sizes vary widely (from 20 words to 2000 words).
I referred to the question: tf-idf documents of different length
But that doesn't help me, because when comparing against a pool of documents even small differences in the cosine value matter for preserving the ranking order.
I then came across a wonderful normalization blog: Tf-Idf and Cosine similarity. But the problem there is that it requires tweaking the term frequency of every document.
I am using sklearn to calculate tf-idf, and I am now looking for a utility with performance similar to sklearn's tf-idf. Iterating over all the documents to calculate TF and then modifying it is not only time-consuming but also inefficient.
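For reference, this is a minimal sketch of the vectorized sklearn approach I am currently using (the `docs` and `query` values are just placeholders for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Placeholder corpus and query; in my case there are thousands of
# documents ranging from ~20 to ~2000 words.
docs = [
    "the cat sat on the mat",
    "a very long document about felines and rugs " * 50,
    "dogs are loyal animals",
]
query = "cat on a mat"

# TfidfVectorizer L2-normalizes each row by default (norm='l2'),
# so a dot product with the query vector IS the cosine similarity.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)      # sparse (n_docs, n_terms)
query_vec = vectorizer.transform([query])        # sparse (1, n_terms)

# One sparse matrix-vector product scores every document at once,
# avoiding a per-document Python loop.
similarities = linear_kernel(query_vec, doc_matrix).ravel()
ranking = similarities.argsort()[::-1]           # best match first
```

This ranks all documents in a single vectorized pass; what I am missing is a way to plug in the length-normalized term frequency without falling back to a slow per-document loop.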
Any knowledge/suggestions are appreciated.