
I am trying to compute the similarity of thousands of text documents against a single query. The document lengths vary widely (from about 20 words to 2000 words).

I referred to the question: tf-idf documents of different length

But that doesn't help me, because even small differences in the cosine value matter when ranking a whole pool of documents: the ordering has to be preserved.

I then came across a helpful blog post on normalization: Tf-Idf and Cosine similarity. But the approach there requires tweaking the term frequency of every document individually.

I am using sklearn to calculate tf-idf, and now I am looking for a utility with performance comparable to sklearn's tf-idf implementation. Iterating over all the documents to compute TF and then modifying it is both time-consuming and inefficient.
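For what it's worth, here is a minimal sketch of what I have so far (the document list and query string are placeholders). sklearn's `TfidfVectorizer` already L2-normalizes each row by default, and `sublinear_tf=True` replaces raw counts with `1 + log(tf)`, which damps the effect of document length, but I am not sure this is enough for my length range:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus; in practice this is thousands of documents
# ranging from ~20 to ~2000 words.
docs = [
    "short note about tf idf",
    ("a much longer document " * 50) + "that still mentions tf idf",
    "unrelated text about cooking",
]
query = "tf idf similarity"

# sublinear_tf=True uses 1 + log(tf) instead of raw term frequency,
# reducing the advantage of very long documents; norm='l2' (the
# default) length-normalizes each row, so cosine similarity reduces
# to a sparse dot product.
vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2")
doc_matrix = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = scores.argsort()[::-1]  # indices of best matches first
```

This is vectorized over the whole corpus in one pass, so it avoids the per-document loop; my question is whether the built-in normalization options are sufficient, or whether a different TF weighting is needed.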

Any knowledge/suggestions are appreciated.

rkatkam
