I have been trying to compute the similarity of thousands of text documents against a single query, and the document sizes vary widely (from 20 words to 2000 words).
I referred to the question: tf-idf documents of different length
But that doesn't help me, because when comparing against a pool of documents even small differences in the cosine value matter for preserving the ranking order.
I then came across a wonderful normalization blog: Tf-Idf and Cosine similarity. But the problem there is that it requires tweaking the term frequency of every document.
I am using sklearn to calculate tf-idf, and I am now looking for a utility with performance similar to sklearn's tf-idf. Iterating over all the documents to calculate TF and then modifying it is not only time-consuming but also inefficient.
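For reference, this is a minimal sketch of the vectorized sklearn approach I am currently using (the `docs` and `query` values are just placeholders for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Placeholder corpus and query; in my case there are thousands of
# documents ranging from ~20 to ~2000 words.
docs = [
    "the cat sat on the mat",
    "a very long document about felines and rugs " * 50,
    "dogs are loyal animals",
]
query = "cat on a mat"

# TfidfVectorizer L2-normalizes each row by default (norm='l2'),
# so a dot product with the query vector IS the cosine similarity.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)      # sparse (n_docs, n_terms)
query_vec = vectorizer.transform([query])        # sparse (1, n_terms)

# One sparse matrix-vector product scores every document at once,
# avoiding a per-document Python loop.
similarities = linear_kernel(query_vec, doc_matrix).ravel()
ranking = similarities.argsort()[::-1]           # best match first
```

This ranks all documents in a single vectorized pass; what I am missing is a way to plug in the length-normalized term frequency without falling back to a slow per-document loop.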
Any knowledge/suggestions are appreciated.