
I am looking for an algorithm to measure:

1) the similarity of sentences (around 5000) to each other within a document

2) the similarity of multiple documents (around 5000) to each other

I need this because I'm trying to evaluate whether the text documents/sentences that fall under a particular category are in any way similar to each other. Are there existing methods for doing this?

  • There are several approaches (as noted by @Anony-Mousse below), the standard one being TF-IDF normalization followed by cosine similarity. Have you tried anything? What language are you planning to use (R, Python, etc.)? Do you just want a pointer in a specific direction, or do you have a more specific problem? – Umberto May 24 '17 at 05:41
  • Does this answer your question? [How to compute the similarity between two text documents?](https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents) – Geremia Jan 15 '23 at 22:08
  • TF-IDF wouldn't take word order into account. – Geremia Jan 15 '23 at 22:09

1 Answer


The standard approach is to use cosine similarity, with TF-IDF normalization.

There are many variants of this; you will need to experiment to find what works best for you.
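
As a rough illustration of that baseline, here is a minimal pure-Python sketch of TF-IDF weighting followed by pairwise cosine similarity. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `similarity_matrix`) are made up for this example, the tokenizer is deliberately crude, and the smoothed IDF formula is just one of the many variants mentioned above, so treat it as a starting point rather than a reference implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF, one common variant: log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalizing to unit length makes cosine similarity a plain dot product.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    # Dot product of two unit-length sparse vectors; iterate over the shorter one.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_matrix(docs):
    """Pairwise cosine similarities between all documents (or sentences)."""
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]


# Toy input; the same functions work on sentences or on whole documents.
example_docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply",
]
example_sim = similarity_matrix(example_docs)
```

For around 5000 items the full matrix has 25 million entries, so in practice a sparse-matrix library is likely a better fit than nested Python lists; in Python, scikit-learn's `TfidfVectorizer` together with `cosine_similarity` covers the same steps.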

Has QUIT--Anony-Mousse