Similarity of a group of text documents

Question

I am looking for an algorithm that measures

1) the similarity of sentences (around 5000) with each other in a document

2) the similarity of multiple documents (around 5000) with respect to each other

I need this because I'm trying to evaluate whether the text documents or sentences that fall under a particular category are in any way similar to each other. Are there any existing methods for doing this?

There are several approaches (as noted by @Anony-Mousse below), the standard one being TF-IDF normalization followed by cosine similarity. Have you tried anything? What language are you planning to use (R, Python, etc.)? Do you just want a pointer in a specific direction, or do you have a more specific problem? — Umberto, May 24 '17 at 05:41
Does this answer your question? [How to compute the similarity between two text documents?](https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents) — Geremia, Jan 15 '23 at 22:08

Has QUIT--Anony-Mousse · Answer 1 · 2017-05-19T06:25:25.173

2

The standard approach is to compute cosine similarity on TF-IDF-weighted term vectors.

There are many variants of this (different term-weighting and normalization schemes, for example), so you will need to experiment to find what works best for your data.
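To make the idea concrete, here is a minimal sketch in plain Python of TF-IDF weighting followed by pairwise cosine similarity. It is one possible variant, not a reference implementation: the tokenizer, the smoothed IDF formula, the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `similarity_matrix`) and the three sample texts are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a deliberately simple tokenizer (lowercase, alphanumeric runs).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, one common variant.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Pairwise cosine similarities between all documents (or sentences)."""
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]


# Hypothetical sample data; replace with your sentences or documents.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
sim = similarity_matrix(docs)
```

The same function works for both cases in the question: pass a list of sentences or a list of whole documents. Texts that share no terms score 0, and a text compared with itself scores 1. For around 5000 items the full matrix has 25 million entries, so a pure-Python loop like this will probably be slow; a vectorized library such as scikit-learn (`TfidfVectorizer` plus `cosine_similarity`) is likely a better fit at that size.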

edited May 19 '17 at 06:25

answered May 17 '17 at 20:42
