I am trying to find the best way to detect/remove near-duplicates in text data. By duplicates I mean texts with very high similarity, for example texts that are identical except for one sentence. The length can also vary (by one or two sentences more or less), so Hamming distance is not an option. Is there a way to compute a similarity score? Should I use term frequency matrices?
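To make the question more concrete, this is roughly the kind of thing I mean by a term-frequency-based similarity (a sketch using scikit-learn's TfidfVectorizer and cosine similarity; the 0.9 threshold and the sample texts are just placeholders, not something I have validated):

```python
# Rough sketch of a term-frequency-based near-duplicate check.
# Assumptions: scikit-learn is available; 0.9 is an arbitrary threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "First article body ...",
    "First article body ... with one extra sentence.",
    "A completely different article.",
]

# Build a TF-IDF matrix (one row per text) and compare all pairs.
tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)

# Flag pairs whose similarity exceeds the (arbitrary) threshold as near-duplicates.
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sim[i, j] > 0.9:
            print(f"texts {i} and {j} look like near-duplicates ({sim[i, j]:.2f})")
```

Is something along these lines sensible, or is there a better-suited measure for this kind of near-duplicate detection?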
About my data: I have it in a JSON file with Date, title and body (content) fields, so the similarity measure could take these three levels into account.
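For illustration (the file name and exact key names below are placeholders for my real data), each record looks roughly like this, and one idea would be to concatenate title and body before vectorizing:

```python
# Illustrative only: file name and key names are placeholders.
import json

with open("articles.json", encoding="utf-8") as f:
    records = json.load(f)  # list of {"date": ..., "title": ..., "body": ...}

# Concatenate title and body so both contribute to the similarity score.
texts = [f"{r['title']} {r['body']}" for r in records]
```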
Since I am asking about the approach (not the code), I do not think presenting the data is necessary.
Kind regards,