I am trying to find the best way to detect/remove near-duplicates in text data. By duplicates I mean texts with very high similarity, for example texts that are identical except for one sentence. The length can also vary (by one or two sentences more or less), so Hamming distance is not an option. Is there a way to compute a similarity score? Should I use term frequency matrices?
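To make the question more concrete, this is roughly the kind of thing I mean by a term-frequency-based similarity (a sketch using scikit-learn's TfidfVectorizer and cosine similarity; the 0.9 threshold and the sample texts are just placeholders, not something I have validated):

```python
# Rough sketch of a term-frequency-based near-duplicate check.
# Assumptions: scikit-learn is available; 0.9 is an arbitrary threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "First article body ...",
    "First article body ... with one extra sentence.",
    "A completely different article.",
]

# Build a TF-IDF matrix (one row per text) and compare all pairs.
tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)

# Flag pairs whose similarity exceeds the (arbitrary) threshold as near-duplicates.
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sim[i, j] > 0.9:
            print(f"texts {i} and {j} look like near-duplicates ({sim[i, j]:.2f})")
```

Is something along these lines sensible, or is there a better-suited measure for this kind of near-duplicate detection?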
About my data: I have it in a JSON file with Date, title and body (content) fields, so the similarity measure could take these three levels into account.
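For illustration (the file name and exact key names below are placeholders for my real data), each record looks roughly like this, and one idea would be to concatenate title and body before vectorizing:

```python
# Illustrative only: file name and key names are placeholders.
import json

with open("articles.json", encoding="utf-8") as f:
    records = json.load(f)  # list of {"date": ..., "title": ..., "body": ...}

# Concatenate title and body so both contribute to the similarity score.
texts = [f"{r['title']} {r['body']}" for r in records]
```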
Since I am asking about the approach (not the code), I do not think presenting the data is necessary.
Kind regards,