How to detect duplicate text with some fuzziness

Question

Some time ago, I wrote a small script using Text::DeDupe to remove duplicate blog posts before I have to lay my eyes on them.

After reading the paper Syntactic Clustering of the Web, on which that implementation is based, I would love to be able to find overlapping documents as well (e.g. snippets of blog posts as opposed to the full text, and maybe quotes too).

Do you know of any other implementation in C, C++ or Perl that I can try out before writing my own?
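
To make the goal concrete, here is a minimal sketch of the two measures described in the Syntactic Clustering paper: resemblance (how similar two documents are overall) and containment (how much of one document appears inside another, which is what matters for snippets and quotes). It is written in Python only for brevity and is not how Text::DeDupe works internally. The function names, the shingle size `w=4`, and the sample strings are assumptions made for illustration. The paper estimates these values from small sketches of each document, whereas this version computes them exactly from full shingle sets, so it is only suitable for small inputs.

```python
import re


def shingles(text, w=4):
    """Return the set of contiguous w-word shingles in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Short texts yield a single shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles that also occur in b: |A & B| / |A|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


# Hypothetical sample data: a full post, a snippet of it, and a "me too" repost.
post = "the quick brown fox jumps over the lazy dog near the quiet river bank"
snippet = "brown fox jumps over the lazy dog"
me_too = post + " me too"

if __name__ == "__main__":
    print("snippet vs post, resemblance:", resemblance(snippet, post))
    print("snippet inside post, containment:", containment(snippet, post))
    print("post vs me-too repost, resemblance:", resemblance(post, me_too))
```

A snippet scores low on resemblance against the full post but high on containment, while a "me too" repost scores high on both, so thresholding the two values separately can tell the cases apart.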

I think you'd have to use classic line-based differencing algorithms: http://stackoverflow.com/questions/236031/how-to-realize-a-diff-function http://stackoverflow.com/questions/145607/text-difference-algorithm http://stackoverflow.com/questions/3144/best-diff-algorithm — Jeff Atwood, Jan 21 '09 at 01:02
This might be too simplistic an approach for the task at hand, since I would like to remove near-duplicates, such as someone quoting most of a post and adding something like "me too", which is just spam. — dpavlin, Apr 26 '10 at 17:42

score 2 · Accepted Answer · answered Apr 26 '10 at 17:44


SpotSigs seems to fit the bill just right. Here are some references:

The source code for this module is hosted on GitHub:

http://github.com/jzawodn/perl-text-spotsig

answered Apr 26 '10 at 17:44 by dpavlin

That Jeremy Z. GitHub link isn't the link to the source. If you look at that repo, it is empty. The source to SpotSigs can be found here: http://www.mpi-inf.mpg.de/~mtb/ – Nate Murray Jun 14 '11 at 14:16
The page mentioned by Nate has moved, this is the new URL: http://adrem.ua.ac.be/~tmartin/ – Markus Amalthea Magnuson Jul 31 '15 at 13:27