MySQL show possible duplicates

Question

I'm looking to display all possible duplicate entries from my MySQL table from column name.

For example, Adobe Photoshop and Photoshop should be shown as 'possible duplicates.'

Is this possible?

This is not a simple task. select substring_index('Mike A.',' ',1) for example returns Mike but what does it happen if you have "B. Mike"? I'm afraid that there are a lot of cases you have to consider. — Nicola Cossu, Apr 25 '11 at 19:42
SOUNDEX, maybe? See: http://stackoverflow.com/questions/4936371/find-rows-with-duplicate-similar-column-values-mysql — magma, Apr 25 '11 at 19:43
How do you know that Mike M is a duplicate of Mike A exactly? — Thomas, Apr 25 '11 at 19:44
@AJ probably not, I would need some form of threshold. I have a large list of 15,000 entries, some of which are semi-duplicates and are unnecessary (and I like a clean database). — stewart715, Apr 25 '11 at 23:31
@Thomas, names are probably a bad idea. My aim is to remove things like 'Adobe Photoshop' vs 'Photoshop' — stewart715, Apr 25 '11 at 23:31

score 1 · Accepted Answer · answered Apr 25 '11 at 20:43

You can start with using SOUNDEX(), this will probably do for what you need

The drawbacks of SOUNDEX() are:

it's inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
for MySQL, at least according to the docs, SOUNDEX is broken for unicode input

Example:

SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')

/* all of these return 'M262' */

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

In any case, an example Levenshtein distance function for MySQL can be found at Levenshtein Distance as a MySQL Stored Function .

MySQL show possible duplicates

1 Answers1