0

I'm looking to display all possible duplicate entries from my MySQL table from column name.

For example, Adobe Photoshop and Photoshop should be shown as 'possible duplicates.'

Is this possible?

stewart715
  • 5,557
  • 11
  • 47
  • 80
  • This is not a simple task. select substring_index('Mike A.',' ',1) for example returns Mike but what does it happen if you have "B. Mike"? I'm afraid that there are a lot of cases you have to consider. – Nicola Cossu Apr 25 '11 at 19:42
  • SOUNDEX, maybe? See: http://stackoverflow.com/questions/4936371/find-rows-with-duplicate-similar-column-values-mysql – magma Apr 25 '11 at 19:43
  • 2
    How do you know that Mike M is a duplicate of Mike A exactly? – Thomas Apr 25 '11 at 19:44
  • 1
    Would "Mike A." be a possible duplicate for "John A."? – AJ. Apr 25 '11 at 19:52
  • @AJ probably not, I would need some form of threshold. I have a large list of 15,000 entries, some of which are semi-duplicates and are unnecessary (and I like a clean database). – stewart715 Apr 25 '11 at 23:31
  • @Thomas, names are probably a bad idea. My aim is to remove things like 'Adobe Photoshop' vs 'Photoshop' – stewart715 Apr 25 '11 at 23:31

1 Answers1

1

You can start with using SOUNDEX(), this will probably do for what you need

The drawbacks of SOUNDEX() are:

  • it's inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
  • the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
  • for MySQL, at least according to the docs, SOUNDEX is broken for unicode input

Example:

SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')

/* all of these return 'M262' */

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

In any case, an example Levenshtein distance function for MySQL can be found at Levenshtein Distance as a MySQL Stored Function .

Pentium10
  • 204,586
  • 122
  • 423
  • 502