
I have a two-column table containing a primary key and company names, with about 20,000 rows.

My task is to find all duplicate entries.

I originally tried soundex, but it matched companies that were completely different just because their first words sounded similar. That led me to the Levenshtein distance algorithm.

The problem is that the query takes an indefinite amount of time. I've left it running for about 10 hours now and it still hasn't returned.

Here is the query:

SELECT * 
FROM `Companies` a, `Companies` b 
WHERE levenshtein(a.name, b.name)<5 
AND a.id<>b.id

And here is the levenshtein function I'm using (taken from this post):

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);

    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00, j = 1, i = 1, c = 0;

    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN
                    SET cost = 0;
                ELSE
                    SET cost = 1;
                END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    RETURN c;
END$$
DELIMITER ;

What can I do to speed up the query?

Nikita240
  • If there is a very good chance that at least the first character (I guess some could start with a number) of each name is the same if it represents the same company, you could group them by first letter and find the Levenshtein distance between the names in each group, starting at the second character. That should reduce the number of computations somewhat. – Andrew Morton Jun 08 '14 at 14:49
  • Maybe this is the case of a problems that is best done outside the database. Take a look at http://stevehanov.ca/blog/index.php?id=114 too ... – David Tonhofer Jun 08 '14 at 14:50
  • Is it mandatory to do it in SQL ? How about selecting all the data and passing it through Levenstein in code ? – Radu Murzea Jun 08 '14 at 15:02
  • Andrew Morton, could you demonstrate how I can use a levenstein with a group by? I can't figure out how to execute a levenstein on a group result. – Nikita240 Jun 08 '14 at 16:10
  • I have no idea. Maybe `WHERE LEFT(a.name, 1) = LEFT(b.name, 1) AND a.id < b.id AND levenshtein(a.name, b.name)<5` would be enough. I assume you have an index on name. – Andrew Morton Jun 08 '14 at 16:46
  • I actually just tried that, the query has been going on for about 45min now, still nothing. – Nikita240 Jun 08 '14 at 17:10
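Picking up the suggestions in these comments, here is a rough sketch of doing the comparison outside the database: pull `id` and `name` out once, bucket the names by first letter as Andrew Morton suggests, and compare within each bucket using a Levenshtein implementation that gives up as soon as the distance is guaranteed to reach the threshold. The helper names (`levenshtein_capped`, `find_near_duplicates`) are hypothetical, not from the question:

```python
from collections import defaultdict
from itertools import combinations

def levenshtein_capped(s1, s2, cap):
    """Standard dynamic-programming Levenshtein distance, but bail out
    early once every cell in the current row already meets `cap`."""
    if abs(len(s1) - len(s2)) >= cap:
        return cap  # the length difference alone rules this pair out
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if min(cur) >= cap:
            return cap  # row minima never decrease, so we can stop here
        prev = cur
    return prev[-1]

def find_near_duplicates(rows, max_distance=5):
    """rows: iterable of (id, name) pairs, e.g. fetched once with
    'SELECT id, name FROM Companies'. Returns candidate duplicate id pairs."""
    buckets = defaultdict(list)
    for company_id, name in rows:
        buckets[name[:1].lower()].append((company_id, name))
    pairs = []
    for group in buckets.values():
        for (id_a, name_a), (id_b, name_b) in combinations(group, 2):
            if levenshtein_capped(name_a, name_b, max_distance) < max_distance:
                pairs.append((id_a, id_b))
    return pairs
```

Because `combinations` only yields each unordered pair once, this also bakes in the `a.id < b.id` trick from the answers below, and the early-exit cap skips most of the DP table for clearly different names.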

4 Answers


I know at least one optimization that might cut the running time in half:

AND a.id < b.id

This prevents you from re-testing the pair a=2, b=1 after you've already tested a=1, b=2.

It's still going to be O(n²), though, and I can't see how you can do much about that.
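To put numbers on that, a quick back-of-the-envelope check in Python for the 20,000 rows in the question:

```python
from math import comb

n = 20_000  # rows in the question's Companies table

ordered_pairs = n * (n - 1)    # a.id <> b.id evaluates every pair twice
unordered_pairs = comb(n, 2)   # a.id < b.id evaluates each pair once

print(ordered_pairs, unordered_pairs)  # 399980000 199990000
```

Halving roughly 400 million UDF calls helps, but either way the stored function still runs hundreds of millions of times.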

wvdz

So I implemented a bunch of suggestions in this thread to reduce my query time.

I indexed the name column, changed a.id <> b.id to a.id < b.id to avoid comparing each pair of rows twice, and added LEFT(a.name, 3) = LEFT(b.name, 3) to avoid executing the heavy levenshtein function on rows that can be cheaply excluded by their first 3 characters.

This was the query I used:

SELECT * 
FROM `Companies` a, `Companies` b  
WHERE LEFT(a.name, 3) = LEFT(b.name, 3) 
AND a.id < b.id 
AND levenshtein(a.name, b.name)<3

This took about 2 hours to complete, and gave me 964 results. After that, I exported the results as a .csv and imported them into another table, TABLE 2. TABLE 2 is structured like this:

COL 1, COL 2, COL 3, COL 4
a.id, a.name, b.id, b.name

I noticed that a lot of the results in TABLE 2 were actually different companies that happened to be only a couple of characters apart, which makes the Levenshtein distance ineffective at separating them. For example: "Body FX" vs "Body Fit", or "Baxco" vs "Baxyl".

I attempted to filter out more false positives by comparing the last 2 characters with RIGHT(), but ran into problems because some names were plural, like "Aroostock Medical Center" and "Aroostock Medical Centers". So I wrote my own RIGHT_PLURAL() function that ignores the plural suffix.

DROP FUNCTION IF EXISTS RIGHT_PLURAL;
DELIMITER $$
CREATE FUNCTION RIGHT_PLURAL(input VARCHAR(50), right_input INT)
    RETURNS VARCHAR(50)
BEGIN
    DECLARE length INT;
    SET length = CHAR_LENGTH(input); -- character count; LENGTH() counts bytes

    IF RIGHT(input, 2)="'s" THEN
        RETURN SUBSTR(input, length-right_input-1, right_input);
    ELSEIF RIGHT(input, 1)="s" THEN
        RETURN SUBSTR(input, length-right_input, right_input);
    ELSE
        RETURN RIGHT(input, right_input);
    END IF;
END;
$$
DELIMITER ;
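For reference, the same logic mirrored in Python (a hypothetical helper, handy for unit-testing the SQL function's edge cases):

```python
def right_plural(name, n):
    """Take the last n characters of name, ignoring a trailing
    "s" or "'s" -- a mirror of the RIGHT_PLURAL SQL function above."""
    if name.endswith("'s"):
        name = name[:-2]
    elif name.endswith("s"):
        name = name[:-1]
    return name[-n:]
```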

I ran

SELECT * 
FROM `TABLE 2` 
WHERE RIGHT_PLURAL(`COL 2`, 2) = RIGHT_PLURAL(`COL 4`, 2)

and was down to 893 duplicates. I was satisfied. I copied the result set over to TABLE 3 and ran the following:

DELETE 
FROM `Companies` 
WHERE `id` IN ( SELECT `COL 1` FROM `TABLE 3` )

My database was now largely duplicate-free! The few strays left were due to serious misspellings of names.

Nikita240

How similar are the names of the duplicate entries?

If they are exact you could just group by the name:

SELECT REPLACE(name, ' ', '') as name, count(id) as totalDuplicates 
FROM `Companies`
GROUP BY REPLACE(name, ' ', '')

Anything with a count greater than 1 is a duplicate.
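The same normalize-then-group idea sketched in Python, extended to strip hyphens, commas, and periods as well as spaces (the helper names are made up for illustration):

```python
import re
from collections import defaultdict

def normalize(name):
    """Lowercase the name and drop spaces, hyphens, commas, and
    periods so cosmetic differences don't defeat the grouping."""
    return re.sub(r"[ \-,.]", "", name.lower())

def group_exact_duplicates(rows):
    """rows: iterable of (id, name) pairs. Groups ids whose names are
    identical after normalization, like the GROUP BY query above."""
    groups = defaultdict(list)
    for company_id, name in rows:
        groups[normalize(name)].append(company_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

This only catches duplicates that are identical up to separators and case, so it complements rather than replaces the Levenshtein pass.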

Darren
  • Unfortunately they are not that similar. Some of them have the words joined together, or hyphens instead of spaces, you know the deal. "Manual entry" is the word I would use lol. – Nikita240 Jun 08 '14 at 14:36
  • I hear ya! Do they at least have all the same words? I had similar problems in a database I inherited. I was able to clear out a lot of duplicates by using a replace on the institution name and clearing out spaces, hyphens, commas, etc. So the group by was working on a string without white spaces and other garbage. Didn't solve everything but cleaned the list by about 90%. It'll never solve the problem of someone entering UGA instead of University of Georgia, but got me close. – Darren Jun 08 '14 at 14:43
  • Not necessarily. In some instances, 2 words are joined together in a duplicate entry. – Nikita240 Jun 08 '14 at 14:51

Google released a tool for cleaning up messy data called 'Refine': http://code.google.com/p/google-refine/ Maybe you could try it and see if it helps in this case.

Chaoley