Does anyone know of a MySQL implementation of the Damerau–Levenshtein distance algorithm as a stored procedure/function that takes a single specified string as a parameter and looks for fuzzy matches of the string in a particular field within a particular table?
I have found various procedure/function code examples that compares two specified strings and works out the distance, but firstly this is only the Levenshtein distance algorithm, and not the Damerau–Levenshtein one, and secondly, I'm not looking to compare two strings but find fuzzy matches in a field of my choosing that are similar to my specified string.
I'm basically trying to put together a fuzzy keyword searcher in MySQL.
Asked
Active
Viewed 6,265 times
3

user1236443
- 549
- 2
- 8
- 19
-
I need an algorithm that's more flexible and not limited to English, and can handle transpositions. Soundex seems to return lots of false postives. – user1236443 Jan 09 '13 at 11:01
-
I think also Double Metaphone was designed for names as opposed to large text. My search field contains a lot of text. – user1236443 Jan 09 '13 at 11:07
-
1Broken link in Waleed's comment, try this: http://dev.mysql.com/doc/refman/5.6/en/string-functions.html#function_soundex – William T. Mallard May 08 '14 at 16:40
3 Answers
3
There is an ongoing development in Github to modify Sean Collins code so it has UTF-8 support and is case-insensitive.
Example:
mysql> select damlevlim('camión', 'çamion', 6);
+--------------------------------------+
| damlevlim('camión', 'çamion', 6) |
+--------------------------------------+
| 0 |
+--------------------------------------+
1 row in set (0.00 sec)
This is specially useful when doing fuzzy matches.
mysql> select word,damlevlim(word, 'camion') as dist from wordslist where damlevlim(word, 'camion', 7)<1 limit 2;
+--------+------+
| word | dist |
+--------+------+
| camión | 0 |
| camios | 1 |
+--------+------+
2 row in set (0.00 sec)

diego
- 529
- 4
- 6
2
In MySQL Levenshtein and Damerau-Levenshtein UDF’s you have several implementations of this algorithm.
-
Thanks. Looking at the example of on how to use them, it appears that it is still comparing the two specified strings. Is this right? – user1236443 Jan 09 '13 at 13:02
-
This algorithm is well described in http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance. – Mihai8 Jan 09 '13 at 13:20
-
The UDF install process doesn't work for me. If I do CREATE FUNCTION damlev RETURNS INTEGER SONAME 'damlev.dll'; I get Error Code: 1126. Can't open shared library 'damlev.dll' (errno: 126 The specified module could not be found.) – user1236443 Jan 09 '13 at 13:20
-
Actually got it to work so will play around with it. Thanks for the link. – user1236443 Jan 09 '13 at 13:25
-
In fact, do you know how I can use the damlev function to search a string in a field that have a distance of 5 or less. On the link provided, there is an example using the Levenshtein function SELECT * FROM mytable WHERE Levenshtein("cow",mytable.field) < = 5. Can you do then same with the demlev function? – user1236443 Jan 09 '13 at 13:33
-
In http://stackoverflow.com/questions/10727174/dameraulevenshtein-distance-edit-distance-with-transposition-c-implementation you have a discussion about implementation of this algorithm. – Mihai8 Jan 09 '13 at 13:36
-
For others looking for word search (and not string compare) check out Lucene. http://lucene.apache.org/ – William T. Mallard May 08 '14 at 16:37
2
This seems to be an old topic, however should anyone look for a MYSQL implementation of Damerau-Levenshtein distance, here is my own implementation (based upon a simple Levenshtein found elsewhere on this site), which works fine for strings less than 255 characters long. The third parameter can be set to FALSE to retrieve the basic Levenshtein distance:
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255), dam BOOL)
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char, s2_char CHAR;
-- max strlen=255
DECLARE cv0, cv1, cv2 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
SET s2_char = SUBSTRING(s2, j, 1);
IF s1_char = s2_char THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
IF dam THEN
IF i>1 AND j>1 AND s1_char = SUBSTRING(s2, j-1, 1) AND s2_char = SUBSTRING(s1, i-1, 1) THEN
SET c_temp = CONV(HEX(SUBSTRING(cv2, j-1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
END IF;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
IF dam THEN SET CV2 = CV1; END IF;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END