31
$word = strtolower($_GET['term']); 

$lev = 0;

$q = mysql_query("SELECT `term` FROM `words`"); 
while($r = mysql_fetch_assoc($q)) 
{ 
    $r['term'] = strtolower($r['term']); 

    $lev = levenshtein($word, $r['term']);

    if($lev >= 0 && $lev < 5)
    {
        $word = $r['term'];
    }
}

How can I move all that into just one query? Don't want to have to query through all terms and do the filtering in PHP.

Peter O.
  • 32,158
  • 14
  • 82
  • 96

8 Answers8

59

You need a levenshtein function in MySQL and query like

$word = mysql_real_escape_string($word);
mysql_qery("SELECT `term` FROM `words` WHERE levenshtein('$word', `term`) BETWEEN 0 AND 4");
user229044
  • 232,980
  • 40
  • 330
  • 338
rik
  • 8,592
  • 1
  • 26
  • 21
  • 2
    The queries above will fail unless you change the sql delimiter first. Use `DELIMITER @` before the queries, add your new @ delimiter after entering each query, then change your delimiter back with `DELIMITER ;` – K.A.F. Jun 25 '13 at 20:08
  • 7
    Is it suitable for running against large data sets ? I am trying to run LEVENSHTEIN against 458546 records, query isn't responding. – vishal Aug 29 '13 at 12:13
  • 1
    I get an error when I try to create the function: #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 5 Could someone please pass completely working code? – Pascal Klein Dec 12 '14 at 16:43
  • @vishal No, not in its current form at least. Levenshtein algorithm has a complexity of `O(n*m)`, where `n` is the length of the first word, and `m` is that of the second one. However, I see _several_ ways in which it **could be optimized** if you introduce a **maximum distance parameter**. You first compare two strings if their length difference is greater than the max-distance param. If it is, you _immediately_ return with max-distance value. Then, you check the value of the current distance within each cycle, and if it surpassed the maximum, you return with that value. – John Weisz Feb 11 '15 at 09:49
  • I've got the same issue than Pascal when importing the function : "#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 5 ". I can't find the issue, can someone help ? – Fifi Jul 18 '18 at 06:19
  • I have the same issue...Error Code: 1064 - You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '' at line 5 – Craig Nov 28 '18 at 23:18
11

There are two ways to implement a Levenshtein function in MySQL. The first is to create a STORED FUNCTION which operates much like a STORED TRANSACTION, except it has distinct inputs and an output. This is fine for small datasets, but a little slow on anything approaching several thousand rows.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC

BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
  RETURN 0;
ELSEIF s1_len = 0 THEN
  RETURN s2_len;
ELSEIF s2_len = 0 THEN
  RETURN s1_len;
ELSE
  WHILE j <= s2_len DO
    SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
  END WHILE;
  WHILE i <= s1_len DO
    SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
    WHILE j <= s2_len DO
    SET c = c + 1;
    IF s1_char = SUBSTRING(s2, j, 1) THEN
      SET cost = 0; ELSE SET cost = 1;
    END IF;
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
    IF c > c_temp THEN SET c = c_temp; END IF;
      SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
      IF c > c_temp THEN
        SET c = c_temp;
      END IF;
      SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
    END WHILE;
    SET cv1 = cv0, i = i + 1;
  END WHILE;
END IF;

RETURN c;

END//

Store the above code in a .sql file and import it into your database like so:

source /tmp/mysql_udf.sql

The second method is to implement a User Defined Function in C/C++ and link it into MySQL as a shared library (*.so file). This method also uses a STORED FUNCTION to call the library, which means the actual query for this or the first method may be identical (providing the inputs to both functions are the same). You can find out more about this method here: http://samjlevy.com/mysql-levenshtein-and-damerau-levenshtein-udfs/

With either of these methods, your query would be something like:

SELECT term FROM words WHERE levenshtein(term, 'term') < 5;

Also, remember that the 'threshold' value should change in relation to the original word length. It's better to think of it in terms of a percentage value, i.e. half your word = 50%, half of 'term' = 2.

Ian Atkin
  • 6,302
  • 2
  • 17
  • 24
8

If you have a huge database, you can filter the words first using SOUNDEX:

$word = strtolower(mysql_real_escape_string($_GET['term']));

$rs = mysql_query("SELECT LOWER(`term`) FROM `words` WHERE SOUNDEX(term) = SOUNDEX(" . $word . ")");

while ($row = mysql_fetch_assoc($rs)) { 

    $lev = levenshtein($word, $row['term']);

    ....

}

If you have time enough to play with a C extension or procedure, you may achieve better performance, but filtering the records on mysql before applying real levenshtein will make things faster with almost no effort.

carlosvini
  • 1,742
  • 19
  • 18
6

If you are dealing with very large data sets I have found that it is much more efficient to handle the Levenshtein operations and sorting in PHP than it is in MySQL. e.g. query of about 1000 records:

MySQL( ~ 0.0050s) -> PHP Levenshtein( ~ 1.300s)

vs.

MySQL Levenshtein( >= 5.000s) -> PHP( ~ 0.250s)

There are also many other options for optimizing search engines but if you want to use Levenshtein just be aware of the data you'll be handling and the latencies you want.

John Rausch
  • 186
  • 2
  • 5
1

I suggest you including the call of the levenshtein(link: http://www.artfulsoftware.com/infotree/queries.php#552) into your query.

You should use mysqli_query($q) because mysql_query($q) is deprecated and may be removed in future php versions!

$word = mysql_real_escape_string($word);
$query = "SELECT `term` FROM `words` WHERE levenshtein('$word', `term`)   BETWEEN 0 AND 4";
mysqli_qery($query);
Coder55
  • 559
  • 5
  • 17
1

You can make this code look a bit neater but @profitphp is right, you can't doing it in MySQL without a levenstein library.

$word = strtolower($_GET['term']);

$q = mysql_uqery("SELECT LOWER(`term`) FROM `words`");

while($r = mysql_fetch_assoc($q)) { 

    $lev = levenshtein($word, $r['term']);

    ....

}
mvbl fst
  • 5,213
  • 8
  • 42
  • 59
0

I do this in Oracle by implementing the algorithm in PL/SQL inside a function that can be called.

mkj
  • 2,761
  • 5
  • 24
  • 28
Randy
  • 16,480
  • 1
  • 37
  • 55
  • Excellent, so what you are telling us is that this problem requires some code! – Drumbeg Mar 29 '17 at 07:40
  • Sorta- yes - the question was about how to structure the solution 'in one query' - i advise to move the complication to a function to simplify the query itself. – Randy Mar 29 '17 at 17:19
-3

That is one query. If you're asking if you can move the levenshtein functionality to mysql, you can't.

Ok, well you can, but its not any easier than just doing it in php.

http://www.artfulsoftware.com/infotree/queries.php?&bw=1280#552

profitphp
  • 8,104
  • 2
  • 28
  • 21