Is it possible in MySQL to select all strings from a table that have only one character difference from a given string?
eg. Iphone 5
and Iphone 5s
or Model x-1
and Models x1
Is it possible in MySQL to select all strings from a table that have only one character difference from a given string?
eg. Iphone 5
and Iphone 5s
or Model x-1
and Models x1
If you can add a user function to MySQL, then you could use Levenshtein's distance. See this other question for the code.
Then you can query WHERE LEVENSHTEIN(description, 'iphone 5') <= 2
for example. You would find "iphone 5S", and also "ipohne 5", which might be a plus.
Otherwise, specific cases are easy (e.g, REGEX 'iphone.*'
or similar), but the general case would be a nightmare to implement.
The cost of "text distance" calculations is staggering. So, before resorting to it, do whatever it takes to reduce the number of calculations you need.
For example, basic Levenshtein assumes insertions and deletions to cost 1 each. This means that "good" and "excellent" will have a minimum distance of 5, just because one is five characters longer. And this information is a difference of two LENGTH
s that cost next to nothing.
If you can partition the rows based on the first letter, you can calculate the distance between two remainders of equal length and add 0 if the first letters are equal, 1 if they are not (if the remainders are not of equal length, you risk introducing an error: 'SLAUGHTER' and 'LAUGHTER' are 1 insertion apart, but comparing 'LAUGHTER' and 'AUGHTER' also yield 1, leading to a score of 1+1=2 instead of 1+0=1. 'LAUGHTER' and 'DAUGHTER' instead, being the same length, would yield 1 for the first letter plus the difference between 'AUGHTER' and 'AUGHTER', which is 0).
This method (sometimes called Jordan set reduction) might divide your word list from 10,000 words to say 50 buckets of 200 words each, with compares only between bucket subsets of cardinality 4; that means about 50*(4*200)^2 = 32 million compares instead of 10000^2 = 100 million, a 68% reduction right there.
(Actually, assuming a word length from 5 to 15 and all letters being equally represented, your list would be divided in about 250 buckets of 40 words each; 250*(440)^2 is 6.4 million. If you want a maximum of two character difference, that's 250(2*40)^2 = 1.6 million. The time saved might be well worth the cost of bucketing the words).
Once you have whittled down the numbers of rows where an actual compare needs to be done, then call Levenshtein on the remainder.
This version is modified to accept a third parameter, MAXCOST. When the maxcost for a complete Levenshtein subloop is reached, the search function will simply abort. This cuts down on those cases where the Levenshtein distance is unreasonable, at the expense of those cases where the distance is reasonable.
So if you want strings with distance less than 3, you would use WHERE levenshtein(string1, string2, 3) < 3
. All strings with a minimum single loop distance of 3 or above will now return 3, and be excluded. This makes the function fail safe, i.e. it will never over-report distances, and it will under-report distances only in case of failure (so, if two strings have distance 1, the function will never return more than 1. But if they have distance 14 and the limit is set to 10, it might under-report a distance of 12 instead of 14).
This means that if the great part of your queries currently returns small distances, this function will not be convenient for you. If you have few close matches and a lot of large distances, then this will work.
Test, using short distances. Keep in mind that the unmodified Levenshtein function here will return in 1.49 milliseconds on my machine.
SELECT
BENCHMARK(1000, levenshtein('mogilifski', 'mogiliski', 999)) AS _,
levenshtein('mogilifski', 'mogiliski', 999) AS result;
1 row in set (1.656 sec)
So, distance 1 will yield a net loss - it is 12% slower. Lowering the limit will yield no change in timings, because the function never reaches higher than 2.
Now let us try with very different strings: 'mogiliski' and 'fridrich'. In this case, the real distance is 9, and lowering the limit from 999 to 3 reduces the execution time from 1.47 to 0.71 ms - a 50% saving.
More extreme cases fare much better: 'Darmok and Jalad at Tanagra' and 'Kiteo, his eyes open' are longer strings, with a distance of 23. Execution time is normally 9.672 ms (it is 9.05 with the unmodified version). Lowering the limit at 5 lowers the execution time to 2.89 ms. Lowering further at 3, since it is immediately obvious that the distance will never be so small, terminates immediately before the first subloop yielding 0.03 ms.
This is the modified code of the original function. You will have to drop the old function before defining this new one, unless you change the name.
DELIMITER $$
DROP FUNCTION IF EXISTS LEVENSHTEIN $$
CREATE FUNCTION LEVENSHTEIN(s1 VARCHAR(255) CHARACTER SET utf8, s2 VARCHAR(255) CHARACTER SET UTF8, maxcost INT)
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost, mincost INT;
DECLARE s1_char CHAR CHARACTER SET utf8;
-- max strlen=255 for this function
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1),
s2_len = CHAR_LENGTH(s2),
cv1 = 0x00,
j = 1,
i = 1,
c = 0;
IF (s1 = s2) THEN
RETURN (0);
ELSEIF (s1_len = 0) THEN
RETURN (s2_len);
ELSEIF (s2_len = 0) THEN
RETURN (s1_len);
END IF;
IF (ABS(s1_len - s2_len) > maxcost) THEN
RETURN maxcost;
END IF;
WHILE (j <= s2_len) DO
SET cv1 = CONCAT(cv1, CHAR(j)),
j = j + 1;
END WHILE;
WHILE (i <= s1_len) DO
SET s1_char = SUBSTRING(s1, i, 1),
c = i,
cv0 = CHAR(i),
j = 1;
SET mincost = 255;
WHILE (j <= s2_len) DO
SET c = c + 1,
cost = IF(s1_char = SUBSTRING(s2, j, 1), 0, 1);
SET c_temp = ORD(SUBSTRING(cv1, j, 1)) + cost;
IF (c > c_temp) THEN
SET c = c_temp;
END IF;
SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
IF (c > c_temp) THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, CHAR(c)),
j = j + 1;
IF (c < mincost) THEN
SET mincost = c;
END IF;
END WHILE;
IF (mincost > maxcost) THEN
RETURN (mincost);
END IF;
SET cv1 = cv0,
i = i + 1;
END WHILE;
RETURN (c);
END $$
DELIMITER ;