MySQL: how to efficiently select strings that have only one character difference?

Question

Is it possible in MySQL to select all strings from a table that have only one character difference from a given string?

eg. Iphone 5 and Iphone 5s or Model x-1 and Models x1

Must this one character difference be at the end of the string? — Mureinik, Aug 03 '16 at 18:21
It depends, what you mean 1 character difference. Check [this answer](http://stackoverflow.com/questions/3338889/how-to-find-similar-results-and-sort-by-similarity). — Oleg Imanilov, Aug 03 '16 at 18:28

LSerni · Accepted Answer · 2020-07-15T21:52:30.217

If you can add a user function to MySQL, then you could use Levenshtein's distance. See this other question for the code.

Then you can query WHERE LEVENSHTEIN(description, 'iphone 5') <= 2 for example. You would find "iphone 5S", and also "ipohne 5", which might be a plus.

Otherwise, specific cases are easy (e.g, REGEX 'iphone.*' or similar), but the general case would be a nightmare to implement.

Use Levenshtein as a last resort

The cost of "text distance" calculations is staggering. So, before resorting to it, do whatever it takes to reduce the number of calculations you need.

For example, basic Levenshtein assumes insertions and deletions to cost 1 each. This means that "good" and "excellent" will have a minimum distance of 5, just because one is five characters longer. And this information is a difference of two LENGTHs that cost next to nothing.

If you can partition the rows based on the first letter, you can calculate the distance between two remainders of equal length and add 0 if the first letters are equal, 1 if they are not (if the remainders are not of equal length, you risk introducing an error: 'SLAUGHTER' and 'LAUGHTER' are 1 insertion apart, but comparing 'LAUGHTER' and 'AUGHTER' also yield 1, leading to a score of 1+1=2 instead of 1+0=1. 'LAUGHTER' and 'DAUGHTER' instead, being the same length, would yield 1 for the first letter plus the difference between 'AUGHTER' and 'AUGHTER', which is 0).

This method (sometimes called Jordan set reduction) might divide your word list from 10,000 words to say 50 buckets of 200 words each, with compares only between bucket subsets of cardinality 4; that means about 50*(4*200)^2 = 32 million compares instead of 10000^2 = 100 million, a 68% reduction right there.

(Actually, assuming a word length from 5 to 15 and all letters being equally represented, your list would be divided in about 250 buckets of 40 words each; 250*(440)^2 is 6.4 million. If you want a maximum of two character difference, that's 250(2*40)^2 = 1.6 million. The time saved might be well worth the cost of bucketing the words).

Once you have whittled down the numbers of rows where an actual compare needs to be done, then call Levenshtein on the remainder.

Performances

This version is modified to accept a third parameter, MAXCOST. When the maxcost for a complete Levenshtein subloop is reached, the search function will simply abort. This cuts down on those cases where the Levenshtein distance is unreasonable, at the expense of those cases where the distance is reasonable.

So if you want strings with distance less than 3, you would use WHERE levenshtein(string1, string2, 3) < 3. All strings with a minimum single loop distance of 3 or above will now return 3, and be excluded. This makes the function fail safe, i.e. it will never over-report distances, and it will under-report distances only in case of failure (so, if two strings have distance 1, the function will never return more than 1. But if they have distance 14 and the limit is set to 10, it might under-report a distance of 12 instead of 14).

This means that if the great part of your queries currently returns small distances, this function will not be convenient for you. If you have few close matches and a lot of large distances, then this will work.

Test, using short distances. Keep in mind that the unmodified Levenshtein function here will return in 1.49 milliseconds on my machine.

SELECT 
BENCHMARK(1000, levenshtein('mogilifski', 'mogiliski', 999)) AS _,
levenshtein('mogilifski', 'mogiliski', 999) AS result;

1 row in set (1.656 sec)

So, distance 1 will yield a net loss - it is 12% slower. Lowering the limit will yield no change in timings, because the function never reaches higher than 2.

Now let us try with very different strings: 'mogiliski' and 'fridrich'. In this case, the real distance is 9, and lowering the limit from 999 to 3 reduces the execution time from 1.47 to 0.71 ms - a 50% saving.

More extreme cases fare much better: 'Darmok and Jalad at Tanagra' and 'Kiteo, his eyes open' are longer strings, with a distance of 23. Execution time is normally 9.672 ms (it is 9.05 with the unmodified version). Lowering the limit at 5 lowers the execution time to 2.89 ms. Lowering further at 3, since it is immediately obvious that the distance will never be so small, terminates immediately before the first subloop yielding 0.03 ms.

This is the modified code of the original function. You will have to drop the old function before defining this new one, unless you change the name.

DELIMITER $$
DROP FUNCTION IF EXISTS LEVENSHTEIN $$
CREATE FUNCTION LEVENSHTEIN(s1 VARCHAR(255) CHARACTER SET utf8, s2 VARCHAR(255) CHARACTER SET UTF8, maxcost INT)
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost, mincost INT;
    DECLARE s1_char CHAR CHARACTER SET utf8;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);

    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;

    IF (s1 = s2) THEN
      RETURN (0);
    ELSEIF (s1_len = 0) THEN
      RETURN (s2_len);
    ELSEIF (s2_len = 0) THEN
      RETURN (s1_len);
    END IF;

    IF (ABS(s1_len - s2_len) > maxcost) THEN
        RETURN maxcost;
    END IF;


    WHILE (j <= s2_len) DO
      SET cv1 = CONCAT(cv1, CHAR(j)),
          j = j + 1;
    END WHILE;

    WHILE (i <= s1_len) DO
      SET s1_char = SUBSTRING(s1, i, 1),
          c = i,
          cv0 = CHAR(i),
          j = 1;
      SET mincost = 255;
      WHILE (j <= s2_len) DO
        SET c = c + 1,
            cost = IF(s1_char = SUBSTRING(s2, j, 1), 0, 1);

        SET c_temp = ORD(SUBSTRING(cv1, j, 1)) + cost;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET cv0 = CONCAT(cv0, CHAR(c)),
            j = j + 1;
        IF (c < mincost) THEN
            SET mincost = c;
         END IF;
      END WHILE;
      IF (mincost > maxcost) THEN
         RETURN (mincost);
      END IF;

      SET cv1 = cv0,
          i = i + 1;
    END WHILE;

    RETURN (c);
  END $$

DELIMITER ;

I did add the function as described but what about performance? What if I had a table of 20,000 rows? — Francesco, Aug 03 '16 at 20:02
i tried and it works well, but i think in my mysql it takes about 5 minutes to give me results :( — Francesco, Aug 03 '16 at 20:13
The function does not work well with even enourmous limits. select levenshtein('mogilifski', 'mogiliski', 10) returns 10 when the distance is 1. If you run select levenshtein('mogilifski', 'mogiliski', 20) you do get 1. But the first case makes the function practically useless. If updated I would gladly change my opinion — Adrián E, Jul 15 '20 at 17:54
@AdriánE you are correct. There was an entire family of edge cases that broke the "quick exit" concept. I have amended the code accordingly. — LSerni, Jul 15 '20 at 21:33
@LSerni I haven't thoroughly tested it, but at least in the cases I tried it now works perfectly! Of course I reversed my vote on the answer and I also leave you a bunch of gratitude for taking the time to correct it!! — Adrián E, Jul 16 '20 at 14:45
@AdriánE **You** have *my* thanks! You don't think I'm *happy* having faulty code going around with my name on it, do you? :-D — LSerni, Jul 16 '20 at 15:58

MySQL: how to efficiently select strings that have only one character difference?

1 Answers1

Use Levenshtein as a last resort

Performances