3

I'm using MYSQL's FULL TEXT search functionality (in Mysql 5.6.33).

If I do a MATCH in NATURAL LANGUAGE mode, for a postcode, with a one-character typo, i get some decent results back, including the results with the "right" postcode, but they're not near the top.

For example, there are 10 schools with postcode "BN2 1TL". I deliberately misspell this as "BN2 1TM" and do a search as follows:

SELECT record_id, address_string, 
  MATCH (address_string) AGAINST ("BN2 1TM" IN NATURAL LANGUAGE MODE) AS score 
  FROM schools 
  WHERE MATCH (address_string) AGAINST ("BN2 1TM" IN NATURAL LANGUAGE MODE) > 0 
  ORDER BY score DESC;

On closer inspection, it's because the search has bought back all results that have either "BN2" or "1TM" in their address_string column, and they all have exactly the same score, and so are in random order, effectively. .

This is perfectly reasonable behaviour, but it would be great if I could get the score to take "closeness" into account, meaning that, for a search on "BN2 1TM", "BN2 1TL" would be scored more highly than "BN2 3PQ". Is there a way to do this?

EDIT: I remembered that this type of closeness is technically called "Levenshtein distance", which is a reference to the Levenshtein algorithm for determining how many replacements are needed to turn one string into another. So I guess my question could be like "Can i get MYSQL FULLTEXT NATURAL LANGUAGE MODE scoring to take Levenshtein distance into account"?

O. Jones
  • 103,626
  • 17
  • 118
  • 172
Max Williams
  • 32,435
  • 31
  • 130
  • 197
  • Interesting question. It might help us if you explain why you are abandoning Lucene for MySQL FULLTEXT. Most people with problems like yours abandon MySQL for Lucene when they have problems like yours. Please [edit] your question. – O. Jones Mar 11 '19 at 11:13
  • 1
    @O.Jones I don't mean to be rude, but I actually would prefer **not** to get into a conversation about my motivation for using MySQL FULLTEXT, as it's not relevant to the question. – Max Williams Mar 11 '19 at 11:17
  • 1
    @O.Jones I have removed the reference to Lucene since it possibly (and indeed apparently) distracts from the point of the question. – Max Williams Mar 11 '19 at 11:19
  • 1
    You need [levenshtein distance](https://stackoverflow.com/a/4671557/4964822) and then you can do an orderBy maybe. – nice_dev Mar 11 '19 at 11:26
  • @vivek_23 I just said the same thing in an edit funnily enough. I guess that I would need to order by MATCH score, then a Levenshtein function. Does that exist in MYSQL at all? – Max Williams Mar 11 '19 at 11:28
  • @MaxWilliams No, but you can make a procedure and call it as [shown in this site](http://www.artfulsoftware.com/infotree/qrytip.php?id=552). – nice_dev Mar 11 '19 at 11:33

1 Answers1

3

First off, MySQL fulltext is not nearly as good at open-ended searching as dedicated systems like Lucene.

There's an algorithm, called the Levenshtein distance, that computes the number of transformations of characters -- the distance -- to change one string into another.

So, changing "BN2 1TM" into "BN2 1MT" (a transposition) has a distance of 2. Changing it into "BN2 1TX" has a distance of 1.

Levenshtein distance isn't terribly useful for phrases unless they're almost exactly the same. Changing "Apache Sphinx" into "MySQL FULLTEXT" gives a distance of 14, the length of the longer string. But it's useful for postcodes, partnumbers, and other short structured words.

You could try something like this to get the nearest values first.

  SELECT city, county, postcode
    FROM table
   ORDER BY levenshtein(postcode, 'BN2 1MT') ASC

Then, all you need is a stored function to compute Levenshtein distances. (This isn't built into FULLTEXT.)

From this source, here is such a stored function. But beware, it's not fast and it cannot use an index. So if you can narrow down the search before you do this, you'll get better performance.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
    BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        -- max strlen=255
        DECLARE cv0, cv1 VARBINARY(256);

        SET s1_len = CHAR_LENGTH(s1), 
            s2_len = CHAR_LENGTH(s2), 
            cv1 = 0x00, 
            j = 1, 
            i = 1, 
            c = 0;

        IF s1 = s2 THEN
            RETURN 0;
        ELSEIF s1_len = 0 THEN
            RETURN s2_len;
        ELSEIF s2_len = 0 THEN
            RETURN s1_len;
        ELSE
            WHILE j <= s2_len DO
                SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
            END WHILE;
            WHILE i <= s1_len DO
                SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
                WHILE j <= s2_len DO
                    SET c = c + 1;
                    IF s1_char = SUBSTRING(s2, j, 1) THEN
                        SET cost = 0; ELSE SET cost = 1;
                    END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                    IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                    IF c > c_temp THEN
                        SET c = c_temp;
                    END IF;
                    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
                END WHILE;
                SET cv1 = cv0, i = i + 1;
            END WHILE;
        END IF;
        RETURN c;
    END$$
DELIMITER ;
O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • Thanks a lot. I think you're right, I think a good approach would be to get the FULLTEXT results as normal. Then, if there's a "tie" for first place, score-wise, I could run the Levenshtein on the tied-for-first-place results to further order them. – Max Williams Mar 11 '19 at 11:37