41

if i have two strings in mysql:

@a="Welcome to Stack Overflow"
@b=" Hello to stack overflow";

is there a way to get the similarity percentage between those two string using MYSQL? here for example 3 words are similar and thus the similarity should be something like:
count(similar words between @a and @b) / (count(@a)+count(@b) - count(intersection))
and thus the result is 3/(4 + 4 - 3)= 0.6
any idea is highly appreciated!

Lina
  • 2,090
  • 4
  • 20
  • 23
  • 2
    A [Levenshtein](http://en.wikipedia.org/wiki/Levenshtein_distance) based (at word level) distance seems like a good algorithm – RichardTheKiwi Mar 16 '11 at 09:45

4 Answers4

50

you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):

CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11)
    DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    DECLARE cv0, cv1 text; 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END

and for getting it as XX% use this function

CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS int(11)
    DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END
cHao
  • 84,970
  • 20
  • 145
  • 172
Alaa
  • 4,471
  • 11
  • 50
  • 67
  • 6
    For beginners: if you want to correctly run the CREATE FUNCTION statements, you must set DELIMITER in advance. See http://stackoverflow.com/a/6740975/2293304 – Rockallite Jun 10 '15 at 09:10
  • 2
    The updated version is here: http://www.artfulsoftware.com/infotree/qrytip.php?id=552 – Rockallite Jun 11 '15 at 03:53
  • @Rockallite Beware , the updated version only uses VARCHAR(255) and thus only compares the first 255 characters – nl-x Jul 28 '17 at 10:21
  • 2
    It takes forever... don't bother using... tested on MyIsam and LONGTEXT field, server gone. – Martin Zvarík Nov 27 '18 at 12:32
  • Have you ever used this in production? "SELECT levenshtein_ratio('abcd', 'abc') FROM result" returns in my db 4343 records with the value 75. not really usefull. Or what I do wrong :) – Honsa Stunna Mar 18 '19 at 10:10
  • Hello friends, the above function, at all, is not reliable, the position and size of the string is not reliable comparison parameter. See a more precise function at https://gist.github.com/phpenterprise/be9560412824e5b05e1c1b3f38266b57, thanks! – Patrick Otto Jun 25 '19 at 15:36
  • To hopefully save someone a little bit of time: I went through @PatrickOtto's gist's comments and stuck with this one. – The Onin Mar 20 '21 at 19:04
  • What about performances ? – Loenix Jan 17 '22 at 07:25
9

You can try the SOUNDEX algorithm, take a look here :)

SOUNDEX MySQL

EDIT 1:

Maybe this link about natural language processing with MySQL could be useful

Natural Language Full-Text Searches

How to find similar results and sort by similarity?

HTH!

Community
  • 1
  • 1
SubniC
  • 9,807
  • 4
  • 26
  • 33
  • 2
    SELECT SOUNDEX('Welcome to Stack Overflow'); is W42532321614 \n SELECT SOUNDEX(' Hello to Stack Overflow'); is H432321614 \n so what!!what does it mean :( – Lina Mar 16 '11 at 09:10
  • With the same value the pronunciation of the word is the same, you can take a look here for more details https://secure.wikimedia.org/wikipedia/en/wiki/Soundex, You can try the Levenshtein distance aswel to get a numeric value representing the number of changes (insertions, deletions and modification) you have to make in a sentence to looks like other. https://secure.wikimedia.org/wikipedia/en/wiki/Levenshtein_distance – SubniC Mar 16 '11 at 09:14
  • 5
    Take into consideration that SOUNDEX only works with english. – SubniC Mar 16 '11 at 09:16
  • thanks for notifying me :) i have a lot of other languages in my DB :( – Lina Mar 16 '11 at 09:20
  • This kind of algorithms are language dependant... so you need a specific one :) the task you are doing is not a trivial thing. – SubniC Mar 16 '11 at 09:23
  • @SubniC: I disagree, it depends on the approach to the problem. Soundex definitely doesn't work well for other languages and needs specific - per language - solution. Neville's approach seems to work independently of language. – ypercubeᵀᴹ Jun 17 '11 at 22:27
  • Hello friends, the above function, at all, is not reliable, the position and size of the string is not reliable comparison parameter. See a more precise function at https://gist.github.com/phpenterprise/be9560412824e5b05e1c1b3f38266b57, thanks! – Patrick Otto Jun 25 '19 at 15:36
7

I don't think there's a nice, single-step query way to do this - the natural language stuff is designed mostly for "google-like" search, which sounds different to what you're trying to do.

Depending on what you're actually trying to do - I assume you've left out a lot of detail - I would:

  • create a table into which you split each string into words, all in lower case, stripping out spaces and punctuation - in your example, you'd end up with:

    string_id               word
    
    1                       hello
    1                       from
    1                       stack
    1                       overflow
    2                       welcome
    2                       from
    2                       stack
    2                       overflow
    

You can then run queries against that table - e.g.

select count(*)
from  stringWords
where string_id = 2
and word in 
  (select word 
  from stringWords
  where string_id = 1);

gives you the intersection.

You can then create a function or similar to calculate similarity according to your formula.

Not very clean, but it should perform pretty snappily, it's mostly relational, and it should be largely language independent. To deal with possible typos, you could calculate the soundex - this would allow you to compare "stack" with "stak" and see how similar they really are, though this doesn't work reliably for languages other than English.

Neville Kuyt
  • 29,247
  • 1
  • 37
  • 52
1

This might be of help to you if you do not want to write your own algorithms :

http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html

DhruvPathak
  • 42,059
  • 16
  • 116
  • 175
  • i tried match against already... it gives sucks results... it return larger result for less similar article!! – Lina Mar 16 '11 at 09:12