Levenshtein distance on diacritic characters

Question

In PHP I am calculating Levenshtein distance using function levenshtein(). For simple characters it works as expected, but for diacritic characters like in example

echo levenshtein('à', 'a');

it returns "2". In this case only one replacement has to be done, so I expect it to return "1".

Am I missing something?

you probably need a multibyte compatible levenshtein implementation, [try this one](https://gist.github.com/santhoshtr/1710925) (first one on google) — bruchowski, Oct 09 '14 at 06:29
It is explained nicely [here](http://php.net/manual/en/function.levenshtein.php#113702). — vascowhite, Oct 09 '14 at 06:32

score 3 · Answer 1 · answered Oct 09 '14 at 06:36

I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-

The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.

Example with a french accented word : - levenshtein('notre', 'votre') = 1 - levenshtein('notre', 'nôtre') = 2 (huh ?!)

You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.

Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.

Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.

<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>

Results (for about 6000 calls) - reference time core C function (single-byte) : 30 ms - utf8 to ext-ascii conversion + core function : 90 ms - full php implementation : 3000 ms

score 2 · Accepted Answer · answered Oct 09 '14 at 06:34

The default PHP levenshtein(), like many PHP functions, is not multibyte aware. So, when processing strings with Unicode characters, it handles each byte separately and changes two bytes.

There is no multibyte version (i.e. mb_levenshtein()) so you have two options:

1) Re-implement the function yourself, using mb_ functions. Possible example code from a Gist:

<?php
function levenshtein_php($str1, $str2){
    $length1 = mb_strlen( $str1, 'UTF-8');
    $length2 = mb_strlen( $str2, 'UTF-8');
    if( $length1 < $length2) return levenshtein_php($str2, $str1);
    if( $length1 == 0 ) return $length2;
    if( $str1 === $str2) return 0;
    $prevRow = range( 0, $length2);
    $currentRow = array();
    for ( $i = 0; $i < $length1; $i++ ) {
        $currentRow=array();
        $currentRow[0] = $i + 1;
        $c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
        for ( $j = 0; $j < $length2; $j++ ) {
            $c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
            $insertions = $prevRow[$j+1] + 1;
            $deletions = $currentRow[$j] + 1;
            $substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
            $currentRow[] = min($insertions, $deletions, $substitutions);
        }
        $prevRow = $currentRow;
    }
    return $prevRow[$length2];
}

2) Convert your string's Unicode characters to ASCII. If you are specifically wanting to calculate Levenshtein differences from diacritic characters to non-diacritics, though, this is probably not what you want.

I consider 1 option a better suited for my case, as I will have to calculate Levenshtein distance for other unicode characters as well. But I made little corrections to it: 1. `if( $length1 == 0 ) return $length2;` changed to `if( $length2 == 0 ) return $length1;` (because at that point $length1 will always be bigger or equal to $length2). 2. moved `if( $str1 === $str2) return 0;` to start of the function, as it will give result instantly. — Murval, Oct 09 '14 at 08:31
Wouldn't `if( $length1 == 0 ) return $length2;` be best to replace with `if( $length1 == 0 ) return 0;`, since at that point if `$length1` is 0, than both strings are empty. And in fact if you have moved `if( $str1 === $str2) return 0;` to the start, that case was already handled, so after all with your 2nd change the 1st is indifferent, the whole `if( $length1 == 0 ) return 0;` can be thrown out. — Bence Szalai, Feb 13 '20 at 22:43

Levenshtein distance on diacritic characters

2 Answers2

Linked