I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms