I want to normalize a set of Unicode characters. However, when a character carries more than one accent (combining mark), canonical normalization doesn't work as I expected. The test code I used is below:

<?php
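// Splits a UTF-8 string into individual code points and prints them
// separated by spaces, so each combining mark is shown on its own.
function print_codes($s) {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    echo implode(' ', $chars) . '<br>';
}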

$k1 = 'කො';
$k2 = 'කො';
$k3 = 'කාෙ';

print_codes($k1);
print_codes($k2);
print_codes($k3);

echo '*Normalizer*<br>';
// normalizer_normalize() is provided by the intl extension; FORM_C is NFC.
$k1 = normalizer_normalize($k1, Normalizer::FORM_C);
$k2 = normalizer_normalize($k2, Normalizer::FORM_C);
$k3 = normalizer_normalize($k3, Normalizer::FORM_C);

print_codes($k1);
print_codes($k2);
print_codes($k3);
?>

Note that `$k2` normalizes to the same sequence as `$k1`, but `$k3` does not.

P.S. Although the three variables might appear to contain the same character, they have different accent ordering (please copy and paste them into a Unicode-enabled editor to see this).
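Since the three strings look identical on screen, it may be easier to compare them as code points. The following is only a minimal sketch: `code_points` is a hypothetical helper that is not part of the test above, it assumes PHP 7.2+ with the mbstring extension (for `mb_ord`), and the escape sequences are the code points the three strings are meant to contain (the same sequences that appear in the NFC transform link in the comments).

```php
<?php
// Hypothetical helper (not in the original test): returns the code points
// of a UTF-8 string in U+XXXX notation. Assumes mbstring (mb_ord, PHP 7.2+).
function code_points(string $s): string {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    foreach ($chars as $c) {
        $out[] = sprintf('U+%04X', mb_ord($c, 'UTF-8'));
    }
    return implode(' ', $out);
}

// The three inputs written with explicit escapes so the ordering is visible.
$k1 = "\u{0D9A}\u{0DDC}";          // precomposed vowel sign
$k2 = "\u{0D9A}\u{0DD9}\u{0DCF}";  // U+0DD9 followed by U+0DCF
$k3 = "\u{0D9A}\u{0DCF}\u{0DD9}";  // U+0DCF followed by U+0DD9

echo code_points($k1), "\n"; // U+0D9A U+0DDC
echo code_points($k2), "\n"; // U+0D9A U+0DD9 U+0DCF
echo code_points($k3), "\n"; // U+0D9A U+0DCF U+0DD9
```

Calling `code_points()` on the results of `normalizer_normalize()` as well should show exactly which sequences changed.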

Gayan Dasanayake
  • The three samples are rendered identically in the latest versions of Chrome and Mozilla. I have a database of words that were entered inconsistently; the main issue is characters with different accent ordering. They need to be normalized for searching over the database. – Gayan Dasanayake Jul 11 '16 at 04:06
  • The order of U+0DCF and U+0DD9 should be normalized in this example. k2 does normalize to k1, but k3 does not. – Gayan Dasanayake Jul 11 '16 at 04:14
  • `U+0DCF` and `U+0DD9` are `ccc=0`, so their order should not change. [This page](http://unicode.org/cldr/utility/transform.jsp?a=NFC&b=%E0%B6%9A%E0%B7%9C%0D%0A%E0%B6%9A%E0%B7%99%E0%B7%8F%0D%0A%E0%B6%9A%E0%B7%8F%E0%B7%99%0D%0A) shows the correct normalisation. – 一二三 Jul 11 '16 at 06:41
  • Possible duplicate of [Why Normalizer::normallize (PHP) doesn't works?](http://stackoverflow.com/questions/18527704/why-normalizernormallize-php-doesnt-works) – Paul Sweatte Aug 16 '16 at 05:28
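
The `ccc=0` point in the comments can be checked from PHP itself. This is a rough sketch rather than a definitive test: it assumes the intl extension is loaded (for `IntlChar` and `Normalizer`), and the variable names are illustrative only. Canonical reordering only moves marks whose canonical combining class is non-zero, so two marks with class 0 keep the order in which they were typed.

```php
<?php
// Sketch assuming the intl extension (IntlChar, Normalizer) is available.
// Print the canonical combining class (ccc) of the two vowel signs.
$marks = [0x0DCF, 0x0DD9];
foreach ($marks as $cp) {
    printf("U+%04X ccc=%d\n", $cp, IntlChar::getCombiningClass($cp));
}

$k1 = "\u{0D9A}\u{0DDC}";
$k2 = "\u{0D9A}\u{0DD9}\u{0DCF}";
$k3 = "\u{0D9A}\u{0DCF}\u{0DD9}";

// Compare the NFC results byte for byte instead of by eye.
var_dump(Normalizer::normalize($k2, Normalizer::FORM_C) === $k1);
var_dump(Normalizer::normalize($k3, Normalizer::FORM_C) === $k1);
```

If both classes print as 0, NFC has no grounds to swap the marks in `$k3`, so matching such entries for search would need an application-level mapping on top of normalization.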
