20

Is it possible to sort an array of Unicode / UTF-8 strings in PHP using a natural order algorithm? For example (this array is already in the correct order):

$array = array
(
    0 => 'Agile',
    1 => 'Ágile',
    2 => 'Àgile',
    3 => 'Âgile',
    4 => 'Ägile',
    5 => 'Ãgile',
    6 => 'Test',
);

If I try with asort($array) I get the following result:

Array
(
    [0] => Agile
    [6] => Test
    [2] => Àgile
    [1] => Ágile
    [3] => Âgile
    [5] => Ãgile
    [4] => Ägile
)

And using natsort($array):

Array
(
    [2] => Àgile
    [1] => Ágile
    [3] => Âgile
    [5] => Ãgile
    [4] => Ägile
    [0] => Agile
    [6] => Test
)

How can I implement a function that returns the correct result order (0, 1, 2, 3, 4, 5, 6) under PHP 5? All the multibyte string functions (mbstring, iconv, ...) are available on my system.

EDIT: I want to natsort() the values, not the keys. The only reason I'm explicitly defining the keys (and using asort() instead of sort()) is to make it easier to see where the sorting of Unicode values went wrong.

hippietrail
Alix Axel

5 Answers

25

The question is not as easy to answer as it seems at first glance. This is one of the areas where PHP's lack of Unicode support hits you with full force.

First of all, natsort(), as suggested by other posters, has nothing to do with sorting arrays of the kind you want to sort. What you're looking for is a locale-aware sorting mechanism, because how strings with extended characters are sorted always depends on the language in use. Take German, for example: A and Ä can sometimes be sorted as if they were the same letter (DIN 5007/1), and sometimes Ä is sorted as if it were in fact "AE" (DIN 5007/2). In Swedish, in contrast, Ä comes at the end of the alphabet.

If you don't use Windows, you're in luck, as PHP provides some functions to do exactly this. Using a combination of setlocale(), usort(), strcoll() and the correct UTF-8 locale for your language, you get something like this:

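$array = array('Àgile', 'Ágile', 'Âgile', 'Ãgile', 'Ägile', 'Agile', 'Test');
// Passing '0' returns the current locale without changing it.
$oldLocale = setlocale(LC_COLLATE, '0');
setlocale(LC_COLLATE, '<<your_RFC1766_language_code>>.utf8');
usort($array, 'strcoll');
// Restore the locale that was active before sorting.
setlocale(LC_COLLATE, $oldLocale);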

Please note that it's mandatory to use the UTF-8 locale variant in order to sort UTF-8 strings. I reset the locale in the example above to its original value because setting a locale using setlocale() can introduce side effects in other running PHP scripts; please see the PHP manual for more details.
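The save-and-restore pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: `sortWithLocale()` is a hypothetical name, the `'C'` locale in the usage line is only a placeholder, and `try`/`finally` assumes PHP 5.5 or later.

```php
<?php
// Hypothetical helper (not from the original answer): sorts a copy of $values
// with strcoll() under the first available locale in $candidates, then
// restores the previous LC_COLLATE setting.
function sortWithLocale(array $values, array $candidates)
{
    // Passing '0' only queries the current locale; it changes nothing.
    $previous = setlocale(LC_COLLATE, '0');

    // setlocale() tries each candidate in turn and returns false if none
    // of them could be set.
    if (setlocale(LC_COLLATE, $candidates) === false) {
        return null;
    }

    try {
        usort($values, 'strcoll');
    } finally {
        // Always restore the old locale, even if sorting throws.
        setlocale(LC_COLLATE, $previous);
    }

    return $values;
}

// Usage: replace 'C' with the UTF-8 locale for your language, e.g. 'de_DE.utf8'.
$sorted = sortWithLocale(array('b', 'a', 'c'), array('C'));
```

Checking the return value of setlocale() matters because the set of installed locales differs from system to system.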

If you are on a Windows machine, there is currently no solution to this problem, and I assume there won't be one before PHP 6. Please see my own question on SO targeting this specific problem.

Stefan Gehrig
  • Great insight. I'm developing on Windows, but this will run on *nix machines. If I'm not mistaken, PHP 5.3 will support this kind of sorting through some kind of class; however, I want to refrain from relying on setlocale() for two main reasons: 1) it's unpredictable (it depends on the locales the OS has available) and 2) it's not thread-safe and may cause unexpected behavior on the server. – Alix Axel May 07 '09 at 07:13
  • Sorting using a multibyte version of the ord() function gives me exactly the same results as a simple sort(). =( – Alix Axel May 07 '09 at 07:17
  • Regarding your first comment: you're absolutely right that the solution presented in my answer is not the one you might expect, as it's neither portable nor free of side effects. But it's the only one right now, besides implementing your own sorting on a character and byte level using, for example, ext/mbstring. – Stefan Gehrig May 07 '09 at 07:28
  • Regarding my second comment, I used the mbstring extension to code a multibyte equivalent of the original PHP ord() function, but the results it gave me were the same as those of the sort() function. – Alix Axel May 08 '09 at 03:14
  • Can MySQL be used to provide a steady workaround to this problem? – Alix Axel May 08 '09 at 03:14
  • Yes, sorting the data on the MySQL server would be a feasible workaround. MySQL does not suffer from those limitations. You can control the sort order by choosing the right collation for your data. – Stefan Gehrig May 08 '09 at 06:40
13

Nailed it!

$array = array('Ägile', 'Ãgile', 'Test', 'カタカナ', 'かたかな', 'Ágile', 'Àgile', 'Âgile', 'Agile');

function Sortify($string)
{
    return preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1' . chr(255) . '$2', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}

array_multisort(array_map('Sortify', $array), $array);

Output:

Array
(
    [0] => Agile
    [1] => Ágile
    [2] => Âgile
    [3] => Àgile
    [4] => Ãgile
    [5] => Ägile
    [6] => Test
    [7] => かたかな
    [8] => カタカナ
)
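
To see why the output comes out in this order, it can help to look at the sort keys themselves. The sketch below is an illustration rather than part of the original answer: `sortKey()` is a hypothetical copy of `Sortify()`, and `bin2hex()` is used only to make the `chr(255)` separator visible.

```php
<?php
// Hypothetical copy of Sortify(), renamed so the keys can be inspected.
function sortKey($string)
{
    // htmlentities() turns 'Á' into '&Aacute;'; the regex then rewrites that
    // entity as base letter + chr(255) + accent name.
    return preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i',
        '$1' . chr(255) . '$2',
        htmlentities($string, ENT_QUOTES, 'UTF-8')
    );
}

foreach (array('Agile', 'Ágile', 'Àgile') as $word) {
    // Print each key as hex so the 0xff byte is visible.
    echo $word, ' => ', bin2hex(sortKey($word)), PHP_EOL;
}
```

Because byte 0xFF compares higher than any ASCII letter, the plain 'Agile' sorts before every accented variant. The accented variants are then ordered by entity name (acute, circ, grave, tilde, uml), which is alphabetical order of the names rather than a linguistic rule.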

Even better:

if (extension_loaded('intl') === true)
{
    collator_asort(collator_create('root'), $array);
}

Thanks to @tchrist!
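
The `root` collator applies the default Unicode Collation Algorithm ordering. As another answer here points out, the right order depends on the language, and the same intl API accepts a locale for that. A minimal sketch, assuming the intl extension is loaded; `sortForLocale()` and the word list are made up for illustration:

```php
<?php
// Hypothetical helper: returns $words sorted by the collation rules of $locale.
function sortForLocale(array $words, $locale)
{
    $collator = new Collator($locale);
    $collator->sort($words); // sorts in place and reindexes the array
    return $words;
}

if (extension_loaded('intl')) {
    $words = array('Zebra', 'Ärger', 'Afrika', 'Aerobic');

    // German dictionary order, Swedish order, and German phone-book order
    // (the last one uses ICU's "collation" locale keyword).
    foreach (array('de', 'sv', 'de@collation=phonebook') as $locale) {
        echo $locale, ': ', implode(', ', sortForLocale($words, $locale)), PHP_EOL;
    }
}
```

With a typical ICU build, German should place Ärger next to the other A words, while Swedish should place it after Zebra.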

Alix Axel
  • Sounds like what you really need here is the Unicode Collation Algorithm (UCA). I have a Perl demonstration of that [in this answer](http://stackoverflow.com/questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python/5024116#5024116), where I provide a shell-callable version of it for those who may not have a proper library to call. Perhaps that might help here, too. – tchrist Feb 19 '11 at 13:07
  • @tchrist: UCA is what I'm looking for, I'll take a closer look at your answer in a bit, thanks for the heads up! ;) – Alix Axel Feb 20 '11 at 01:28
1
natsort($array);
$array = array_values($array);
JW.
0

I also have another workaround for those for whom setlocale doesn't work and who don't have the intl module enabled:

// The array to be sorted
$countries = array(
  'AT' => 'Österreich',
  'DE' => 'Deutschland',
  'CH' => 'Schweiz',
);

// Extend this array to your needs.
$utf_sort_map = array(
  "ä" => "a",
  "Ä" => "A",
  "Å" => "A",
  "ö" => "o",
  "Ö" => "O",
  "ü" => "u",
  "Ü" => "U",
);

uasort($countries, function($a, $b) use ($utf_sort_map) {
  $initial_a = mb_substr($a, 0, 1);
  $initial_b = mb_substr($b, 0, 1);

  if (isset($utf_sort_map[$initial_a]) || isset($utf_sort_map[$initial_b])) {
    if (isset($utf_sort_map[$initial_a])) {
      $initial_a = $utf_sort_map[$initial_a];
    }

    if (isset($utf_sort_map[$initial_b])) {
      $initial_b = $utf_sort_map[$initial_b];
    }

    if ($initial_a == $initial_b) {
      return mb_substr($a, 1) < mb_substr($b, 1) ? -1 : 1;
    }
    else {
      return $initial_a < $initial_b ? -1 : 1;
    }
  }

  return $a < $b ? -1 : 1;
});
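
The comparator above only maps the first character of each string. A variation that folds the whole string with `strtr()` before comparing could look like the sketch below; `foldedCompare()` is a hypothetical name, and the map is the same kind of hand-maintained table, so any character missing from it still sorts by its raw bytes.

```php
<?php
// Hypothetical comparator: compares two strings after replacing every mapped
// character, and falls back to a byte comparison of the originals on a tie.
function foldedCompare($a, $b, array $map)
{
    $cmp = strcmp(strtr($a, $map), strtr($b, $map));
    return $cmp !== 0 ? $cmp : strcmp($a, $b);
}

$countries = array(
    'AT' => 'Österreich',
    'DE' => 'Deutschland',
    'CH' => 'Schweiz',
);

// Extend this map to your needs.
$map = array(
    'ä' => 'a', 'Ä' => 'A', 'Å' => 'A',
    'ö' => 'o', 'Ö' => 'O',
    'ü' => 'u', 'Ü' => 'U',
);

// uasort() keeps the country codes attached to their names.
uasort($countries, function ($a, $b) use ($map) {
    return foldedCompare($a, $b, $map);
});
```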
Елин Й.
0

I struggled with this issue when using asort.

Sorting:

Array
(
    [xa] => África
    [xo] => Australasia
    [cn] => China
    [gb] => Reino Unido
    [us] => Estados Unidos
    [ae] => Emiratos Árabes Unidos
    [jp] => Japón
    [lk] => Sri Lanka
    [xe] => Europa Del Este
    [xw] => Europa Del Oeste
    [fr] => Francia
    [de] => Alemania
    [be] => Bélgica
    [nl] => Holanda
    [es] => España
)

put África at the end. I solved it with this dirty little piece of code (which is fit for my purpose and my timeframe):

$sort = array();
foreach($retval AS $key => $value) {
    $v = str_replace('ä', 'a', $value);
    $v = str_replace('Ä', 'A', $v);
    $v = str_replace('Á', 'A', $v);
    $v = str_replace('é', 'e', $v);
    $v = str_replace('ö', 'o', $v);
    $v = str_replace('ó', 'o', $v);
    $v = str_replace('Ö', 'O', $v);
    $v = str_replace('ü', 'u', $v);
    $v = str_replace('Ü', 'U', $v);
    $v = str_replace('ß', 'S', $v);
    $v = str_replace('ñ', 'n', $v);
    $sort[] = "$v|$key|$value";
}
sort($sort);

$retval = array();
foreach($sort AS $value) {
    $arr = explode('|', $value);
    $retval[$arr[1]] = $arr[2]; 
}
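
The same idea can be written without packing everything into a `|`-delimited string, which would break if a value itself contained `|`. A minimal sketch, assuming a hand-written replacement map; `sortByFoldedValue()` is a hypothetical name, not part of the original answer:

```php
<?php
// Hypothetical helper: sorts $items by an accent-folded copy of each value
// while keeping the original keys and values.
function sortByFoldedValue(array $items, array $map)
{
    // Build a parallel array of folded sort keys, indexed by the same keys.
    $folded = array();
    foreach ($items as $key => $value) {
        $folded[$key] = strtr($value, $map);
    }

    // asort() orders the folded strings and preserves the keys.
    asort($folded, SORT_STRING);

    // Rebuild the original values in the new key order.
    $sorted = array();
    foreach ($folded as $key => $unused) {
        $sorted[$key] = $items[$key];
    }
    return $sorted;
}

$retval = sortByFoldedValue(
    array('de' => 'Alemania', 'be' => 'Bélgica', 'xa' => 'África'),
    array('Á' => 'A', 'é' => 'e')
);
```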
Anthony
  • Are you French? You might wanna check my answer to this question, my `preg_replace` approach does the transliteration a little better and the `array_multisort` function also preserves the association of values and non-numeric keys. – Alix Axel Apr 19 '11 at 15:25