Python not sorting unicode properly. Strcoll doesn't help

Question

I've got a problem with sorting lists using unicode collation in Python 2.5.1 and 2.6.5 on OSX, as well as on Linux.

import locale   
locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]

Which should print:

[u'a', u'ą', u'z']

But instead prints out:

[u'a', u'z', u'ą']

In short, it looks as if strcoll is broken. I tried it with various types of input (e.g. encoded, non-unicode strings) and got the same result.

What am I doing wrong?

Best regards, Tomasz Kopczuk.

What does `locale.getlocale(LC_COLLATE)` return after your setlocale line? — Amber, Aug 05 '10 at 08:33
The `locale` module uses the locale API from the C library, so if there is an error it must be in the C library. An equivalent test with locale `de_DE.UTF-8` and string `ä` instead of `ą` works correctly. Even if I use the German locale with `ą` the order is correct, so there must be something wrong with the Polish locale implementation in the C library. As a workaround you can convert the string to normalization form D using `unicodedata.normalize`, then even the naive `strcmp` ordering should work. — Philipp, Aug 05 '10 at 08:48
OK, I'm interested in this too. I tried it with `pl_PL.UTF-8` and `de_DE.UTF-8`, and also with `sort(key=locale.strxfrm)` instead of `strcoll`, also on OS X, and for the moment I am getting your incorrect result. String `ä` with de_DE.UTF8 did not work for me. — chryss, Aug 05 '10 at 08:54
Works for me on Linux but not Mac. Maybe OS X's collation tables are wrong, or something? FWIW POSIX locales are dodgy for webapps as they're per-process, not thread safe. — bobince, Aug 05 '10 at 09:02
+1 Works for me on Linux (Ubuntu) but neither on Mac nor FreeBSD. — viam0Zah, Mar 31 '11 at 09:42
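
Philipp's normalization workaround from the comments above could look roughly like this on Python 3. This is a minimal sketch, not a full Polish collation: `nfd_key` is a hypothetical helper name, and the approach only helps for letters that have a canonical decomposition (ą does; ł, for example, does not).

```python
import unicodedata


def nfd_key(s):
    """Sort key: the string in Unicode normalization form D (decomposed)."""
    # 'ą' becomes 'a' followed by U+0328 COMBINING OGONEK, so a plain
    # code point comparison places it right after 'a' and before 'z'.
    return unicodedata.normalize("NFD", s)


words = ["a", "z", "ą"]
print(sorted(words, key=nfd_key))  # ['a', 'ą', 'z']
```

No locale is involved here, so the result does not depend on the platform's collation tables, but it also ignores any language-specific rules.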

score 18 · Accepted Answer · edited Jun 19 '16 at 14:29

Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).

On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh the joy of using macports).

PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]

which gives:

[u'a', u'ą', u'z']

Another pro - @bobince: it's thread-safe, so not useless when setting request-wise locales.

edited Jun 19 '16 at 14:29

Kay

answered Aug 05 '10 at 09:37

Tomek Kopczuk

Good question, and good answer -- and you're ahead of everyone by a few steps, which is no wonder if you're in Poland :) . Anyhow, this is the second time I've seen issues with Python where it relies on underlying C libraries. Do you know where these could be brought up? – chryss Aug 05 '10 at 09:44
I think it might be a problem with the libraries themselves, rather than Python. But as gnibbler pointed out - it happens to work in some OSes, so maybe, at least this particular issue, has been fixed at some point. OS X is famous for using old gcc and so, and the other OS I tested was Fedora 8 - which itself is not quite contemporary. I would bring this up at one of the mailing lists for the underlying C libraries. Cheers mate :) – Tomek Kopczuk Aug 05 '10 at 09:58
I agree. I made a Gist http://gist.github.com/509520 and will give it to a few people to try out. I *love* i18n, but the bugs make it tedious. – chryss Aug 05 '10 at 10:34

chryss · Answer 2 · 2010-08-06T15:35:33.237

6

Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.

EDIT: Still on the same system, I find that for the UTF-8 versions of de_DE, fr_FR, pl_PL, the problem is there, but for the ISO-8859-1 versions of fr_FR and de_DE, sort order is correct. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:

The order for Polish ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.

The order for Polish Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, UTF8.

The order for German Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH DIAERESIS
The LC_COLLATE culture and encoding settings were de_DE, UTF8.

The order for German ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER A WITH DIAERESIS
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.

The order for French ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.

The order for French Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER E WITH ACUTE
The LC_COLLATE culture and encoding settings were fr_FR, UTF8.
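
For readers who want to repeat this check without compiling C, the C library's strcoll() can also be called directly from Python through ctypes. This is only a sketch under assumptions: it targets POSIX systems, `c_strcoll` and `sign` are hypothetical helper names, and the pl_PL.UTF-8 locale may not be installed on a given machine.

```python
import ctypes
import ctypes.util
import locale

# Load the C library. find_library may return None, in which case CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strcoll.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.strcoll.restype = ctypes.c_int


def c_strcoll(a, b, encoding="utf-8"):
    """Call the C library's strcoll() on two encoded strings."""
    return libc.strcoll(a.encode(encoding), b.encode(encoding))


def sign(n):
    """Reduce a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


try:
    locale.setlocale(locale.LC_COLLATE, "pl_PL.UTF-8")
except locale.Error:
    print("pl_PL.UTF-8 is not installed; comparing in the current locale")

# A negative sign means the first argument sorts first.
print("a vs ą:", sign(c_strcoll("a", "ą")))
print("ą vs z:", sign(c_strcoll("ą", "z")))
```

If both lines print -1, the C library collates ą between a and z; a 1 on the second line reproduces the behaviour reported above.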

edited Aug 06 '10 at 15:35

answered Aug 05 '10 at 23:09

chryss

Is it possible to decompile `/usr/share/locale/pl_PL.UTF-8/LC_COLLATE` to some sort of readable form? Might not be a gcc bug after all, but wrong collation tables, as @bobince pointed out. – Tomek Kopczuk Aug 06 '10 at 07:17
Well, I get the same behaviour for German and French (ie, characters with diacritics are sorted after "z"), so it's not just the Polish collation tables. I wonder if it doesn't just pick C locale or maybe the default locale (mine is en_GB -- is yours pl_PL?). In any event, it's clearly in the C library, whether in the data or in the code I can't tell. – chryss Aug 06 '10 at 08:09
Yup, mine is pl_PL. But it would be nice to check the collation tables and if they're kosher, then there's the problem with different locale settings being used by the library. But I guess it's the library, hence the problems on various OSes. – Tomek Kopczuk Aug 06 '10 at 14:47
I don't know how the platform-specific collation tables are made, except that they're supposed to be made from the Common Locale Repository http://cldr.unicode.org/ . The more I look into this, the more I think the C library is a very minimal way to account for locale anyway, and that you're better off using ICU for serious work. More testing above: the de_DE and fr_FR ISO locales are OK, but pl_PL is also buggy for ISO. – chryss Aug 06 '10 at 15:37
This problem seems to apply to the other German locales as well – i.e. `de_AT`, `de_CH` in addition to `de_DE` – in both their "standalone" and `UTF-8` versions. `ISO8859-1`, `ISO8859-15` seem fine. Operating system: OS X 10.10.5 (Yosemite) – Kay Jun 18 '16 at 22:02
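
One way to test the suspicion raised in these comments (that the library silently stays on the C or default locale) is to ask Python which LC_COLLATE setting is actually in effect. A rough sketch for Python 3; `collation_report` is a hypothetical helper, and note that it leaves the new locale active when the switch succeeds.

```python
import locale


def collation_report(name):
    """Try to switch LC_COLLATE to `name` and report what is in effect."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        active = locale.setlocale(locale.LC_COLLATE, name)
        installed = True
    except locale.Error:
        # The requested locale is unknown to the C library on this system.
        active = previous
        installed = False
    return {"requested": name, "active": active, "installed": installed}


print(collation_report("pl_PL.UTF-8"))
# A negative value means 'ą' sorts before 'z', which is what Polish
# collation should give.
print(locale.strcoll("ą", "z"))
```

If the locale is reported as installed and active but strcoll still returns a positive value, that points at the collation data or code in the C library rather than at locale selection.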

score 5 · Answer 3 · answered May 25 '16 at 14:47

Here is how I managed to sort Persian correctly (without PyICU, using Python 3.x):

First set the locale:

import locale
import platform

if platform.system() == 'Linux':
    locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')
elif platform.system() == 'Windows':
    locale.setlocale(locale.LC_ALL, 'Persian_Iran.1256')
else:
    pass  # set the appropriate locale name for any other OS here

Then sort using key:

a = ['ا','ب','پ','ت','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف','ق','ک','گ','ل','م','ن','و','ه','ي']

print(sorted(a,key=locale.strxfrm))

For list of Objects:

a = [{'id':"ا"},{'id':"ب"},{'id':"پ"},{'id':"ت"},{'id':"ث"},{'id':"ج"},{'id':"چ"},{'id':"ح"},{'id':"خ"},{'id':"د"},{'id':"ذ"},{'id':"ر"},{'id':"ز"},{'id':"ژ"},{'id':"س"},{'id':"ش"},{'id':"ص"},{'id':"ض"},{'id':"ط"},{'id':"ظ"},{'id':"ع"},{'id':"غ"},{'id':"ف"},{'id':"ق"},{'id':"ک"},{'id':"گ"},{'id':"ل"},{'id':"م"},{'id':"ن"},{'id':"و"},{'id':"ه"},{'id':"ي"}]

print(sorted(a, key=lambda x: locale.strxfrm(x['id'])))

Finally, you can reset the locale to the user's default:

locale.setlocale(locale.LC_ALL, '')
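
The set, sort, reset sequence above can be wrapped so that the previous setting is restored even if sorting raises. This is a sketch only: `collation_locale` is a hypothetical helper, it changes just LC_COLLATE rather than LC_ALL, and because the locale is process-wide it is still not safe to use from several threads at once.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation_locale(name):
    """Temporarily switch LC_COLLATE to `name`, then restore the old value."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


letters = ['پ', 'ب', 'ا']
try:
    with collation_locale('fa_IR.UTF-8'):
        print(sorted(letters, key=locale.strxfrm))
except locale.Error:
    print('fa_IR.UTF-8 is not installed on this system')
```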

score 4 · Answer 4 · answered May 22 '13 at 20:49

@gnibbler, using PyICU with the sorted() function does work in a Python3 Environment. After a little digging through the ICU API documentation and some experimentation, I came across the getSortKey() function:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('de_DE.UTF-8'))
sorted(['a','b','c','ä'],key=collator.getSortKey)

which produces the desired collation:

['a', 'ä', 'b', 'c']

instead of the undesired collation:

sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']

score 2 · Answer 5 · answered Jul 10 '13 at 15:12

import locale
from functools import cmp_to_key
iterable = [u'a', u'z', u'ą']
sorted(iterable, key=cmp_to_key(locale.strcoll))  # locale-aware sort order

(Ref.: http://docs.python.org/3.3/library/functools.html)

Denis St-L

score 1 · Answer 6 · answered Nov 17 '21 at 17:04

Since 2012 there has been a library called natsort. It includes handy functions such as natsorted and humansorted. More importantly, they work with more than just lists. Code:

from natsort import natsorted, humansorted

lst = [u"a", u"z", u"ą"]
dct = {"ą": 1, "ż": 3, "Ż": 4, "b": 5}

lst_natsorted = natsorted(lst)
lst_humansorted = humansorted(lst)
dct_natsorted = dict(natsorted(dct.items()))
dct_humansorted = dict(humansorted(dct.items()))

print("List natsorted: ", lst_natsorted)
print("List humansorted: ", lst_humansorted, "\n")
print("Dictionary natsorted: ", dct_natsorted)
print("Dictionary humansorted: ", dct_humansorted)

Output:

List natsorted:  ['a', 'ą', 'z']
List humansorted:  ['a', 'ą', 'z']

Dictionary natsorted:  {'Ż': 4, 'ą': 1, 'b': 5, 'ż': 3}  
Dictionary humansorted:  {'ą': 1, 'b': 5, 'ż': 3, 'Ż': 4}

As you can see, the results differ when sorting dictionaries, but for the given list both results are correct.

By the way, this library is also great for sorting strings containing numbers:

from natsort import natsorted, humansorted

lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]

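mixed_sorted = sorted(lst_mixed)
mixed_natsorted = natsorted(lst_mixed)
mixed_humansorted = humansorted(lst_mixed)

print("List with mixed strings sorted: ", mixed_sorted)
print("List with mixed strings natsorted: ", mixed_natsorted)
print("List with mixed strings humansorted: ", mixed_humansorted)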

Output:

List with mixed strings sorted:  ['a1', 'a10', 'a9', 'c4', 'c40', 'c5']
List with mixed strings natsorted:  ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
List with mixed strings humansorted:  ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
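
Where installing natsort is not an option, the number-aware part of this ordering can be approximated with the standard library alone. This is a rough sketch, not natsort's actual implementation: `natural_key` is a hypothetical helper, and it does nothing about locale-aware letter ordering.

```python
import re


def natural_key(s):
    """Split a string into text and integer chunks for number-aware sorting."""
    # re.split with a capturing group keeps the digit runs, so 'a10' becomes
    # ['a', '10', '']; digit runs are then compared as integers, not text.
    return [int(tok) if tok.isdecimal() else tok
            for tok in re.split(r"(\d+)", s)]


lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]
print(sorted(lst_mixed, key=natural_key))
# ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
```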

score 0 · Answer 7 · answered Aug 05 '10 at 09:33

On Ubuntu Lucid, sorting with cmp seems to work OK, but my output encoding is wrong.

>>> import locale   
>>> locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
'pl_PL.UTF-8'
>>> print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]
[u'a', u'\u0105', u'z']

Using key with locale.strxfrm does not work, unless I am missing something:

>>> print [i for i in sorted([u'a', u'z', u'ą'], key=locale.strxfrm)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 0: ordinal not in range(128)

John La Rooy

With strxfrm you have to manually encode the unicode string, AFAIK. – Tomek Kopczuk Aug 05 '10 at 09:38
@tkopczuk, It would be nice to find a way to sort using `key` as `cmp` for `sorted` is gone in Python3 – John La Rooy Aug 05 '10 at 10:28
It seems to be working fine with the provided functools.cmp_to_key function (`from functools import cmp_to_key`), like that: `sorted([u'a', u'z', u'ą'], key=cmp_to_key(collator.compare))` – Tomek Kopczuk Aug 05 '10 at 11:52
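
On Python 3, where `cmp` is gone, both key-based routes discussed in these comments are available through the standard library: locale.strxfrm accepts str directly, and locale.strcoll can be wrapped with cmp_to_key. A sketch, assuming the pl_PL.UTF-8 locale is installed; the two wrapper function names are hypothetical, and the result still depends on the platform's collation tables.

```python
import locale
from functools import cmp_to_key


def sorted_by_strxfrm(items):
    """Sort using locale.strxfrm as the key function."""
    return sorted(items, key=locale.strxfrm)


def sorted_by_strcoll(items):
    """Sort using locale.strcoll wrapped as a key function."""
    return sorted(items, key=cmp_to_key(locale.strcoll))


words = ['a', 'z', 'ą']
try:
    locale.setlocale(locale.LC_COLLATE, 'pl_PL.UTF-8')
    print(sorted_by_strxfrm(words))
    print(sorted_by_strcoll(words))
except locale.Error:
    print('pl_PL.UTF-8 is not installed on this system')
```

The strxfrm variant computes each key once, while the cmp_to_key variant calls strcoll for every comparison, so the former is usually the better fit for long lists.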
