Sort dictionary by key using locale/collation

Question

The following code ignores the locale and sorts Égypt last. What's wrong?

dict = {"United States": "United States", "Spain" : "Spain", "England": "England", "Égypt": "Égypt"}

import locale
from collections import OrderedDict

# set the French locale (pass "" instead to use the user's default settings)
locale.setlocale(locale.LC_ALL,"fr_FR")
print OrderedDict(sorted(dict.items(), key=lambda t: t[0], cmp=locale.strcoll))

This is the output:

OrderedDict([('England', 'England'), ('Spain', 'Spain'), ('United States', 'United States'), ('\xc3\x89gypt', '\xc3\x89gypt')])

@Daniel actually you can... it's just bizarre to do so (the result of `key` ends up being passed to `cmp`) – Jon Clements, Mar 05 '14 at 16:34
The biggest problem here is that it's not clear what encoding `locale` would respect for `fr_FR`. — Martijn Pieters, Mar 05 '14 at 16:43
Even when setting `locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")` to match my terminal settings, Egypt is still sorted last. This is exactly as described in [Python not sorting unicode properly. Strcoll doesn't help](http://stackoverflow.com/q/3412933) and it doesn't matter if I decode to unicode first. This is because collation in locales is broken across platforms. — Martijn Pieters, Mar 05 '14 at 16:49
Thus, the conclusion is that this post is a dupe of [Python not sorting unicode properly. Strcoll doesn't help](http://stackoverflow.com/q/3412933) and of other posts mentioning PyICU, since that's the correct answer to this problem. – Martijn Pieters, Mar 05 '14 at 16:50
Copypasted your code into my Python 2.7, got correct answer. — Tigran Saluev, Apr 03 '14 at 05:05
For me (on Linux), the following works: I added `# -*- coding: utf-8 -*-`, added "Angola" to the dict (just to be sure that everything is all right), and changed the locale to `"en_US.UTF-8"`, as I don't have fr_FR on my system. Result: `OrderedDict([(u'Angola', u'Angola'), (u'\xc9gypt', u'\xc9gypt'), (u'England', u'England'), (u'Spain', u'Spain'), (u'United States', u'United States')])` – MarSoft, Apr 08 '14 at 10:03
Do you happen to be on OSX or BSD? I ask because this is an [open bug](http://bugs.python.org/issue23195) with Python on those systems. — SethMMorton, Aug 17 '16 at 06:16
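
For reference, the PyICU route mentioned in the comments above could look roughly like the sketch below. It is written for Python 3 and assumes the third-party PyICU package is installed (it is imported as `icu`). The helper name `sort_items_with_icu` is made up for illustration, and the fallback to plain code point order exists only so the sketch still runs where PyICU is missing.

```python
from collections import OrderedDict

try:
    import icu  # provided by the third-party PyICU package (assumed to be installed)
except ImportError:
    icu = None


def sort_items_with_icu(mapping, locale_name="fr_FR"):
    """Hypothetical helper: sort a mapping by key using ICU collation."""
    if icu is None:
        # Fallback for this sketch only: plain code point order, NOT locale-aware.
        def key_func(item):
            return item[0]
    else:
        collator = icu.Collator.createInstance(icu.Locale(locale_name))

        def key_func(item):
            # getSortKey returns a byte string that compares in collation order
            return collator.getSortKey(item[0])

    return OrderedDict(sorted(mapping.items(), key=key_func))


countries = {"United States": "United States", "Spain": "Spain",
             "England": "England", "Égypt": "Égypt"}
print(list(sort_items_with_icu(countries)))
```

Unlike `locale.setlocale`, the collator is an ordinary object, so nothing process-wide is changed and the result does not depend on which locales the operating system has installed.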

score 2 · Answer 1 · answered Mar 22 '16 at 19:38

Consider the following...

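# -*- coding: utf-8 -*-
import locale
import unicodedata
from collections import OrderedDict

countries = {"United States": "United States", "Spain": "Spain", "England": "England", "Égypt": "Égypt"}

# set the French locale (note: this changes the locale for the whole process)
locale.setlocale(locale.LC_ALL, "fr_FR")

def strip_accents(s):
    # decode the UTF-8 byte string, decompose it (NFD), then drop the non-ASCII combining marks
    return unicodedata.normalize('NFD', s.decode('utf-8')).encode('ASCII', 'ignore')

# compare the accent-stripped keys (a[0], b[0]) using the locale's collation
print OrderedDict(sorted(countries.items(),
                         cmp=lambda a, b: locale.strcoll(strip_accents(a[0]), strip_accents(b[0]))))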

Henry

There really has to be a better solution than writing a ton of code just to change the locale. – imrek Apr 19 '16 at 12:12
It should also be noted that setting the locale can affect other parts of that Python instance. – Marcel Wilson Jun 09 '16 at 14:53
Note: With Python 3 the sorting functions no longer have a `cmp` argument, they just have `key`. You can use `locale.strxfrm` as key to get the same results `cmp` gives with `locale.strcoll`. – Sebastian Riese Apr 01 '22 at 16:46
Note 2: Warning: the locale is global state of the program, so changing it can result in unexpected behaviour (it should *never* be done in library code). – Sebastian Riese Apr 01 '22 at 16:47
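
A minimal Python 3 sketch of what the two notes above describe: `locale.strxfrm` is used as the `key`, and the previous collation setting is restored afterwards to limit the global side effect. The helper name `sort_by_locale` and the default `"fr_FR.UTF-8"` are assumptions for illustration. If that locale is not installed, the sketch keeps the current one. On platforms with broken collation (see the comments on the question) the result may still not be what French ordering requires.

```python
import locale
from collections import OrderedDict


def sort_by_locale(mapping, locale_name="fr_FR.UTF-8"):
    """Hypothetical helper: sort a mapping by key using a locale's collation."""
    # Calling setlocale with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not installed: keep the current one.
        pass
    try:
        # strxfrm transforms a string so that ordinary comparison of the
        # results matches what strcoll would return for the originals.
        return OrderedDict(
            sorted(mapping.items(), key=lambda item: locale.strxfrm(item[0]))
        )
    finally:
        # Undo the process-wide change made above.
        locale.setlocale(locale.LC_COLLATE, previous)


countries = {"United States": "United States", "Spain": "Spain",
             "England": "England", "Égypt": "Égypt"}
print(list(sort_by_locale(countries)))
```

Restoring the setting narrows the window in which other code sees the changed locale, but it does not make the function thread-safe, since the locale is still shared by the whole process while the sort runs.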

score -1 · Answer 2 · answered Apr 09 '14 at 16:20

Here's a work-around.

Use Unicode's canonical decomposition normalization form (NFD): http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

import unicodedata

# utf-8 <-> unicode is left as an exercise to the reader
egypt = unicodedata.normalize("NFD", egypt)  # egypt must be a unicode string, e.g. u'\xc9gypt'

>>> sorted(['Egypt', 'E\xcc\x81gypt', 'US'])
['Egypt', 'E\xcc\x81gypt', 'US']

This doesn't actually take locale into consideration.

Beyond this, try a newer Python (yes, I know) or the ICU library from Martijn's linked question and its answers.

Dima Tisnek

This does not solve the problem; it will result in wrong orderings for many strings, e.g.: `>>> sorted(["Égypt", "Example"], key=lambda x: unicodedata.normalize('NFD', x))` gives `['Example', 'Égypt']`. – Sebastian Riese Apr 01 '22 at 16:52
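
To make the objection above concrete, here is a small Python 3 sketch that contrasts sorting by the raw NFD form with sorting by an accent-stripped key, which is the idea behind the other answer. The function names are made up for illustration. Accent stripping is only an approximation: it is not locale-aware, it does not change how upper and lower case compare, and it leaves letters without a decomposition (such as "ø") untouched.

```python
import unicodedata


def nfd_key(text):
    # Raw NFD: the combining accent (U+0301) sorts after every ASCII letter.
    return unicodedata.normalize("NFD", text)


def strip_accents_key(text):
    # Decompose, then drop the combining marks so "É" compares like "E".
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string is kept as a tie-breaker for a deterministic order.
    return (stripped, text)


words = ["Example", "Égypt", "England"]
print(sorted(words, key=nfd_key))            # ['England', 'Example', 'Égypt']
print(sorted(words, key=strip_accents_key))  # ['Égypt', 'England', 'Example']
```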
