12

The following code ignores the locale and sorts Égypt last. What's wrong?

dict = {"United States": "United States", "Spain" : "Spain", "England": "England", "Égypt": "Égypt"}

import locale
from collections import OrderedDict

# using your default locale (user settings)
locale.setlocale(locale.LC_ALL,"fr_FR")
print OrderedDict(sorted(dict.items(), key=lambda t: t[0], cmp=locale.strcoll))

This is the output:

OrderedDict([('England', 'England'), ('Spain', 'Spain'), ('United States', 'United States'), ('\xc3\x89gypt', '\xc3\x89gypt')])
alasarr
  • I don't think you can specify both `key` and `cmp`. – Daniel Roseman Mar 05 '14 at 16:34
  • @Daniel actually you can... it's just bizarre to do so (the result of `key` ends up being passed to `cmp`) – Jon Clements Mar 05 '14 at 16:34
  • The biggest problem here is that it's not clear what encoding `locale` would respect for `fr_FR`. – Martijn Pieters Mar 05 '14 at 16:43
  • Even when setting `locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")` to match my terminal settings, Egypt is still sorted last. This is exactly as described in [Python not sorting unicode properly. Strcoll doesn't help](http://stackoverflow.com/q/3412933) and it doesn't matter if I decode to unicode first. This is because collation in locales is broken across platforms. – Martijn Pieters Mar 05 '14 at 16:49
  • Thus, the conclusion is that this post is a dupe of [Python not sorting unicode properly. Strcoll doesn't help](http://stackoverflow.com/q/3412933) and of the other posts mentioning PyICU, since that's the correct answer to this problem. – Martijn Pieters Mar 05 '14 at 16:50
  • Copypasted your code into my Python 2.7, got correct answer. – Tigran Saluev Apr 03 '14 at 05:05
  • For me (on Linux), the following works: I added `# -~- coding: utf-8 -~-`, added "Angola" to the dict (just to be sure that everything is all right), and changed the locale to `"en_US.UTF-8"` as I don't have fr_FR on my system. Result: `OrderedDict([(u'Angola', u'Angola'), (u'\xc9gypt', u'\xc9gypt'), (u'England', u'England'), (u'Spain', u'Spain'), (u'United States', u'United States')])` – MarSoft Apr 08 '14 at 10:03
  • Do you happen to be on OSX or BSD? I ask because this is an [open bug](http://bugs.python.org/issue23195) with Python on those systems. – SethMMorton Aug 17 '16 at 06:16

2 Answers

2

Consider the following...

# -*- coding: utf-8 -*-
import locale
import unicodedata
from collections import OrderedDict

dict = {"United States": "United States", "Spain" : "Spain", "England": "England", "Égypt": "Égypt"}

# set the French locale explicitly (Python 2 only: sorted() still accepts cmp)
locale.setlocale(locale.LC_ALL,"fr_FR")

def strip_accents(s):
    # decode the UTF-8 byte string, decompose it (NFD), then drop the non-ASCII combining marks
    return unicodedata.normalize('NFD', s.decode('utf-8')).encode('ASCII', 'ignore')

# compare the dict keys (a[0], b[0]) with their accents stripped
print OrderedDict(sorted(dict.items(), cmp=lambda a, b: locale.strcoll(strip_accents(a[0]), strip_accents(b[0]))))
Henry
  • There really has to be a better solution than writing a ton of code just to change the locale. – imrek Apr 19 '16 at 12:12
  • It should also be noted that setting the locale can affect other parts of that Python instance. – Marcel Wilson Jun 09 '16 at 14:53
  • Note: With Python 3 the sorting functions no longer have a `cmp` argument; they only have `key`. You can use `locale.strxfrm` as the key to get the same results that `cmp` gives with `locale.strcoll`. – Sebastian Riese Apr 01 '22 at 16:46
  • Note 2: Warning: changing the locale alters global state of the program and can result in unexpected behaviour (it should *never* be done in library code). – Sebastian Riese Apr 01 '22 at 16:47
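The two notes above can be combined into a rough Python 3 sketch: sort with `locale.strxfrm` as the `key`, and restore the previous collation setting afterwards so the global change does not leak. The helper name `sort_by_locale` and its default `"fr_FR.UTF-8"` are assumptions made for illustration, not something from the original posts. If that locale is not installed, the sketch silently keeps the current one, and on platforms with broken collation (see the comments on the question) the order may still be wrong.

```python
import locale
from collections import OrderedDict


def sort_by_locale(mapping, locale_name="fr_FR.UTF-8"):
    """Return an OrderedDict of `mapping` sorted by key using locale collation.

    `locale_name` is an assumed default; it must be installed on the system
    to have any effect.
    """
    # Calling setlocale() with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Requested locale is unavailable: keep whatever was active.
        pass
    try:
        # Python 3 has no cmp argument; strxfrm gives a key that sorts
        # the same way strcoll would compare.
        return OrderedDict(
            sorted(mapping.items(), key=lambda item: locale.strxfrm(item[0]))
        )
    finally:
        # The locale is process-wide state, so always put it back.
        locale.setlocale(locale.LC_COLLATE, previous)


countries = {
    "United States": "United States",
    "Spain": "Spain",
    "England": "England",
    "\u00c9gypt": "\u00c9gypt",  # "Égypt", written with an escape to avoid encoding ambiguity
}
print(sort_by_locale(countries))
```

Where "Égypt" ends up depends on the platform and on whether the requested locale exists, so no expected output is shown here.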
-1

Here's a work-around.

Use Unicode's canonical decomposition normalization form (NFD): http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

import unicodedata

# utf-8 <-> unicode is left as exercise to the reader
# (egypt is assumed to already be a unicode string such as u"\xc9gypt")
egypt = unicodedata.normalize("NFD", egypt)

>>> sorted(['Egypt', 'E\xcc\x81gypt', 'US'])
['Egypt', 'E\xcc\x81gypt', 'US']

This doesn't actually take locale into consideration.

Beyond this, try a newer Python (yes, I know) or the ICU library from Martijn's linked question and its answers.

Dima Tisnek
  • This does not solve the problem; it will result in wrong orderings for many strings, e.g.: ```>>> sorted(["Égypt", "Example"], key=lambda x: unicodedata.normalize('NFD', x)) ['Example', 'Égypt']```. – Sebastian Riese Apr 01 '22 at 16:52
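To make the objection above concrete, here is a small Python 3 sketch comparing a plain NFD key with a key that also drops the combining marks, as the first answer does with `encode('ASCII', 'ignore')`. The names `nfd_key` and `accent_insensitive_key` are made up for this illustration. Neither key consults the locale; stripping accents is only an approximation of real collation, and ICU remains the more robust route.

```python
import unicodedata


def nfd_key(s):
    # Plain canonical decomposition: "É" becomes "E" + U+0301.
    return unicodedata.normalize("NFD", s)


def accent_insensitive_key(s):
    # Decompose, drop the combining marks, and compare case-insensitively.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string breaks ties between e.g. "Egypt" and "Égypt".
    return (stripped.casefold(), s)


words = ["\u00c9gypt", "Example", "Egypt"]  # "\u00c9gypt" is "Égypt"

# U+0301 sorts after every ASCII letter, so "Égypt" still lands last.
print(sorted(words, key=nfd_key))
# → ['Egypt', 'Example', 'Égypt']

# With the marks removed, "Égypt" sorts next to "Egypt".
print(sorted(words, key=accent_insensitive_key))
# → ['Egypt', 'Égypt', 'Example']
```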