Sorting strings that contain non-ASCII characters

Question

Here's a bare-bones scenario:

I start ipython3 from a shell prompt like so:

% LANG='fr_FR.UTF-8' ipython3

I confirm in ipython that the locale has changed from my default en_US.UTF-8 locale to the French equivalent.

import locale
print(locale.getlocale())

('fr_FR', 'UTF-8')

I create the following dictionary:

dict = {'O':'3', 'É':'2', 'Œ':'4', 'E':'1', 'Z':'5'}

I proceed to print the entries of dict sorted by key value:

for k, v in sorted(dict.items()):
    print(k, v)

And this is the output:

E 1
O 3
Z 5
É 2
Œ 4

If I understand the French collating rules, this should have been:

E 1
É 2
O 3
Œ 4
Z 5

In other words, a student of the French language would expect to find the word Œuf (egg) somewhere between Odeur and Offense when looking it up in a dictionary… not pushed out of the way after all the words that happen to start with ASCII letters A-Z. The same goes for words that contain letters such as 'é', 'ê', 'è', 'ô', to name a few.

Is this to be expected… and if so, how would I work around this difficulty?

You could make a custom comparator function to pass into `sorted`? - JoshuaF, Nov 30 '20 at 22:03
Try this: https://stackoverflow.com/Questions/1097908/how-do-i-sort-unicode-strings-alphabetically-in-python - Dani Mesejo, Nov 30 '20 at 22:06
Hmm… custom comparator… as in reinventing the wheel? And a different wheel for each and every language I might happen to process? I was aware of question #1097908, but I found it a bit odd that something as basic as sorting would not be taken care of natively by Python without having to use an exotic library… I thought switching to the appropriate locale and running sort() would suffice and save me the trouble of having to understand all the refinements of each and every language's sorting peculiarities… especially since I have no idea where to look for them. - guv', Nov 30 '20 at 23:32
The little time-saving tool I'm working on is supposed to create an HTML version of an index of sorts from the LaTeX version. I used French as an example because that happened to be the language that revealed that the tool was not working correctly. My reasoning was that the French locale was put together by folks who knew what they were doing and had researched the subject. In French there are four accented E's: 'é', 'è', 'ê', and 'ë'… and I have no clue how they should be ordered! So it looks like my best bet is to take another look at the IBM ICU library and see if I can get that to work… - guv', Nov 30 '20 at 23:41
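
For reference, here is a minimal sketch of the standard-library route the comments hint at. Two points seem relevant: sorted() without a key compares strings by Unicode code point and never consults the locale, and Python sets only LC_CTYPE from the environment at startup, so LC_COLLATE has to be set explicitly before locale.strxfrm does anything language-specific. The helper name sorted_by_locale and the variable entries are hypothetical, and the sketch assumes the fr_FR.UTF-8 locale is installed; the resulting order depends on the platform's collation tables, so it may not be what a French dictionary uses on every system.

```python
import locale


def sorted_by_locale(items, locale_name=""):
    """Sort (key, value) pairs by key using the collation of locale_name.

    Hypothetical helper. An empty locale_name means "take the locale
    from the environment" (LANG, LC_ALL, ...).
    """
    # Remember the current collation setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        # Raises locale.Error if the requested locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm maps a string to a form whose ordinary comparison
        # follows the LC_COLLATE rules currently in effect.
        return sorted(items, key=lambda kv: locale.strxfrm(kv[0]))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


entries = {'O': '3', 'É': '2', 'Œ': '4', 'E': '1', 'Z': '5'}

try:
    for k, v in sorted_by_locale(entries.items(), 'fr_FR.UTF-8'):
        print(k, v)
except locale.Error:
    print('fr_FR.UTF-8 is not installed on this system')
```

Passing key=functools.cmp_to_key(locale.strcoll) on the bare strings should behave the same way; strxfrm just avoids repeating the comparison work. If the system locale tables turn out to be unreliable, an ICU-based collator remains the alternative mentioned above.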

