
I am trying to sort a list of strings using the sorted() function. The problem is that the strings contain (French) accented characters. I have tried:

import locale
import functools

locale.setlocale(locale.LC_ALL, 'fr_FR')
test = ('pêche', 'pomme')
sortedtest = sorted(test, key=functools.cmp_to_key(locale.strcoll))

But it doesn't work (returns 'pomme, pêche' instead of 'pêche, pomme'). Could anyone help me?

jonrsharpe
Jeronome
  • How _should_ accented characters sort in French? If they should be treated the same as if they were unaccented, see https://stackoverflow.com/q/517923/3001761. – jonrsharpe Apr 18 '21 at 08:36
  • You can't do this correctly with `sort()` because the rule is too complex. You need the Unicode Collation Algorithm. There is a Python implementation called `pyuca`: `pip install pyuca`. It handles the fact that French collation considers accents only when they are the only way to distinguish two words, for example *ou* and *où*. – BoarGules Apr 18 '21 at 09:27
  • Check out IBM's ICU library. Here's an answer about it in another thread: https://stackoverflow.com/a/1098160/14425421 – Zeyad Apr 18 '21 at 10:33
  • Thank you everyone! Pyuca seems to work perfectly. So I have not even tried IBM's library. – Jeronome Apr 18 '21 at 14:27
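
The rule described in the comments (accents matter only as a tie-breaker) can be roughly illustrated with the standard library alone. The sketch below is not the Unicode Collation Algorithm and is not how `pyuca` works internally; it only approximates the idea by sorting on the unaccented word first and falling back to the original spelling for ties. The helper names `strip_accents` and `accent_tiebreak_key` are hypothetical and introduced here for illustration.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (ê -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_tiebreak_key(text):
    # Primary key: the word without accents.
    # Secondary key: the original word, so accents only decide ties.
    return (strip_accents(text), text)


words = ["pomme", "pêche", "où", "ou"]
ordered = sorted(words, key=accent_tiebreak_key)
print(ordered)  # ['ou', 'où', 'pêche', 'pomme']
```

For real French collation, a proper implementation such as `pyuca` or ICU remains the safer choice; this approximation ignores ligatures, case, and other collation levels.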

1 Answer

I've run a few tests for you. I meant this as a comment, but it doesn't fit in one, so I'm posting it as an answer.

In [1]: import locale
   ...: import functools
   ...:
   ...: locale.setlocale(locale.LC_ALL, 'fr_FR')
   ...: test=('pêche','pomme')
   ...: sorted(test,key=functools.cmp_to_key(locale.strcoll))
Out[1]: ['pêche', 'pomme']

In [2]: import locale
   ...: import functools
   ...:
   ...: locale.setlocale(locale.LC_ALL, 'fr_FR.utf8')
   ...: test=('pêche','pomme')
   ...: sorted(test,key=functools.cmp_to_key(locale.strcoll))
Out[2]: ['pêche', 'pomme']

In [3]: import locale
   ...: import functools
   ...:
   ...: locale.setlocale(locale.LC_ALL, 'fr_FR.ISO-8859-1')
   ...: test=('pêche','pomme')
   ...: sorted(test,key=functools.cmp_to_key(locale.strcoll))
Out[3]: ['pêche', 'pomme']

In [4]: import locale
   ...: import functools
   ...:
   ...: locale.setlocale(locale.LC_ALL, 'en_GB.ISO-8859-1')
   ...: test=('pêche','pomme')
   ...: sorted(test,key=functools.cmp_to_key(locale.strcoll))
Out[4]: ['pêche', 'pomme']

So far I have not been able to reproduce 'pomme, pêche'. In every case above I get 'pêche, pomme', which is the order you want.
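
Since the results of `locale.strcoll` depend on the locale data the operating system provides, it may help to check which collation locale is actually in effect on the machine where the sort misbehaves. The following is only a diagnostic sketch: the helper name `set_first_available` is hypothetical, and the candidate name `'fr_FR.UTF-8'` is an assumption that does not appear in the session above. It also uses `locale.strxfrm` directly as the sort key, which is equivalent in intent to `functools.cmp_to_key(locale.strcoll)`.

```python
import locale


def set_first_available(candidates, category=locale.LC_COLLATE):
    """Return the first locale name that setlocale() accepts, or None."""
    for name in candidates:
        try:
            locale.setlocale(category, name)
        except locale.Error:
            # This locale is not installed or not known on this system.
            continue
        return name
    return None


if __name__ == "__main__":
    # 'fr_FR.UTF-8' is an assumed spelling; the others come from the tests above.
    chosen = set_first_available(["fr_FR.UTF-8", "fr_FR.utf8", "fr_FR"])
    print("accepted locale:", chosen)
    # Calling setlocale() with only a category queries the current setting.
    print("collation locale in effect:", locale.setlocale(locale.LC_COLLATE))
    # The resulting order depends on the platform's locale data.
    print(sorted(("pêche", "pomme"), key=locale.strxfrm))
```

If the accepted locale is `None`, or the output order still differs between machines, the difference most likely lies in the installed locale data rather than in the Python code.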

quantummind
  • Thanks for testing! Unfortunately I don't understand why I can't make it work the same way. – Jeronome Apr 18 '21 at 14:25
  • How would you handle it if you also wanted the sort to be case-insensitive? The input `['pêche', 'Pomme']` would give you `['Pomme', 'pêche']`. – gdvalderrama Apr 25 '22 at 09:02
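
One possible way to approach the case-insensitive question, sketched with the standard library only: case-fold the unaccented form of each word before comparing, and keep the original spelling as a tie-breaker. This is an approximation under the assumption that accents and case should both be ignored at the primary level; it is not locale-aware collation, and the helper name `fold_key` is hypothetical.

```python
import unicodedata


def fold_key(text):
    # Decompose, drop combining marks (ê -> e), then case-fold (P -> p).
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string is kept as a secondary key to break ties.
    return (base.casefold(), text)


words = ["pêche", "Pomme"]
result = sorted(words, key=fold_key)
print(result)  # ['pêche', 'Pomme']
```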