I have a pipeline in which I sort a huge number of strings with Unix/bash tools and then need to do string comparisons based on that sort order. This has to work identically on different Unix/Linux systems. It turns out that the character order used by Unix sort depends on the locale settings and is not necessarily identical to that of Python's sort() method. I have therefore changed the sort commands to run with env LC_ALL=C, so that the same ordering is used on all systems (and with a locale that should be available everywhere).

I have now learned that I can use Python's locale module to set the locale to "C" with locale.setlocale(locale.LC_ALL, "C") and then use locale.strcoll() to ensure that string comparisons in Python behave like the earlier Unix sort.
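To make the locale-based approach concrete, here is a minimal sketch of what I mean. The word list is made up for illustration and is deliberately ASCII-only, since that is the only case where I am sure what the result should be.

```python
import functools
import locale

# Switch the whole process to the "C" locale (this changes global state).
locale.setlocale(locale.LC_ALL, "C")

# Made-up, ASCII-only sample data for illustration.
words = ["banana", "Apple", "apple", "_x", "Zebra", "10", "9"]

# locale.strcoll() is a comparison function, so wrap it with cmp_to_key
# to use it as a sort key.
sorted_with_strcoll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# Plain Python sorting for comparison.
sorted_plain = sorted(words)

print(sorted_with_strcoll)
print(sorted_plain)
```

For this ASCII-only input both lists come out in the same order, which is what prompted the question below.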

However, it seems to me that Python's standard string sorting is based on character byte values and is therefore equivalent to the "C" locale. Wouldn't that mean that I can trust the standard Python sorting functions (without the locale module) to produce the same order as the Unix sorts, at least as long as I make sure the Unix sorts are run with the "C" locale?
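A rough way to check this, assuming the files are UTF-8 encoded and the strings contain no newlines: Python 3 compares str objects by Unicode code point, and UTF-8 byte order preserves code point order, so sorting str values should agree with sorting their UTF-8 bytes. The helper unix_c_sort below is a hypothetical name of mine; it only shells out to the system sort with LC_ALL=C so the two results can be compared on a given machine.

```python
import os
import subprocess

# Made-up sample data, including non-ASCII strings.
samples = ["zebra", "Apple", "apple", "éclair", "Zürich", "_tmp", "10", "9", "日本"]

# Default Python 3 ordering: compares Unicode code points.
by_code_point = sorted(samples)

# Ordering of the UTF-8 encoded bytes, which is what a bytewise
# comparison of UTF-8 text sees.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))


def unix_c_sort(strings):
    """Hypothetical helper: sort newline-free strings with `sort` under LC_ALL=C."""
    data = "".join(s + "\n" for s in strings).encode("utf-8")
    result = subprocess.run(
        ["sort"],
        input=data,
        env={**os.environ, "LC_ALL": "C"},
        capture_output=True,
        check=True,
    )
    # Drop the empty element after the final newline.
    return result.stdout.decode("utf-8").split("\n")[:-1]


print(by_code_point == by_utf8_bytes)
```

On a system with sort installed, comparing unix_c_sort(samples) against by_code_point would show whether the two orders match there. This only covers UTF-8 data; other encodings may order bytes differently from code points.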

martineau
jov14
  • Python 3 strings are Unicode strings. If you want to match C code using 8-bit characters, then you may need to use byte strings. That means you have to worry about encoding characters beyond ordinal 127, but you'd need to do that anyway. – Tim Roberts Dec 29 '21 at 17:33
  • https://stackoverflow.com/questions/26505661/does-python-3-string-ordering-depend-on-locale – Thierry Lathuille Dec 29 '21 at 17:34
  • I believe the linked duplicate (as first suggested by Thierry) is a proper superset of this question (its answers provide everything needed to infer this question's answer... an answer in line with Tim's comment above). – Charles Duffy Dec 29 '21 at 17:38

0 Answers