I have a pipeline where I sort a huge number of strings with unix/bash tools and then need to do string comparisons based on this sort order.
I need this to behave identically on different unix/linux systems.
It turns out that the character order used by unix sort depends on the locale settings and is not necessarily identical to that of Python's sort() method. I have modified the sort commands to use env LC_ALL=C to ensure that the same sort order is used on all systems (and that the chosen locale is one that should be available everywhere).
Now I have learned that I can use the Python locale module to set the locale to "C" with locale.setlocale(locale.LC_ALL, "C") and then use locale.strcoll() to ensure that string comparisons in Python work similarly to the earlier unix sort based sorting.
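To make the locale-based approach concrete, here is a minimal sketch of what I have in mind. The sample words are made up for illustration, and I am assuming that wrapping locale.strcoll() with functools.cmp_to_key() is a reasonable way to plug it into sorted():

```python
import functools
import locale

# Switch collation to the "C" locale, which should exist on every system.
locale.setlocale(locale.LC_ALL, "C")

# Made-up sample data, not taken from the real pipeline.
words = ["banana", "Apple", "cherry", "_tmp", "apple"]

# locale.strcoll() is a two-argument comparison function,
# so it has to be wrapped with cmp_to_key() to serve as a sort key.
sorted_by_strcoll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

print(sorted_by_strcoll)
# → ['Apple', '_tmp', 'apple', 'banana', 'cherry']
```

In the "C" locale, uppercase letters sort before the underscore, which sorts before lowercase letters, matching ASCII order.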
However, it seems to me that Python's standard string sorting is based on character byte values and is therefore equivalent to the locale "C" setting. Wouldn't that mean that I could trust the sorts to be identical to the unix sorts simply by using the standard Python sorting functions (without the locale module), at least as long as I make sure that the unix sorts are done with the locale "C" setting?
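Here is a rough sketch of how I would check this assumption empirically. The helper names unix_sort_c_locale and python_byte_sort are my own, the sample strings are invented, and the script assumes that a sort executable is on the PATH. As far as I understand, Python 3 compares bytes objects by byte value and str objects by Unicode code point, and UTF-8 preserves code point order, so both should agree with a byte-wise sort:

```python
import os
import shutil
import subprocess


def unix_sort_c_locale(lines):
    """Sort a list of bytes objects with the external sort command under LC_ALL=C."""
    result = subprocess.run(
        ["sort"],
        input=b"".join(line + b"\n" for line in lines),
        stdout=subprocess.PIPE,
        env={**os.environ, "LC_ALL": "C"},  # force the "C" locale for this call only
        check=True,
    )
    return result.stdout.splitlines()


def python_byte_sort(lines):
    """Sort a list of bytes objects with Python's default ordering (byte values)."""
    return sorted(lines)


# Invented sample data, including spaces, punctuation and a non-ASCII character.
sample_text = ["banana", "Apple", "_tmp", "apple", "Zebra", "éclair", "a b", "a-b"]
sample = [s.encode("utf-8") for s in sample_text]

if shutil.which("sort"):
    # Compare the two orderings only when the external tool is available.
    print(unix_sort_c_locale(sample) == python_byte_sort(sample))
else:
    print("sort not found; skipping the comparison")
```

A check like this only covers the strings it is given, so it would need to run on a representative sample of the real data before I relied on it.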