Yes! To your specific question:
Will I always get the same results as Linux sort with LC_ALL=C?
Yes! Python defaults to the C locale, so you can expect the same behavior as Linux LC_ALL=C sort.
You can be more explicit about this behavior by setting it yourself and sorting with strxfrm:
import locale
locale.setlocale(locale.LC_ALL, 'C')      # all categories, like LC_ALL=C on Linux
locale.setlocale(locale.LC_COLLATE, 'C')  # or only the category that affects sorting
mylist.sort(key=locale.strxfrm)
# To incorporate locale sorting with other uses of key=,
# wrap locale.strxfrm() around whatever else you're doing:
mylist.sort(key=lambda i: locale.strxfrm( mysortfunc(i) ))
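As a quick sanity check of the claim above, here is a minimal sketch that sorts a small list three ways and compares the results. The `words` list and the `normalize` helper are made up for illustration (`normalize` stands in for `mysortfunc`), and the data is ASCII-only to keep the behavior portable across platforms.

```python
import locale

# Make the collation locale explicit; the 'C' locale is always available.
locale.setlocale(locale.LC_COLLATE, 'C')

# Hypothetical sample data (ASCII only).
words = ['banana', 'Apple', 'cherry', '_x', '10', '9']

plain = sorted(words)                                      # code point order
by_locale = sorted(words, key=locale.strxfrm)              # C-locale collation
by_bytes = sorted(words, key=lambda s: s.encode('utf-8'))  # raw byte order

print(plain)                           # ['10', '9', 'Apple', '_x', 'banana', 'cherry']
print(plain == by_locale == by_bytes)  # True


def normalize(s):
    """Hypothetical stand-in for mysortfunc: ignore case."""
    return s.lower()


# Wrap strxfrm around the other key function, as described above.
case_insensitive = sorted(words, key=lambda s: locale.strxfrm(normalize(s)))
print(case_insensitive)                # ['10', '9', '_x', 'Apple', 'banana', 'cherry']
```

For this input, all three orderings agree: digits first, then uppercase letters, then the underscore, then lowercase letters, which is the byte order LC_ALL=C sort uses.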
Documentation
From https://docs.python.org/3/library/locale.html :
Initially, when a program is started, the locale is the C locale, no matter what the user’s preferred locale is. ... The program must explicitly say that it wants the user’s preferred locale settings for other categories by calling setlocale(LC_ALL, '').
According to POSIX, a program which has not called setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling setlocale(LC_ALL, '') lets it use the default locale as defined by the LANG variable. Since we do not want to interfere with the current locale setting we thus emulate the behavior in the way described above.
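One way to see the documented startup behavior for yourself is to ask a freshly started interpreter for its collation setting. This is only a sketch: the helper name `collate_locale_at_startup` is invented here, and it assumes `sys.executable` points at a normal CPython with no site customization that calls `setlocale`.

```python
import subprocess
import sys


def collate_locale_at_startup():
    """Return the LC_COLLATE setting reported by a freshly started interpreter."""
    # Passing None as the locale only queries the current setting.
    code = "import locale; print(locale.setlocale(locale.LC_COLLATE, None))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(collate_locale_at_startup())
```

On a typical setup this should print C regardless of the LANG variable in your shell, because the child process never calls setlocale(LC_ALL, '').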
Example
# What are the settings when Python first starts?
>>> import locale
>>> locale.setlocale(locale.LC_ALL, None) # If locale is omitted or None, the current setting for category is returned.
'LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
# Note LC_COLLATE=C in the output above.
>>> locale.getlocale(locale.LC_COLLATE) # getlocale() reports the 'C' locale as:
(None, None)
# Set LC_COLLATE & use strcoll/strxfrm to sort according to user's locale
# (like linux sort(1) does by default):
>>> locale.setlocale(locale.LC_COLLATE, '') # An empty string specifies the user’s default settings.
'en_US.UTF-8'
>>> locale.setlocale(locale.LC_ALL, None)
'LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
# Note LC_COLLATE=en_US.UTF-8 in the output above.
>>> mylist.sort(key=locale.strxfrm)
>>> mylist.sort(key=lambda i: locale.strxfrm( mysortfunc(i) ))
# Set LC_ALL (everything) to user's locale (common practice):
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> locale.setlocale(locale.LC_ALL, None)
'en_US.UTF-8'
>>> locale.getlocale(locale.LC_COLLATE)
('en_US', 'UTF-8')
# Use portable/C locale, including byte-order sorting:
>>> locale.setlocale(locale.LC_ALL, 'C')
'C'
>>> locale.setlocale(locale.LC_ALL, None)
'C'
# The LC_ALL setting overrode our previous LC_COLLATE setting:
>>> locale.setlocale(locale.LC_COLLATE, None)
'C'
>>> locale.getlocale(locale.LC_COLLATE)
(None, None)
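Because a later setlocale() call silently replaces an earlier one, as the session above shows, it can help to scope a collation change to a single sort. The following is a rough sketch of one way to do that; the `collate_locale` context manager and the sample `words` list are hypothetical names, not part of the locale module.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collate_locale(name):
    """Temporarily switch LC_COLLATE to `name`, restoring the old value on exit."""
    previous = locale.setlocale(locale.LC_COLLATE, None)  # query only
    locale.setlocale(locale.LC_COLLATE, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


words = ['b', 'A', 'a', 'B']  # hypothetical sample data

with collate_locale('C'):
    print(sorted(words, key=locale.strxfrm))  # ['A', 'B', 'a', 'b']
```

Keep in mind that setlocale() changes process-wide state, so this pattern is best kept to single-threaded code.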
Many thanks to Frédéric Hamidi's answer, which sent me in the right direction to understand this.