
After some initial tests, it seems that Python sorts strings in the same order as Linux sort (GNU sort) does when the locale is set to "C".

However, I'd like to write Python code that sorts and compares strings the same way GNU sort does, depending on the locale.

Small example code to illustrate the issue:

import os 
import subprocess

words = [
    "Abd",
    "éfg",
    "aBd",
    "aBd",
    "zzz",
    "ZZZ",
    "efg",
    "abd",
    "fff",
    ]

with open("tosort", "w") as fout:
    for word in words:
        fout.write(word + "\n")

os.environ["LC_ALL"] = "en_US.UTF-8" 
proc = subprocess.Popen(["sort", "tosort"], stdout=subprocess.PIPE)
sort_en_utf = proc.stdout.read().decode('utf-8').split()

os.environ["LC_ALL"] = "C" 
proc = subprocess.Popen(["sort", "tosort"], stdout=subprocess.PIPE) 
sort_c = proc.stdout.read().decode('utf-8').split()

os.environ["LC_ALL"] = "en_US.UTF-8"
sort_py = sorted(words)

for row in zip(sort_en_utf, sort_c, sort_py):
    print(" ".join(row))

When I run the code above, I get the following output:

abd Abd Abd
aBd ZZZ ZZZ
aBd aBd aBd
Abd aBd aBd
efg abd abd
éfg efg efg
fff fff fff
zzz zzz zzz
ZZZ éfg éfg

Column 1 is the sorting/comparison order that I'd like to have in my Python code when the locale is "en_US.UTF-8". Columns 2 and 3 show that Python sorts the same way as Linux sort does when the locale is set to "C".

So I'd also like to know whether there is a way to make:

"éfg" < "fff" yield True. I don't insist on a comparison operator; calling a function is fine too. But the resulting order should respect the current locale.

gelonida
  • Related: https://stackoverflow.com/q/4836710/674039 – wim Oct 01 '19 at 22:12
  • Will look at natsort. I will check especially whether this library respects the locale, and whether it would also sort letters like "ß" (which should be sorted like "ss") – gelonida Oct 01 '19 at 22:20
  • Just checked natsort. It is interesting, but it does something different: for example, it orders numbers within strings differently, e.g. `"a2" < "a10"`, so it wouldn't behave identically to a plain sort (no options) under a UTF-8 locale – gelonida Oct 01 '19 at 22:23
  • The natsort docs mention PyICU https://pypi.org/project/PyICU/; will look at this one now – gelonida Oct 01 '19 at 22:28
  • If you need this to be foolproof, the best way might be actually calling bash sort from python :) – wim Oct 01 '19 at 23:18
  • Thanks wim. Your suggestion to look at the other article made me follow a series of several links, which led me in the end to howto/sorting.html, where I should have looked in the first place. – gelonida Oct 01 '19 at 23:18
  • @wim, well, this is what I'm doing at the moment (calling sort via subprocess), but I don't really like it that much, though the sort command is fast and falls back to merge sort if the files are really huge, so there is no danger of consuming all RAM. In fact I had one issue where I was sorting a list with the command-line sort and then processing the result with a Python script. The fact that the two did not agree on the same ordering cost me quite some debugging time – gelonida Oct 01 '19 at 23:21
  • My bad. Will change the title. And you're right, "bash sort" was not really good wording – gelonida Oct 01 '19 at 23:24

1 Answer


Hmmm somehow I overlooked this:

Python's sorting HOWTO https://docs.python.org/3.5/howto/sorting.html mentions, in its last section "Odds and Ends", the function locale.strxfrm() (see https://docs.python.org/3.5/library/locale.html#locale.strxfrm ) as a key function for sorting and locale.strcoll() as a comparison function.

So the following modified code is almost OK, except that the comparison function does not directly return True/False but a negative, zero, or positive number; this is fine in my context:

import locale
import os
import subprocess

words = [
    "Abd",
    "éfg",
    "aBd",
    "aBd",
    "zzz",
    "ZZZ",
    "efg",
    "abd",
    "fff",
    "sra",
    "ssa",
    "ssb",
    "stb",
    "ßaa",
    ]

val1 = "ßaa"
val2 = "ssb"

with open("tosort", "w") as fout:
    for word in words:
        fout.write(word + "\n")

os.environ["LC_ALL"] = "en_US.UTF-8"
proc = subprocess.Popen(["sort", "tosort"], stdout=subprocess.PIPE)
sort_en_utf = proc.stdout.read().decode('utf-8').split()

os.environ["LC_ALL"] = "C"
proc = subprocess.Popen(["sort", "tosort"], stdout=subprocess.PIPE)
sort_c = proc.stdout.read().decode('utf-8').split()

locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
sort_py1 = sorted(words, key=locale.strxfrm)
print("%r < %r = %s , but locale.strcoll(%r, %r) = %s for %s"
      % (val1, val2, val1 < val2, val1, val2,
         locale.strcoll(val1, val2), locale.getlocale())
      )

locale.setlocale(locale.LC_ALL, "C")
sort_py2 = sorted(words, key=locale.strxfrm)
print("%r < %r = %s , but locale.strcoll(%r, %r) = %s for %s"
      % (val1, val2, val1 < val2, val1, val2,
         locale.strcoll(val1, val2), locale.getlocale())
      )

for row in zip(sort_en_utf, sort_py1, sort_c, sort_py2):
    print(" ".join(row))

The output is:

'ßaa' < 'ssb' = False , but locale.strcoll('ßaa', 'ssb') = -1 for ('en_US', 'UTF-8')
'ßaa' < 'ssb' = False , but locale.strcoll('ßaa', 'ssb') = 1 for (None, None)
abd abd Abd Abd
aBd aBd ZZZ ZZZ
aBd aBd aBd aBd
Abd Abd aBd aBd
efg efg abd abd
éfg éfg efg efg
fff fff fff fff
sra sra sra sra
ssa ssa ssa ssa
ßaa ßaa ssb ssb
ssb ssb stb stb
stb stb zzz zzz
zzz zzz ßaa ßaa
ZZZ ZZZ éfg éfg
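
If a plain True/False result is wanted, as in the `"éfg" < "fff"` example from the question, one option is a thin wrapper around locale.strcoll(). The following is only a sketch: the helper names `locale_less_than` and `locale_sorted` are made up for illustration, and it assumes the "en_US.UTF-8" locale is installed (it silently keeps the active locale otherwise). The actual ordering depends on the platform's collation tables, so no output is shown.

```python
import functools
import locale


def locale_less_than(a, b):
    """Return True if a sorts strictly before b under the current LC_COLLATE.

    Hypothetical helper: strcoll() returns a negative, zero, or positive
    number, so comparing the result against 0 turns it into a boolean.
    """
    return locale.strcoll(a, b) < 0


def locale_sorted(items):
    """Sort using strcoll() as a comparison function.

    Hypothetical helper: functools.cmp_to_key() adapts the comparison
    function to a key function. sorted(items, key=locale.strxfrm) is the
    key-based alternative used in the code above.
    """
    return sorted(items, key=functools.cmp_to_key(locale.strcoll))


if __name__ == "__main__":
    try:
        # Assumption: this locale exists on the machine running the code.
        locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    except locale.Error:
        # Locale not installed: keep whatever collation is currently active.
        pass
    print(locale_less_than("éfg", "fff"))
    print(locale_sorted(["fff", "éfg", "efg"]))
```

Both helpers follow whatever locale is active when they are called, so switching with locale.setlocale() changes their results without any further code changes.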
gelonida