Unix sort - character comparison algorithm

Question

I need a file sorted the same way Python would sort it.

I have a file sorted with the Unix sort program. After sorting it, I wrote a Python script to check that it was sorted correctly:

with open('my_file_location') as f:
    last_l = next(f)
    for l in f:
        if last_l > l:
            print(last_l, l)
            break
        last_l = l

The script failed on the following pair:

('250,8\n', '25,1\n')

I experimented a bit with the sort tool to check whether its output is repeatable and really inconsistent with Python's comparison. I found two interesting cases:

 $ echo -e "250,1\n25,8" | sort
250,1
25,8
 $ echo -e "250,\n25," | sort
25,
250,

Why do these two calls give me two different orders? It seems odd, because the leading characters stay the same and only the ending changes.

My file is huge, so I would prefer to keep my current sorted file. How can I apply the same string comparison in Python?

If this comparison cannot be implemented quickly, or it might cause other issues, how can I sort my file with sort again, this time using a comparison consistent with Python's?

UPDATE

Example of Python output below (inconsistent with output of Unix sort tool):

>>> '250,1' > '25,8'
True
>>> '250,' > '25,'
True

Unlike the Unix sort tool, Python gives the same result for both comparisons.

Possible duplicate of [Does Python have a built in function for string natural sort?](http://stackoverflow.com/questions/4836710/does-python-have-a-built-in-function-for-string-natural-sort) – Christian König, May 22 '17 at 12:08
your current locale affects the order produced by `sort` - `LC_ALL=C echo -e "250,1\n25,8" | sort` should be "consistent" with your example... – ewcz, May 22 '17 at 12:13
@Chris_Rands Why do those two `echo | sort` calls give me two different orders, and why does it not happen in Python? The question has been updated with the example. – pt12lol, May 22 '17 at 12:14
@ChristianKönig I tried the solution with the `natsort` package from the question you linked (the accepted answer) and it didn't work in my case (it failed on another entry). I don't think it covers my issue. – pt12lol, May 22 '17 at 12:29
@pt12lol sorry, it should be `echo -e "250,1\n25,8" | LC_ALL=C sort` – ewcz, May 22 '17 at 12:37
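
For the second part of the question (re-sorting so that the result agrees with Python), the corrected command in the comment above can also be driven from Python. This is only a sketch: `sort_bytewise` is a hypothetical helper name, it assumes a `sort` executable is on the `PATH` and Python 3.5 or later, and it passes the text through a pipe, so for a huge file it would be more practical to run `LC_ALL=C sort` on the file directly.

```python
import os
import subprocess

def sort_bytewise(text):
    # Hypothetical helper: run the external `sort` with LC_ALL=C so that
    # it compares raw bytes instead of using the locale's collation rules.
    env = dict(os.environ, LC_ALL='C')
    result = subprocess.run(
        ['sort'],
        input=text,
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout

# For ASCII data like the example, byte order agrees with Python's
# default string comparison.
print(sort_bytewise('250,1\n25,8\n'))
```

With the two lines from the question this prints `25,8` before `250,1`, which is the same order that `sorted(['250,1', '25,8'])` gives.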

score 1 · Accepted Answer · answered May 22 '17 at 12:45

You can confirm that the locale is the culprit with:

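import functools
import locale

L = ['250,1', '25,8']
# locale.strcoll is a cmp-style function; cmp_to_key adapts it for sorted()
# (the cmp= argument of sorted() exists only in Python 2).
by_locale = functools.cmp_to_key(locale.strcoll)

# UTF-8 locale: the order reported here matches the sort output in the question.
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
print(sorted(L, key=by_locale))
#['250,1', '25,8']

# C locale: the order matches Python's default string comparison.
locale.setlocale(locale.LC_ALL, 'C')
print(sorted(L, key=by_locale))
#['25,8', '250,1']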
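
As for why the two `sort` calls in the question disagree: in many UTF-8 locales (older glibc `en_US.UTF-8` collation tables are the usual example; that this applies to the asker's machine is an assumption), punctuation such as the comma is ignored on the first comparison pass. The sketch below only imitates that effect by dropping punctuation before comparing. `rough_locale_key` is a hypothetical helper for illustration, not what `sort` or `strcoll` actually runs.

```python
import string

def rough_locale_key(s):
    # Hypothetical helper: drop ASCII punctuation to imitate a collation
    # that ignores punctuation on its first pass. Real locale collation
    # has further tie-breaking levels that this sketch leaves out.
    return ''.join(ch for ch in s if ch not in string.punctuation)

# With the comma ignored, '250,1' compares like '2501' and '25,8' like '258'.
print(sorted(['250,1', '25,8'], key=rough_locale_key))

# With the comma ignored, '250,' compares like '250' and '25,' like '25'.
print(sorted(['250,', '25,'], key=rough_locale_key))
```

Under this approximation the first pair keeps `250,1` ahead of `25,8` (because `2501` sorts before `258`), while the second pair puts `25,` ahead of `250,` (because `25` is a prefix of `250`), which mirrors the two outputs in the question.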

ewcz

Calling `locale.setlocale` and then using `locale.strcoll` seems to be the solution to my problem. – pt12lol May 22 '17 at 13:16
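
Following up on that comment, the check script from the question could be adapted to compare with `locale.strcoll` instead of `>`. This is a minimal sketch: `first_unsorted_pair` is a hypothetical name, and it assumes the locale set in Python is the same one `sort` ran under.

```python
import locale

def first_unsorted_pair(lines, compare=locale.strcoll):
    """Return the first adjacent pair that is out of order, or None."""
    it = iter(lines)
    try:
        prev = next(it)
    except StopIteration:
        return None  # empty input counts as sorted
    for cur in it:
        # Strip the trailing newline so only the line content is compared.
        if compare(prev.rstrip('\n'), cur.rstrip('\n')) > 0:
            return prev, cur
        prev = cur
    return None
```

To use it on the file from the question, call `locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')` first (the locale name is an assumption; use whichever one `sort` saw), then pass the open file object to `first_unsorted_pair`. A result of `None` means no out-of-order pair was found under that locale.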