Understanding GNU sorting with en_US.UTF-8

Question

I'm having some problems understanding the behavior of sort with locale set to en_US.UTF-8. Here is my example data:

ENST00000623237 CTD-2651B20.8   15  45215040    45214916    ENST00000481096 MAGED4B X   52063479    52063359    125 4.02e-29
ENST00000623237 CTD-2651B20.8   15  45215040    45214916    ENST00000481096 MAGED4B X   52063479    52063359    125 7.16e-30
ENST00000623237 CTD-2651B20.8   15  45215040    45214916    ENST00000479281 MAGED4  X   52190616    52190736    125 3.75e-29
ENST00000623237 CTD-2651B20.8   15  45215040    45214916    ENST00000479281 MAGED4  X   52190616    52190736    125 7.16e-30
ENST00000623237 CTD-2651B20.8   15  45215033    45214916    ENST00000408548 SNORA11D    X   52190621    52190736    118 1.30e-30
ENST00000623237 CTD-2651B20.8   15  45215033    45214916    ENST00000408548 SNORA11D    X   52190621    52190736    118 7.16e-30
ENST00000623237 CTD-2651B20.8   15  45215033    45214916    ENST00000408778 SNORA11E    X   52063474    52063359    118 1.30e-30
ENST00000623237 CTD-2651B20.8   15  45215033    45214916    ENST00000408778 SNORA11E    X   52063474    52063359    118 7.16e-30
ENST00000623237 CTD-2651B20.8   15  45215033    45214906    ENST00000408163 SNORA11 15  45215033    45214906    128 5.31e-61
ENST00000623237 CTD-2651B20.8   15  45215033    45214906    ENST00000408163 SNORA11 15  45215033    45214906    128 9.60e-62
ENST00000623237 CTD-2651B20.8   15  45215033    45214915    ENST00000408789 SNORA11 X   54814370    54814486    121 4.28e-32
ENST00000623237 CTD-2651B20.8   15  45215033    45214915    ENST00000408789 SNORA11 X   54814370    54814486    121 7.74e-33
ENST00000623237 CTD-2651B20.8   15  45215033    45214964    ENST00000408823 SNORA11 X   54927305    54927374    70  2.02e-20
ENST00000623237 CTD-2651B20.8   15  45215033    45214964    ENST00000408823 SNORA11 32  54927305    54927374    70  3.69e-21
ENST00000623237 CTD-2651B20.8   15  45215033    45214964    ENST00000469211 TRO X   54927305    54927374    70  2.02e-20
ENST00000623237 CTD-2651B20.8   15  45215033    45214964    ENST00000469211 TRO X   54927305    54927374    70  2.89e-20

I would now need to sort based on the 7th column (MAGED4B...)...

So, if I run:

cut -f1,7 sortTest.txt | sort -k2,2

I get the expected output:

ENST00000623237 MAGED4
ENST00000623237 MAGED4
ENST00000623237 MAGED4B
ENST00000623237 MAGED4B
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11D
ENST00000623237 SNORA11D
ENST00000623237 SNORA11E
ENST00000623237 SNORA11E
ENST00000623237 TRO
ENST00000623237 TRO

But when I add the column next to the one to be sorted:

cut -f1,7,8 sortTest.txt | sort -k2,2

Results are no longer as expected:

ENST00000623237 MAGED4B X
ENST00000623237 MAGED4B X
ENST00000623237 MAGED4  X
ENST00000623237 MAGED4  X
ENST00000623237 SNORA11 15
ENST00000623237 SNORA11 15
ENST00000623237 SNORA11 32
ENST00000623237 SNORA11D        X
ENST00000623237 SNORA11D        X
ENST00000623237 SNORA11E        X
ENST00000623237 SNORA11E        X
ENST00000623237 SNORA11 X
ENST00000623237 SNORA11 X
ENST00000623237 SNORA11 X
ENST00000623237 TRO     X
ENST00000623237 TRO     X

To make things even weirder, when I append not the next column but the one after it (the 9th) to the 7th column:

cut -f1,7,9 sortTest.txt | sort -k2,2

The output is again as expected:

ENST00000623237 MAGED4  52190616
ENST00000623237 MAGED4  52190616
ENST00000623237 MAGED4B 52063479
ENST00000623237 MAGED4B 52063479
ENST00000623237 SNORA11 45215033
ENST00000623237 SNORA11 45215033
ENST00000623237 SNORA11 54814370
ENST00000623237 SNORA11 54814370
ENST00000623237 SNORA11 54927305
ENST00000623237 SNORA11 54927305
ENST00000623237 SNORA11D        52190621
ENST00000623237 SNORA11D        52190621
ENST00000623237 SNORA11E        52063474
ENST00000623237 SNORA11E        52063474
ENST00000623237 TRO     54927305
ENST00000623237 TRO     54927305

I have used the "--debug" parameter (with and w/o -b as well) on all trials to check whether fields may be identified wrongly, but this is not the case...

This "issue" resolves if I set LC_ALL=C, but I would prefer not to do that, as I'm not sure how it may affect the rest of my pipeline...

Do you have the same problem if you replace your `cut` with alternative field extraction? E.g. `awk '{print $1" "$7" "$8}'` The spacing in your examples suggests that your input file has a mixture of tabs and spaces. — borrible, Aug 20 '18 at 13:32
@borrible It seems to be just tabs, or `cut` wouldn't work like it does - but you're right, it does look suspicious. Maybe changed when copy-pasting into the question editor? — Benjamin W., Aug 20 '18 at 13:52
Related: https://www.pixelbeat.org/docs/coreutils-gotchas.html#sort https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021 https://unix.stackexchange.com/questions/252419/unexpected-sort-order-in-en-us-utf-8-locale http://unicode.org/reports/tr10 — kvantour, Aug 20 '18 at 15:00
you might be interested in : `cut -f1,7,8 sortTest.txt | LC_ALL=C sort -k2,2` — kvantour, Aug 20 '18 at 15:01
@borrible there are just tabs, sorry if it's difficult to read, didn't know how to do it better ;) — urs, Aug 21 '18 at 09:54
@kvantour ok, didn't know that I can change the locale setting temporarily in such a way - thanks! — urs, Aug 21 '18 at 09:55

rici · Answer 1 · 2018-08-21T16:11:04.340

Note: In response to a comment by OP, I re-examined this answer. Indeed, if the sort command were sort -k2,2, the output would be mysterious, and I cannot reproduce it with GNU sort. So I suspect that the actual command was -k2,3 (or, equivalently, -k2), and I'll leave the answer below on that basis.

The en_US.UTF-8 locale sorts digits before letters and ignores whitespace. In your first example (columns 7 and 8) this produces the unexpected ordering because X comes after D and E, so "SNORA11 X" lands after "SNORA11D X" and "SNORA11E X". In your second example (columns 7 and 9) it does not, because all digits come before D. Looking at the data without whitespace (the way the collation sees it) might clarify:

Columns 7 and 8:

ENST00000623237 MAGED4BX
ENST00000623237 MAGED4BX
ENST00000623237 MAGED4X
ENST00000623237 MAGED4X
ENST00000623237 SNORA1115
ENST00000623237 SNORA1115
ENST00000623237 SNORA1132
ENST00000623237 SNORA11DX
ENST00000623237 SNORA11DX
ENST00000623237 SNORA11EX
ENST00000623237 SNORA11EX
ENST00000623237 SNORA11X
ENST00000623237 SNORA11X
ENST00000623237 SNORA11X
ENST00000623237 TROX
ENST00000623237 TROX

Columns 7 and 9:

ENST00000623237 MAGED452190616
ENST00000623237 MAGED452190616
ENST00000623237 MAGED4B52063479
ENST00000623237 MAGED4B52063479
ENST00000623237 SNORA1145215033
ENST00000623237 SNORA1145215033
ENST00000623237 SNORA1154814370
ENST00000623237 SNORA1154814370
ENST00000623237 SNORA1154927305
ENST00000623237 SNORA1154927305
ENST00000623237 SNORA11D52190621
ENST00000623237 SNORA11D52190621
ENST00000623237 SNORA11E52063474
ENST00000623237 SNORA11E52063474
ENST00000623237 TRO54927305
ENST00000623237 TRO54927305

You can set an environment variable locally for a single command by putting the setting at the beginning of the command:

... | LC_COLLATE=C sort ... | ...

So you don't have to worry about the setting affecting other commands.
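
As a minimal sketch of that scoping, the snippet below runs sort with a per-command LC_COLLATE=C and then checks that the shell's own setting is untouched. The three-line sample is made up for illustration (it is not the OP's file), and the empty LC_ALL= is an added precaution on my part: as noted in the comments, a non-empty LC_ALL would override LC_COLLATE.

```shell
# Made-up three-line sample, tab-separated like the data in the question.
data=$(printf 'a\tSNORA11\tX\na\tSNORA11D\tX\na\tSNORA11\t15')

# Remember the shell's own collation setting before running sort.
before=${LC_COLLATE-unset}

# The assignments in front of sort apply to that one process only.
# LC_ALL= (empty) is a precaution: a non-empty LC_ALL would win over LC_COLLATE.
sorted=$(printf '%s\n' "$data" | LC_ALL= LC_COLLATE=C sort -k2)

after=${LC_COLLATE-unset}

# Byte-wise (C) collation puts the tab before any letter, so the two
# SNORA11 lines stay together ahead of SNORA11D.
printf '%s\n' "$sorted"

[ "$before" = "$after" ] && echo "shell LC_COLLATE unchanged"
```

With byte-wise collation the multi-column key -k2 no longer interleaves SNORA11 and SNORA11D, and every other command in the pipeline keeps its original locale.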

I alias sort to LC_COLLATE=C sort in my bash startup file, because the default Debian/Ubuntu collation order is useless (at best) for sorting, and also creates a significant slowdown.
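
Such an alias might look like the line below. The choice of ~/.bashrc as the startup file is an assumption; note also that bash only expands aliases in interactive shells by default, so scripts in a pipeline would still need the explicit prefix.

```shell
# Assumed to live in ~/.bashrc; affects interactive use of sort only.
alias sort='LC_COLLATE=C sort'
```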

A longer answer about the locale issue is here.

You can avoid this particular locale issue by sorting adjacent columns as separate keys (sort -k2,2 -k3,3 instead of sort -k2,3), but unless locale-based sorting is important to the data, it's faster and less confusing to avoid it.
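
A small sketch of the separate-keys form, again on a made-up tab-separated sample rather than the OP's file. Because each key covers exactly one field, the name in field 2 is compared on its own before field 3 is considered, so I would expect the same grouping in both the C and en_US.UTF-8 locales.

```shell
# Made-up three-line sample, tab-separated like the data in the question.
data=$(printf 'a\tSNORA11\tX\na\tSNORA11D\tX\na\tSNORA11\t15')

# Two single-field keys: field 2 decides first, field 3 only breaks ties.
# A spanning key (-k2,3) would instead compare "SNORA11<tab>X" against
# "SNORA11D<tab>X" as one string, which is where whitespace-ignoring
# collation can interleave the two names.
by_field=$(printf '%s\n' "$data" | sort -k2,2 -k3,3)

printf '%s\n' "$by_field"
```

Both SNORA11 lines come out ahead of SNORA11D, ordered among themselves by the third field.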

I would understand this behavior if I'd sort with -k2, but I'm using -k2,2 which should sort based on the second column only, shouldn't it? — urs, Aug 21 '18 at 09:57
@urs: Yes, you're right, and in fact I can't reproduce your output on my machine. I thought it had to do with last-resort sorting, but obviously that's not correct so I reverted my edit. — rici, Aug 21 '18 at 15:58
@urs: I tried a variety of possible command line options, but the only ones which reproduce your results are `-k2` and `-k2,3`. Could you please verify that the command in your question is the one you actually used? Thanks. — rici, Aug 21 '18 at 16:12
It must be mentioned however that setting `LC_COLLATE` will be ignored if `LC_ALL` has been set. To this end, I would suggest the usage of the latter instead of the former. — kvantour, Aug 22 '18 at 06:36
@kvantour: that's a reason to never set LC_ALL. Typically one would use LANG to set the default locale; that's the way most distros are configured by default. You should never have LC_ALL set. — rici, Aug 22 '18 at 08:35
@rici I can confirm that the commands I posted are correct (I just c+p from terminal). I further tested the file + commands on a different system with the same locale setting and the results are perfectly fine... What else on my system could have an influence on the sorting? — urs, Aug 22 '18 at 10:21
@urs: that's a weird one. Perhaps `sort` is redefined in some way. Let's start with `alias sort`, `which sort` and `sort --version` — rici, Aug 22 '18 at 15:21
