I'm having some problems understanding the behavior of sort with locale set to en_US.UTF-8. Here is my example data:
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000481096 MAGED4B X 52063479 52063359 125 4.02e-29
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000481096 MAGED4B X 52063479 52063359 125 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000479281 MAGED4 X 52190616 52190736 125 3.75e-29
ENST00000623237 CTD-2651B20.8 15 45215040 45214916 ENST00000479281 MAGED4 X 52190616 52190736 125 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408548 SNORA11D X 52190621 52190736 118 1.30e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408548 SNORA11D X 52190621 52190736 118 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408778 SNORA11E X 52063474 52063359 118 1.30e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214916 ENST00000408778 SNORA11E X 52063474 52063359 118 7.16e-30
ENST00000623237 CTD-2651B20.8 15 45215033 45214906 ENST00000408163 SNORA11 15 45215033 45214906 128 5.31e-61
ENST00000623237 CTD-2651B20.8 15 45215033 45214906 ENST00000408163 SNORA11 15 45215033 45214906 128 9.60e-62
ENST00000623237 CTD-2651B20.8 15 45215033 45214915 ENST00000408789 SNORA11 X 54814370 54814486 121 4.28e-32
ENST00000623237 CTD-2651B20.8 15 45215033 45214915 ENST00000408789 SNORA11 X 54814370 54814486 121 7.74e-33
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000408823 SNORA11 X 54927305 54927374 70 2.02e-20
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000408823 SNORA11 32 54927305 54927374 70 3.69e-21
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000469211 TRO X 54927305 54927374 70 2.02e-20
ENST00000623237 CTD-2651B20.8 15 45215033 45214964 ENST00000469211 TRO X 54927305 54927374 70 2.89e-20
I need to sort this on the 7th column (MAGED4B, ...).
So, if I run:
cut -f1,7 sortTest.txt | sort -k2,2
I get the expected output:
ENST00000623237 MAGED4
ENST00000623237 MAGED4
ENST00000623237 MAGED4B
ENST00000623237 MAGED4B
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11
ENST00000623237 SNORA11D
ENST00000623237 SNORA11D
ENST00000623237 SNORA11E
ENST00000623237 SNORA11E
ENST00000623237 TRO
ENST00000623237 TRO
But when I also include the column right after the sort column (the 8th):
cut -f1,7,8 sortTest.txt | sort -k2,2
Results are no longer as expected:
ENST00000623237 MAGED4B X
ENST00000623237 MAGED4B X
ENST00000623237 MAGED4 X
ENST00000623237 MAGED4 X
ENST00000623237 SNORA11 15
ENST00000623237 SNORA11 15
ENST00000623237 SNORA11 32
ENST00000623237 SNORA11D X
ENST00000623237 SNORA11D X
ENST00000623237 SNORA11E X
ENST00000623237 SNORA11E X
ENST00000623237 SNORA11 X
ENST00000623237 SNORA11 X
ENST00000623237 SNORA11 X
ENST00000623237 TRO X
ENST00000623237 TRO X
To make things even weirder, when I append not the next column but the one after it (the 9th):
cut -f1,7,9 sortTest.txt | sort -k2,2
The output is again as expected:
ENST00000623237 MAGED4 52190616
ENST00000623237 MAGED4 52190616
ENST00000623237 MAGED4B 52063479
ENST00000623237 MAGED4B 52063479
ENST00000623237 SNORA11 45215033
ENST00000623237 SNORA11 45215033
ENST00000623237 SNORA11 54814370
ENST00000623237 SNORA11 54814370
ENST00000623237 SNORA11 54927305
ENST00000623237 SNORA11 54927305
ENST00000623237 SNORA11D 52190621
ENST00000623237 SNORA11D 52190621
ENST00000623237 SNORA11E 52063474
ENST00000623237 SNORA11E 52063474
ENST00000623237 TRO 54927305
ENST00000623237 TRO 54927305
I have used the --debug option (both with and without -b) in all trials to check whether the fields are being identified incorrectly, but they are not.
This "issue" goes away if I set LC_ALL=C, but I would prefer not to do that, as I'm not sure how it may affect the rest of my pipeline.
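For reference, here is a minimal sketch of how the C locale could be scoped to the sort step alone, so that the rest of the pipeline keeps en_US.UTF-8. The file name sortSample.txt and its three rows are made up for illustration, and the explicit tab separator (-t) assumes the input is tab-delimited. This only shows the workaround; it does not explain the en_US.UTF-8 behavior.

```shell
# Hypothetical reduced input: three tab-separated columns,
# shaped like the output of `cut -f1,7,8 sortTest.txt`.
printf 'ENST00000623237\tMAGED4B\tX\nENST00000623237\tMAGED4\tX\nENST00000623237\tSNORA11\t15\n' > sortSample.txt

# A literal tab character, used as the explicit field separator.
tab=$(printf '\t')

# The LC_ALL=C prefix applies to this single sort process only;
# the calling shell and every other command keep their locale.
sorted=$(LC_ALL=C sort -t "$tab" -k2,2 sortSample.txt)
printf '%s\n' "$sorted"
```

With byte-wise comparison and the key limited to the second tab-separated field, MAGED4 sorts before MAGED4B regardless of what follows in the third column.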