I am working with a file that contains 3 values, an ID (they happen to be protein ids in case you are curious), a value, and then another value. It is tab delimited, so it looks like this:

A2M     0.979569315988908       1
AACS    0.925340159491081       1
AAGAB   0.982296215686199       1
AAK1    0.736903840140103       1
AAMP    0.00589711816127862     0.138868449447202
AARS2   1       1
AARS    3.13300124295614e-05    0.00212792325492566
AARSD1  0.527417792161261       1
AASDH   0.869909252023668       1
AASDHPPT        0.763918221284724       1
AATF    0.691907759125663       1
ABAT    0.989693691462661       1
ABCA1   0.601194017450064       1
ABCA5   1       1
ABCA6   1       1

I am interested in sorting these IDs in alphabetical order and extracting various values. However, I noticed that sort sorts the IDs differently, depending on what I am extracting. When I execute:

    cut --fields\=1,2 input.txt|sort --key=1

The resulting file is:

A2M     0.979569315988908
AACS    0.925340159491081
AAGAB   0.982296215686199
AAK1    0.736903840140103
AAMP    0.00589711816127862
AARS2   1
AARS    3.13300124295614e-05 
AARSD1  0.527417792161261
AASDH   0.869909252023668
AASDHPPT        0.763918221284724
AATF    0.691907759125663
ABAT    0.989693691462661
ABCA1   0.601194017450064
ABCA5   1
ABCA6   1

But when I execute:

    cut --fields\=1,3 input.txt|sort --key=1

I get:

A2M     1
AACS    1
AAGAB   1
AAK1    1
AAMP    0.138868449447202
AARS    0.00212792325492566
AARS2   1
AARSD1  1
AASDH   1
AASDHPPT        1
AATF    1
ABAT    1
ABCA1   1
ABCA5   1
ABCA6   1

Notice that the positions of AARS and AARS2 are switched, which they shouldn't be since I am just sorting based on the first column. I've never seen any behavior like this from sort, and I've been using bash for a while now. Is this a bug, or am I doing something wrong?

Josh
  • Can't reproduce that here with cut/sort v8.21. I get aars->aars2 with both 1,2 and 1,3 – Marc B May 15 '15 at 14:09
  • You don't need to escape (or even use) the `=` in the call to `cut`. – chepner May 15 '15 at 14:13
  • This is incredible I actually have reproduced this but don't believe it. – ojblass May 15 '15 at 14:14
  • @MarcB I have sort sort (GNU coreutils) 8.4 and cut cut (GNU coreutils) 8.4 @shellter when I add -t="\t" to sort I get "error sort: multi-character tab `=\\t'" – Josh May 15 '15 at 14:21

3 Answers

The --key=1 option tells sort to use all "fields" from the first through the end of the line to sort the input. As @rici observed first, by default this is a locale-sensitive sort, and in many locales whitespace is ignored for collation purposes. That's what seems to be happening here.

If you want to sort only on the protein IDs, then that would be this:

    cut --fields=1,2 input.txt | sort --key=1,1
    cut --fields=1,3 input.txt | sort --key=1,1
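
As a quick check of the `--key=1,1` approach, here is a minimal sketch using two rows copied from the question. The variable names are made up for illustration, and the explicit `-t` tab separator is an extra assumption that is not part of the commands above. With the key limited to the first field, AARS should come before AARS2 whichever value column is attached.

```shell
# Hypothetical two-row samples: one ID is a prefix of the other.
tab=$(printf '\t')
rows_col2=$(printf 'AARS2\t1\nAARS\t3.13300124295614e-05\n')
rows_col3=$(printf 'AARS2\t1\nAARS\t0.00212792325492566\n')

# -t makes the tab the field separator; -k1,1 starts and ends the key at field 1.
sorted_col2=$(printf '%s\n' "$rows_col2" | sort -t "$tab" -k1,1)
sorted_col3=$(printf '%s\n' "$rows_col3" | sort -t "$tab" -k1,1)

printf '%s\n' "$sorted_col2" "$sorted_col3"
```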

@rici explains how to approach the problem by specifying a collation order that accounts for whitespace.

John Bollinger

You're using a locale-aware sort (which is the default). In many locales, whitespace is explicitly ignored in the collation order; that, combined with the fact that your key extends from the first field to the end of the line (which means that the --key option is redundant), effectively means that the lines are sorted as though the fields were concatenated without intervening whitespace.
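
To see the "concatenated" effect without depending on which locales are installed, the sketch below imitates it: it deletes the tabs by hand and then sorts bytewise. This is only an approximation of what a whitespace-ignoring collation does, and the variable names are hypothetical. The rows are the AARS and AARS2 lines from the question.

```shell
# Simulate "whitespace ignored" by stripping the tabs, then sort bytewise.
# With column 2 attached: AARS21 vs AARS3.13..., and '2' sorts before '3'.
order_col2=$(printf 'AARS2\t1\nAARS\t3.13300124295614e-05\n' | tr -d '\t' | LC_ALL=C sort)

# With column 3 attached: AARS0.002... vs AARS21, and '0' sorts before '2'.
order_col3=$(printf 'AARS2\t1\nAARS\t0.00212792325492566\n' | tr -d '\t' | LC_ALL=C sort)

printf '%s\n' "$order_col2" "$order_col3"
```

The first pair comes out with the AARS2 row on top and the second pair with the AARS row on top, which matches the swap seen in the question.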

There's a much longer explanation here: https://stackoverflow.com/a/27951508/1566221

My preference is to use LC_COLLATE=C sort ... for a non-locale-aware sort. (For example, define alias csort="LC_COLLATE=C sort"). In this case you could also just explicitly terminate the sort key by using -k1,1. If your first columns are unique, then that is sufficient.
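
A minimal sketch of the C-locale sort on the two rows from the question follows. One caveat: if `LC_ALL` is already set in your environment it takes precedence over `LC_COLLATE`, so the sketch uses `LC_ALL=C` instead. That substitution is an assumption about your environment, not a correction of the alias above, and the variable names are hypothetical.

```shell
# In the C locale, sort compares raw bytes, so the tab (0x09) after "AARS"
# sorts before the "2" in "AARS2" whatever the value column contains.
c_col2=$(printf 'AARS2\t1\nAARS\t3.13300124295614e-05\n' | LC_ALL=C sort)
c_col3=$(printf 'AARS2\t1\nAARS\t0.00212792325492566\n' | LC_ALL=C sort)

printf '%s\n' "$c_col2" "$c_col3"
```

Both outputs list the AARS row first, so the order no longer depends on which value column was cut.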

rici
  • Good observation about the effect of locale-awareness. You are quite right that choosing collation according to the C locale also resolves the issue. – John Bollinger May 15 '15 at 14:54

I think that sort skips tabs... the net effect is that AARS0.00212792325492566 comes before AARS21, but AARS21 comes before AARS3.13300124295614e-05. See this question.

The following should work:

    cut -f1,2 input.txt | sort -t$'\t'

Unfortunately it doesn't, but I think this stripping out of tabs is what is causing the issue.

ojblass