6

unix numeric sort gives strange results, even when I specify the delimiter.

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

For this example, sort gives the same result regardless if you specify the delimiter. I know if I set LC_ALL=C then sort starts to give expected behavior again. But I do not understand why the default environment settings, as shown below, would make this happen.

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

I've read from many other questions (e.g. here, here, and here) how to avoid this behavior in sort, but still, this behavior is incredibly weird and unpredictable and has caused me a week of heartache. Can someone explain why sort with default environment settings on Mac OS X (10.8.5) would behave this way? In other words: what is sort doing (with local variables set to en_US.UTF-8) to get that result?

I'm using

 sort 5.93                        November 2005

 $ type sort
 sort is /usr/bin/sort

UPDATE

I've discussed this on the gnu-coreutils list and now understand why sort with english unicode default locale settings gave the output it did. Because in English unicode, the comma character "," is considered a numeric (so as to allow for comma's as thousand's (or e.g. hundreds) separators), and sort defaults to "being greedy" when it interprets a line, it read the example numbers as approximately

581.491...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...

Although this was not what I had intended and chepner is right that to get the actual result I want, I need to specify that I want sort to key on only the first field. sort defaults to interpreting more of the line as a key rather than just the first field as a key.

This behavior of sort has been discussed in gnu-coreutil's FAQ, and is further specified in the POSIX description of sort.

So that, as Eric Blake on the gnu-coreutil's list put it, if the field-separator is also a numeric (which a comma is) then "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields."

Community
  • 1
  • 1
gabe
  • 2,521
  • 2
  • 25
  • 37
  • 1
    Maybe it's allowing the comma as thousands separator, and permitting it anywhere? Certainly a bug with `-t,` at least if that's the case. – tripleee Oct 07 '13 at 15:45
  • what happens if you specify which columns are the key to sort on? i.e either `+0 -1` or `-k1`? Good luck. – shellter Oct 07 '13 at 15:47
  • 1
    "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields." Actually, even when using -k I'm getting the same issue, which doesn't make any sense. – Ethan Herdrick Oct 15 '15 at 20:15
  • @hendrick what is your Locale? Open a question and I will take a look. – gabe Oct 15 '15 at 20:50

3 Answers3

7

I'm not sure this is entirely correct, but it's close.

sort -n -t, will try to sort numerically by the given key(s). In this case, the key is a tuple consisting of an integer and a float. Such tuples cannot be sorted numerically.

If you explicitly specify which single keys to sort on with

sort -k1,1n -k2,2n -t,

it should work. Now you are explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).

I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.

chepner
  • 497,756
  • 71
  • 530
  • 681
  • That's interesting. `sort -k1,1n` works. But what is it doing when I enter `sort -k1n` (which does not work) ? Why does it try to sort a tuple instead of defaulting to the first value in a line? And when I specify the first column with `-k1n` why doesn't that sort by the first column? – gabe Oct 07 '13 at 16:21
  • 2
    `-k,` sorts using fields `m` through `n`, inclusive. `-k` uses fields `m` through the last field. You have to use `-k,` to sort on exactly field `m`. Using `-k1` is legal, but identical to not using `-k` at all. Something like `-k2` is more common (sort on all fields but the first). – chepner Oct 07 '13 at 16:24
  • Thank you very much for your responses. That makes sense (though counter-intuitive), but then not including the -k at all should mean that it sorts on all fields -- using the first field first. Why isn't it doing this? – gabe Oct 07 '13 at 16:30
  • 1
    The distinction is that `sort -n` and `sort -k1 -n` uses a single key with a non-numeric value (specifically, a tuple of an int and a float), while `sort -k1,1n -k2,2n` uses two keys, each of which is a single numerical value. You are using "field" and "key" as synonyms, which they are not. A key can consist of 1 or more fields, and `sort` can use one or more keys (by sorting on the first key, then the second key, etc). – chepner Oct 07 '13 at 16:32
  • That makes sense. But why can't tuples of numerics be sorted numerically? – gabe Oct 08 '13 at 14:41
  • 1
    Because tuples aren't numbers. Also, `tuple` is probably a poor choice of words on my part. `sort` seems to simply concatenate the fields that make up a key into a single string. But the fact is, `sort` simply doesn't distinguish between the fields used to create such a key. You can argue that perhaps it should, but it doesn't. – chepner Oct 08 '13 at 14:57
  • Thank you. That completely answers my question. The next step is to see where the conversation on this is at in unix! Appreciate the help. – gabe Oct 08 '13 at 16:59
2

Use sort --debug to find out what's going on. I've used that to explain in detail your issue at: http://lists.gnu.org/archive/html/coreutils/2013-10/msg00004.html

pixelbeat
  • 30,615
  • 9
  • 51
  • 60
  • Thanks so much for the link! I finally understand my issue with numeric sort - in my case `sort` was ignoring parts of numbers after decimal dot `.` (as my locale uses coma `,`) and then the full line for secondary sort order. I coudn't figure out what order it was giving me, now I know. – jena Dec 05 '22 at 17:00
  • 1
    This is a common enough issue that we added more info to `sort --debug` re thousands grouping and decimal separators https://github.com/coreutils/coreutils/commit/bafff001 – pixelbeat Dec 07 '22 at 14:04
0

If you use

cat example.csv | sort

instead of

cat example.csv | sort -n --field-separator=,

then it would give correct output. Use this command, hope this is helpful to you.

Note: I tested with "sort (GNU coreutils) 7.4"

Shahbaz
  • 46,337
  • 19
  • 116
  • 182
Birendra Kumar
  • 431
  • 1
  • 7
  • 18
  • Thanks -- and this is interesting (I wonder why it gives this behavior). However it does not answer the question. The question is: when you ask unix to sort numeric, why doesn't it sort numeric? What is it doing in this case? – gabe Oct 07 '13 at 16:03
  • 2
    With no options, you are simply performing a lexicographic sort, which appears to work on this input because all the leading integer values are two-digit strings. – chepner Oct 07 '13 at 16:11