Input. Let's say I have a file named f with the following content:

+% H
% V

I can get the first column sorted in two ways:

First approach:

cat f | awk '{print $1}' | sort

Second approach:

cat f | sort -k 1 | awk '{print $1}'

Output & Question. It seems to me that both commands should give the same result, but they don't. The output of the first command is:

%
+%

and of the second is:

+%
%

If I swap H and V in the second column of the file, the output of the second command changes, but it shouldn't. The flag for stable sort doesn't change anything. All of this was tested with these bash versions:

GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)

and

GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu).

So, my question is: why do the outputs differ?
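
For anyone who wants to reproduce this, here is a minimal sketch, assuming a POSIX shell with `awk`, `sort` and `mktemp` available. The function names `first_approach` and `second_approach` and the temporary file location are only illustrative, and the file is passed directly to `awk` and `sort` instead of going through `cat`, which does not change the ordering. The result depends on the active locale, so no expected output is shown.

```shell
#!/bin/sh
# Recreate the two-line input file from the question in a temporary directory.
workdir=$(mktemp -d)
f="$workdir/f"
printf '+%% H\n%% V\n' > "$f"

# First approach: extract column 1, then sort it.
first_approach() {
    awk '{print $1}' "$1" | sort
}

# Second approach: sort the whole lines with -k 1, then extract column 1.
second_approach() {
    sort -k 1 "$1" | awk '{print $1}'
}

# Show the locale variables in effect, since they influence sort's collation.
echo "LANG=${LANG-} LC_ALL=${LC_ALL-} LC_COLLATE=${LC_COLLATE-}"

echo "first approach:"
first_approach "$f"
echo "second approach:"
second_approach "$f"
```

Swapping `H` and `V` in the `printf` line makes it easy to check whether the second column influences the second approach on a given system.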

Filipp Voronov
  • Locale: it will be to do with locale. Try with LANG=C immediately before the sorts (`cat f | awk '{print $1}' | LANG=C sort` and `cat f | LANG=C sort -k 1 | awk '{print $1}'`). If you have `LC_COLLATE` set in the environment, you may need to override that too/instead. – Jonathan Leffler Jul 05 '17 at 14:10
  • @JonathanLeffler, it works! But why? How does it affect `sort -k 1` but not `sort`? – Filipp Voronov Jul 05 '17 at 14:12
  • BTW, `sort` is not part of bash, and the bash version has absolutely no way of impacting how it operates. – Charles Duffy Jul 05 '17 at 14:14
  • As another aside -- it's more efficient to directly pass your files to programs that will process them, as opposed to passing a FIFO from `cat`. That's *especially* true for `sort`, which can parallelize for large files when given a seekable file descriptor, whereas a FIFO can only ever be read front-to-back. Thus, `sort file` or `sort – Charles Duffy Jul 05 '17 at 14:16
  • The first sort only has the two symbol sequences; there is nothing else for it to work on. The output should be the same regardless of locale. That, unfortunately, is the easy bit to explain. The `sort -k 1` behaviour will take considerable understanding. It's as though the locale setting makes it sort on alphanumerics if there are alphanumerics, even if the first part of the key has no alphanumerics. However, I don't understand the mechanism by which this occurs. One more thing to try: `sort -k 1,1` — that limits the sorting to the first column. – Jonathan Leffler Jul 05 '17 at 14:18
  • One reason I commented rather than answered was I hadn't tested my hypothesis. Another is that there are occasions when `sort` baffles me — it works sanely with the C locale, but not so intuitively with other locales. But the 'other locale' definition of collating order is well hidden. I'm not sure how you even find out what the collating rules are — and how they work. – Jonathan Leffler Jul 05 '17 at 14:21
  • I agree that there's a good question for which we don't have an obvious duplicate somewhere around here ("Why does GNU sort treat locale differently when -k is passed?" or such -- edited appropriately if the behavior is reproducible with BSD sort, busybox sort, or others). – Charles Duffy Jul 05 '17 at 16:12
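
The two suggestions from the comments can be tried side by side. This is only a sketch: it uses `LC_ALL=C`, which overrides both `LANG` and `LC_COLLATE`, in place of the `LANG=C` shown above, and the variable names and temporary file are made up for illustration. In the C locale `sort` compares bytes, and `%` (0x25) comes before `+` (0x2B), so both pipelines agree there. What `-k 1,1` prints under another locale is not asserted here.

```shell
#!/bin/sh
# Recreate the two-line input file from the question.
workdir=$(mktemp -d)
f="$workdir/f"
printf '+%% H\n%% V\n' > "$f"

# Suggestion 1: force the C locale for sort only (byte-wise comparison).
c_first=$(awk '{print $1}' "$f" | LC_ALL=C sort)
c_second=$(LC_ALL=C sort -k 1 "$f" | awk '{print $1}')

# Suggestion 2: keep the current locale but end the key at field 1.
# A bare -k 1 starts the key at field 1 and extends it to the end of the line,
# so the second column is part of the key; -k 1,1 stops after the first field.
key_limited=$(sort -k 1,1 "$f" | awk '{print $1}')

# Both C-locale results print % first, then +%.
printf 'C locale, first approach:\n%s\n' "$c_first"
printf 'C locale, second approach:\n%s\n' "$c_second"

# Locale-dependent result, printed for comparison only.
printf 'current locale, sort -k 1,1:\n%s\n' "$key_limited"
```

If the `-k 1,1` result matches the first approach on a system where plain `-k 1` does not, that points at the second column being compared as part of the key.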

0 Answers