
I tried to look this up in the man pages of the sort command, but could not find anything. So consider the following text file t.txt:

 11
1 0

(Binary representation of t.txt

$ xxd -p t.txt
2031310a3120300a

)
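To reproduce the file exactly, including the leading space on the first line, something like the following sketch may help. It assumes a POSIX shell with `printf`, `od`, `tr`, and `mktemp` available; the `tmpdir` and `hex` variable names are only illustrative, and `od` stands in for `xxd` in case the latter is not installed.

```shell
# Recreate t.txt in a scratch directory (the directory is an assumption, any path works).
tmpdir=$(mktemp -d)
printf ' 11\n1 0\n' > "$tmpdir/t.txt"

# Dump the bytes as one hex string, comparable to the `xxd -p` output above.
hex=$(od -An -tx1 "$tmpdir/t.txt" | tr -d ' \n')
printf '%s\n' "$hex"    # 2031310a3120300a
```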

Using LC_COLLATE="en_US.UTF-8" with sort on this file gives:

$ LC_COLLATE="en_US.UTF-8" sort t.txt
1 0
 11

If we examine the first character position (column) in the file, the first row has a space and the second row has a 1. Since space has the hexadecimal value 0x20, which is less than the value of 1 (0x31), I would expect sort to give:

 11
1 0

It turns out that the expected sorting order can be obtained using LC_COLLATE=c:

$ LC_COLLATE=c sort t.txt
 11
1 0

What is the reason for the difference between LC_COLLATE="en_US.UTF-8" and LC_COLLATE=c for this case?

See also:

Edit:

Some more information about this issue was found here:

Håkon Hægland
  • It depends on your locale. Check for example `LC_ALL=C sort file`, that gives `A 11` first. See http://www.manpagez.com/info/coreutils/coreutils_196.php#SEC196 – fedorqui May 14 '14 at 16:34
  • @fedorqui But why does it not work without `LC_ALL=C` ? (`echo $LANG` gives `en_US.UTF-8`) – Håkon Hægland May 14 '14 at 16:45
  • @HåkonHægland The simple answer is "because the sorting rules are different in different locales". The full answer is probably quite a bit more complex... – twalberg May 14 '14 at 19:25

1 Answer


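Punctuation is ignored when ordering in the en_US locale. The output in the question is consistent with the space being treated the same way, so the two lines effectively compare as 11 and 10.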
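As a rough illustration only, the effect can be imitated without an en_US locale installed. The sketch below is an assumption about the first comparison pass, not the actual collation algorithm: it deletes blanks and punctuation with `tr` and shows what is left to compare. The helper `strip_ignored` and the variable names are hypothetical.

```shell
# Rough emulation (an assumption, not the real locale rules):
# drop blanks and punctuation, then look at what remains.
strip_ignored() {
    printf '%s' "$1" | LC_ALL=C tr -d '[:blank:][:punct:]'
}

key_a=$(strip_ignored ' 11')   # leaves 11
key_b=$(strip_ignored '1 0')   # leaves 10
printf 'keys: %s vs %s\n' "$key_a" "$key_b"

# For contrast, the C locale compares raw bytes, so the leading space (0x20) sorts first.
c_order=$(printf ' 11\n1 0\n' | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

Since 10 sorts before 11, the line `1 0` comes first once the space no longer counts, whereas the byte-wise comparison puts ` 11` first.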

Note that sort can explicitly skip leading whitespace with the -b option, but it is tricky to use, so I'd advise checking the result with the sort --debug option.
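A minimal sketch of those two options, assuming GNU sort: `LC_ALL=C` is set so that the comparison is byte-wise and only `-b` changes the result, and the `--debug` call is guarded because it is a GNU extension. The variable name `b_order` is illustrative.

```shell
# With -b the leading blank of " 11" is skipped, so "11" is compared against "1 0".
b_order=$(printf ' 11\n1 0\n' | LC_ALL=C sort -b)
printf '%s\n' "$b_order"

# --debug annotates which part of each line was used as the key (GNU sort only).
if sort --debug </dev/null >/dev/null 2>&1; then
    printf ' 11\n1 0\n' | LC_ALL=C sort -b --debug
fi
```

Here `-b` happens to produce the same order as the en_US locale, but for a different reason: only the leading blank is skipped, and the space inside `1 0` still takes part in the comparison.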

pixelbeat
  • Thanks! That is interesting. I also found some more information here: [In utf-8 collation, why 11- is less then 1-?](http://superuser.com/questions/227925/in-utf-8-collation-why-11-is-less-then-1) and [UNICODE COLLATION ALGORITHM](http://unicode.org/reports/tr10/). – Håkon Hægland May 18 '14 at 17:41