
I have a text file with lines of UTF-8 encoded text:

mac-os-x$ cat unsorted.txt
ウ
foo
チ
'foo'
津

In case it helps to reproduce the problem, here is a checksum and a dump of the exact bytes in the file, as well as how you could generate the file yourself (on Linux, use base64 -d instead of -D):

mac-os-x$ shasum unsorted.txt
a6d0b708d3e0cafb0c6e1af7450e9243da8cb078  unsorted.txt

mac-os-x$ perl -ne 'print join(" ", map { sprintf "%02x", ord } split //), "\n"' unsorted.txt
e3 82 a6 0a
66 6f 6f 0a
e3 83 81 0a
27 66 6f 6f 27 0a
e6 b4 a5 0a

mac-os-x$ echo 44KmCmZvbwrjg4EKJ2ZvbycK5rSlCg== | base64 -D > unsorted.txt

When I sort this input file on Mac OS X (whether I use GNU sort 5.93, which ships with Mac OS X Yosemite, or a Homebrew-installed GNU sort 8.23), I get this sorted result:

mac-os-x$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
'foo'
foo
ウ
チ
津

mac-os-x$ echo `sw_vers -productName` `sw_vers -productVersion`
Mac OS X 10.10.1

mac-os-x$ /usr/bin/sort --version | head -1
sort (GNU coreutils) 5.93

When I sort the same file, with the same locale settings, on Linux (I tested on both CentOS 5.5 and CentOS 6.5), I get a different result:

linux-centos-6.5$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /bin/sort unsorted.txt
ウ
チ
foo
'foo'
津

linux-centos-6.5$ cat /etc/redhat-release
CentOS release 6.5 (Final)

linux-centos-6.5$ /bin/sort --version | head -1
sort (GNU coreutils) 8.4

Note the different locations of the Japanese kana vs. the English, and the different sort order between two lines that differ only by the single quotes.

To add another variant to the mix, I notice that on a very old FreeBSD 6 box I have, I get the same sort order as OS X:

freebsd-6.0$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
'foo'
foo
ウ
チ
津

freebsd-6.0$ uname -rs
FreeBSD 6.0-RELEASE

freebsd-6.0$ sort --version | head -1
sort (GNU coreutils) 5.3.0-20040812-FreeBSD

I expected the sort order to be the same in each case, given that all cases are using GNU sort, all with the same locale settings. I tried explicitly setting LC_COLLATE separately, and tried using LC_COLLATE=C to force a sort by byte order, but that did not change any results.

Why does my example input file sort differently across OS X and Linux? And how could I force both systems to produce identically sorted text (I don't care which variant, as long as it is consistent between the two)?

Andrew H.
  • `LC_ALL` overrides `LC_*`. https://www.gnu.org/savannah-checkouts/gnu/libc/manual/html_node/Locale-Categories.html – Cedric Han Dec 10 '14 at 08:39
  • It seems that setting LC_ALL=C will make the sort order match the OS X and FreeBSD variant on Linux. So that answers the second part of my question (how could I force both systems to produce identically sorted text?). I'm still struggling to understand the first part of the question: why does the input file sort differently in the first place? As @CedricHan pointed out, my assumption that LC_COLLATE would win over LC_ALL for the purposes of sorting was wrong; but how can I duplicate the Linux sort order under OS X? – Andrew H. Dec 10 '14 at 08:57
  • sort uses collation tables provided by the OS. If different OSes provide different collation tables, then you will get different results. It doesn't make much sense to expect any particular collation order for non-Latin scripts in an English-based locale. `ja_JP.UTF-8` has a different collation order from `en_US.UTF-8` which is different from `C`. – n. m. could be an AI Dec 10 '14 at 09:17
  • It is possible that on mac and on freebsd locale-sensitive collation is disabled (or not supported in the first place, if the system is very old). Try sorting with LC_ALL set to C, en_US.UTF-8 and ja_JP.UTF-8 on all these systems. Throw in some more scripts and stuff, like capital letters and לחיעכ and кйхгф and — and «». It is also possible that on BSDish systems locale names are case sensitive, try UTF-8 instead of utf-8. – n. m. could be an AI Dec 10 '14 at 09:35
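
The suggestions in the comments above can be tried with a short script. This is a minimal sketch, not a definitive recipe: it recreates the sample file from the hex dump in the question, sorts it under each of the three locales mentioned, and keeps the LC_ALL=C result in a variable. The temporary directory and the variable names (tmpdir, file, sorted_c) are assumptions introduced for illustration. If a locale is not installed, sort typically falls back to byte order, so the output for en_US.UTF-8 and ja_JP.UTF-8 will vary between systems.

```shell
# Hypothetical scratch location for the sample file.
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/sortdemo.XXXXXX")
file="$tmpdir/unsorted.txt"

# Recreate the five lines from the question; the octal escapes are the
# bytes shown in the hex dump (e3 82 a6, e3 83 81, 27, e6 b4 a5).
printf '\343\202\246\nfoo\n\343\203\201\n\047foo\047\n\346\264\245\n' > "$file"

# Sort under each locale named in the comments. LC_ALL overrides LANG and
# every LC_* category, so it is the only variable that needs to be set.
for loc in C en_US.UTF-8 ja_JP.UTF-8; do
    echo "== LC_ALL=$loc =="
    env LC_ALL="$loc" sort "$file"
done

# Keep the byte-order result for later comparison.
sorted_c=$(env LC_ALL=C sort "$file")
```

With LC_ALL=C the comparison is byte-wise, so the sample sorts as 'foo', foo, ウ, チ, 津, which is the order OS X and FreeBSD produced in the question.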

1 Answer


It seems that your Linux sort is not preserving proper Unicode order.

The Unicode code points (in hex) of the first character of each line in your unsorted.txt are:

ウ - 30A6

foo - 0066

チ - 30C1

'foo' - 0027

津 - 6D25

taken from http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E3%82%A6&mode=char

So the proper sort order according to the Unicode Collation Algorithm (http://www.unicode.org/Public/UCA/latest/allkeys.txt) would be:

'foo' - line 487

foo - line 8966

ウ - line 20875

チ - line 21004

津 - not in file

So, to answer your question: your Linux machine is providing the wrong collation tables to the sort function. Unfortunately, I can't tell what the reason for that might be.

PS: There's similar question to yours here.

EDIT

As @ninjalj noticed, glibc doesn't use the UCA, but ISO-14651 instead. This bug report suggests migrating to the UCA. Unfortunately, it's still not resolved.

Also, it could be somehow connected with the question about ls case insensitivity on Mac OS X. Some people even suggest that it has something to do with the HFS filesystem.

Paweł Tomkiel
  • glibc doesn't use the UCA, it uses ISO-14651. Among other differences, ISO-14651 doesn't provide a default ordering for collation elements with no defined weights on the collation tables. – ninjalj Nov 05 '15 at 19:14
  • Thank you for pointing that out, I'll dig into that tonight to edit my answer. My first catch would be somewhere here https://sourceware.org/bugzilla/show_bug.cgi?id=14095 – Paweł Tomkiel Nov 06 '15 at 10:35
  • The proper sorting order depends on the locale, there is no such thing as "proper UTF-8 order". – Rasmus Kaj Aug 18 '17 at 15:48