I tried to look this up in the man page of the sort command, but could not find anything.
Consider the following text file, t.txt (note the leading space on the first line):
 11
1 0
Hex dump of t.txt:
$ xxd -p t.txt
2031310a3120300a
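To reproduce the file byte for byte, something like the following should work. It is only a sketch: it uses od rather than xxd because od is specified by POSIX, and the variable name dump is just illustrative. The result should match the dump shown above.

```shell
# Write the two lines exactly, including the leading space on the first line.
printf ' 11\n1 0\n' > t.txt

# Dump the bytes as hex and strip od's spacing so the output resembles `xxd -p`.
dump=$(od -An -tx1 t.txt | tr -d ' \n')
echo "$dump"
```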
Running sort on this file with LC_COLLATE="en_US.UTF-8" gives:
$ LC_COLLATE="en_US.UTF-8" sort t.txt
1 0
 11
If we examine the first character position (or column) in the file, we observe that the first row has a space and the second row has a 1. Since a space has the hexadecimal value 0x20, which is less than the hexadecimal value of 1 (0x31), I would expect sort to give:
 11
1 0
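The byte values quoted above can be double-checked from the shell. This is a small sketch of the reasoning only; the variable names are made up for illustration.

```shell
# A leading quote in a printf argument yields the numeric value
# of the character that follows it.
space_hex=$(printf '%x' "' ")
one_hex=$(printf '%x' "'1")
echo "space=0x$space_hex one=0x$one_hex"

# Compare the two values the way a plain byte-wise comparison would.
if [ $((0x$space_hex)) -lt $((0x$one_hex)) ]; then
    echo "space sorts before 1 in byte order"
fi
```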
It turns out that the expected sort order can be obtained using LC_COLLATE=C:
$ LC_COLLATE=C sort t.txt
 11
1 0
What is the reason for the difference between LC_COLLATE="en_US.UTF-8" and LC_COLLATE=C in this case?
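For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes mktemp is available and that en_US.UTF-8 may or may not be installed. It also clears LC_ALL first, because LC_ALL, when set, takes precedence over LC_COLLATE and would hide the difference. The names tmp, c_sorted and en_sorted are just illustrative.

```shell
# Recreate the two-line file (leading space on line 1) in a temporary file.
tmp=$(mktemp)
printf ' 11\n1 0\n' > "$tmp"

# LC_ALL overrides LC_COLLATE when set, so make sure it is not in the way.
unset LC_ALL

# Byte-order result from the C locale.
c_sorted=$(LC_COLLATE=C sort "$tmp")
printf 'C:\n%s\n' "$c_sorted"

# Locale-aware result, only attempted if en_US.UTF-8 is installed here.
if locale -a 2>/dev/null | grep -qiE '^en_US\.utf-?8$'; then
    en_sorted=$(LC_COLLATE=en_US.UTF-8 sort "$tmp")
    printf 'en_US.UTF-8:\n%s\n' "$en_sorted"
else
    echo 'en_US.UTF-8 is not installed; skipping that run' >&2
fi

rm -f "$tmp"
```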
See also:
- What does “LC_ALL=C” do?
- Why does ls sorting ignore non-alphanumeric characters?
- How do locales work in Linux / POSIX and what transformations are applied?
- Internationalization: Collate (Sort) Order, Character Set, Accents, GLOB patterns
Edit:
Some more information about this issue was found here: