23

Given a file txt:

ab
a c
a a

When calling sort txt, I obtain:

a a
ab
a c

In other words, this is not the ordering I expected: sort seems to delete or ignore the whitespace! I expected that behavior from sort -i, but it happens with or without the -i flag.

I would like to obtain "correct" sorting:

a a
a c
ab

How should I do that?

Stevoisiak
dagnelies
  • I've created your input file and used sort to provide the desired output without any problems. Was txt created on a *nix system? are you sure they are spaces and not some other kind of character? – marto Aug 03 '11 at 08:20
  • yeah, I actually typed this exact example in my command line ...using ubuntu default install, nearly out-of-the box, without fancy environment tweaking. – dagnelies Aug 03 '11 at 08:24
  • Please mark the correct solution as accepted rather than editing the question to read "Solved". – razlebe Aug 03 '11 at 09:19
  • Actually, that **is** proper sorting. It's called a library or dictionary sort, in which we only look at differences in letters, not in whitespace or punctuation. That's the default mode for the Unicode Collation Algorithm, at least until you hit Level 4. However, it is not the way Unix sort should be acting, because the Unix sort command is field-based, not text-based. – tchrist Aug 03 '11 at 13:07
  • possible duplicate of [unexpected result from gnu sort](http://stackoverflow.com/questions/2691821/unexpected-result-from-gnu-sort) – Cristian Ciupitu Mar 05 '15 at 17:10

7 Answers

28

Solved by:

export LC_ALL=C

From the sort documentation:

WARNING: The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

(works for ASCII at least, no idea for UTF8)
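For reference, here is a minimal sketch of the fix applied to the sample data from the question. It assumes a POSIX shell, and the variable name `sorted` is only illustrative. The assignment is written as a prefix on the sort command, so it applies to that one invocation and nothing is exported to the rest of the session.

```shell
# Feed the three lines from the question to sort, forcing byte-value
# collation for this single command only.
sorted=$(printf 'ab\na c\na a\n' | LC_ALL=C sort)

# A space (0x20) has a lower byte value than any letter, so the lines
# containing a space come first: "a a", "a c", "ab".
printf '%s\n' "$sorted"
```

The prefix form leaves the session locale alone, which should avoid the display problems that a global export can cause.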

Stevoisiak
dagnelies
  • 1
    It's because the help menu of `sort` says:: *** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values. – A. K. Aug 03 '11 at 08:31
  • 2
    @Aditya: ...yeah ...right ...what on earth is a "locale"? :p Why does it affect sorting? Why isn't default ordering used *by default*? (my LC_ALL was empty) Which kind of ordering is used in this case? ...sorry if all this doesn't seem obvious to me. – dagnelies Aug 03 '11 at 08:39
  • Yes that is the C locale, and it is byte-based. Discussed here: http://www.linuxquestions.org/questions/fedora-35/c-locale-and-system-locale-304562/ – Ray Toal Aug 03 '11 at 08:42
  • locale is an interesting concept. setting the locale can let you write non-english characters while you type. There is more you can do. You can have a good idea by looking here: http://www.linux.com/archive/feed/53781 – A. K. Aug 03 '11 at 08:42
  • 2
    ...yeah ...I just noticed `LC_ALL=C` broke my UTF8 character display ...so either I cannot sort them correctly or cannot display them correctly. Yay! – dagnelies Aug 03 '11 at 08:51
  • 2
    You don't have to export LC_ALL, just run it in-line - like `LC_ALL=C sort ...` in a single command. – CmdrMoozy Apr 14 '14 at 16:00
  • 2
    *"What on earth is a 'locale'? Why does it affect sorting? Why isn't default ordering used *by default*?"* -- There is no One Correct Sorting Order. Different people have different notions on how things should be sorted. Some of these are depending on the "locale", e.g. USA, or Germany. Hence, a computer "locale" is an environmental setting that influences sort order, upper-/lowercase conversions, number formats and so on -- so that those functions do whatever that locale considers "default". LC_ALL=C is the smallest common denominator; you are effectively telling the computer to "play dumb". – DevSolar Feb 27 '17 at 10:09
13

As mentioned before, LC_ALL=C sort does the trick. This is simply because different languages have different rules for sorting characters, and those rules are often laid out by senior linguists rather than CS experts. In the case of your locale, the rules seem to say that spaces ought to be ignored when sorting.

By prefixing LC_ALL=C (or, when LC_ALL is unset, LC_COLLATE=C suffices), you explicitly declare language-agnostic sorting (and, with LC_ALL, language-agnostic number formatting and so on), which is what you want in this context. If you want to make this your default, export LC_COLLATE=C in your environment.

The default is chosen this way to stay consistent with "normal", real-world sorting schemes (like the white pages), which often ignore spaces.
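A small sketch of the LC_COLLATE variant, assuming a POSIX shell. LC_ALL takes precedence over LC_COLLATE, so it is unset first; this happens inside a command substitution (a subshell), so the calling shell is not touched. The variable name `sorted` is illustrative.

```shell
sorted=$(
    # Subshell: the unset below does not leak into the calling shell.
    unset LC_ALL
    # Only collation is switched to C; other categories such as
    # LC_CTYPE keep whatever the environment provides.
    printf 'ab\na c\na a\n' | LC_COLLATE=C sort
)

printf '%s\n' "$sorted"
```

If LC_ALL were still set to a non-C locale, the LC_COLLATE prefix would have no effect, which is why the precedence matters here.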

thiton
3

Using the C locale, i.e. sorting just by byte values, is not a good solution for languages in which some letters fall outside the range [A-Za-z]. Such letters are represented as multiple bytes in UTF-8, so the byte-value collating order is not what one wants. Some characters may also have two equivalent representations (precomposed and decomposed).
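To make the byte-value issue concrete, here is a minimal sketch assuming a POSIX shell and UTF-8 input. The sample words and the variable name `sorted` are made up for illustration. The letter é is written as the octal escapes of its two UTF-8 bytes, so the snippet does not depend on the encoding of the script file.

```shell
# \303\251 is the UTF-8 encoding of "é" (U+00E9).
sorted=$(printf '\303\251clair\nzebra\neclair\n' | LC_ALL=C sort)

# Under byte-value collation the lead byte 0xC3 is greater than any
# ASCII letter, so the accented word lands after "zebra" rather than
# next to "eclair".
printf '%s\n' "$sorted"
```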

Nevertheless, the treatment of spaces is a problem. I tried the following:

$ cat stest  
a b  
a c  
ab  
a d  

$ sort stest  
ab  
a b  
a c  
a d  

$ sort -k 1,1 stest  
a b  
a c  
a d  
ab  

For my needs, -k 1,1 did the trick. Another, clumsier solution I tried was to change the spaces to some auxiliary character, sort, and then change the auxiliary characters back into blanks.

jnovo
koskenni
3

You could use the 'env' program to temporarily change your LC_COLLATE for the duration of the sort; e.g.

/usr/bin/env LC_COLLATE=POSIX /bin/sort file1 file2

It's a little cumbersome on the command line but if you're using it in a script should be transparent.

Colin
  • in a script you could define a function: `sort_posix() { env LC_COLLATE=POSIX sort "$@"; }` – myrdd Nov 21 '18 at 23:53
1

I have been looking at this for a little while, wanting to optimize a shell script I maintain that has a heavy international userbase. (heavy as in percentage, not quantity).

Most of the options I saw around the web and SO seem to recommend what I see here, setting the locale globally (overkill)

export LC_ALL=C

or prefixing it to each individual command, like this example from gnu.org (tedious):

$ echo abcdefghijklmnopqrstuvwxyz | LC_ALL=C /usr/xpg4/bin/tr 'a-z' 'A-Z'
ABCDEFGHIJKLMNOPQRSTUVWXYZ

I wanted to avoid clobbering the user's locale as an unseen side effect of running my program. This turned out to be easily accomplished, just as you would expect, by setting the variable inside the script only. The change stays within the script's own process, so the user's shell is unaffected.

I had to set LANG instead of LC_ALL for some reason, but all the individual locales were set which is functionally enough for me.

Here is the test, simple as can be

#!/bin/bash
# locale_checker.sh

# Report the current locale, then set LANG=C (for this script only) to optimize character sort and search.
echo "locale was $LANG"
LANG=C
locale

and output + proof that it is temporary and can be restricted to my script's process.

mateor@:~/snippets$ ./locale_checker.sh
locale was en_US.UTF-8
LANG=C
LANGUAGE=en_US:en
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
mateor@:~/snippets$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

There you go. You get the optimized locale without clobbering another person's innocent environment, and you avoid the tedium of prefixing it to every command where you think it may help.
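The same scoping can be sketched without a separate script file, assuming a POSIX shell. A command substitution runs in a subshell, which stands in here for the script's own process. This sketch uses LC_ALL rather than LANG, because LC_ALL overrides the other locale variables; the names `parent_lc_all` and `first_line` are illustrative only.

```shell
# Remember what the calling shell had before.
parent_lc_all=${LC_ALL-unset}

first_line=$(
    # Everything in here runs in a subshell, so the assignment and
    # export below are gone once it finishes.
    LC_ALL=C
    export LC_ALL
    printf 'ab\na c\n' | sort | head -n 1
)

echo "first line under C collation: $first_line"
echo "LC_ALL in the caller is still: ${LC_ALL-unset}"
```

Inside the subshell every command sees LC_ALL=C without needing its own prefix, and the caller's value is unchanged afterwards.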

mateor
0

Actually for me

$ cat txt
ab
a c
a a
$ sort txt
a a
a c
ab

I'll bet that between your a and c you have a non-breaking space, an en space, an em space, or some other high-codepoint space!

EDIT

Just ran it on Linux. I should have looked at the tags. Yes I get the same output you do! My first run was on the Mac. Looks like a difference between GNU and BSD. I will investigate further.

EDIT 2:

Linux uses a field-based sort.... still looking for how to suppress it. Tried

sort -t, txt

hoping to trick GNU into thinking the whole line was one field, but it still used the current locale to sort.

EDIT 3:

The OP solved the problem by setting the locale to C with

export LC_ALL=C

There seems to be no other approach. The sort command uses the current locale, and although the documentation often says that C (or its alias POSIX) is the default locale, on Linux a different one has probably been set for you. Enter locale -a to see the available locales. On my system:

$ locale -a
C
POSIX
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8

It seems like setting the locale to C (or its alias POSIX) is the only way to break the field-based behavior of sort and treat the whole line as one field. It is rather odd IMHO that this is how to do it. I would think the -t or -k options, or perhaps some new option would be a more sensible way to make this happen.

BTW, it looks like this question has been asked before on SO: [unexpected result from gnu sort](http://stackoverflow.com/questions/2691821/unexpected-result-from-gnu-sort).

Community
Ray Toal
  • hmmm ...strange ...i definitely have a plain normal space ...actually I typed the same example in my command line and have different results than you ...how odd. I'm using ubuntu default install btw, nearly out-of-the box, without fancy environment tweaking. – dagnelies Aug 03 '11 at 08:22
  • Well, I am getting the same output as @arnaud. – A. K. Aug 03 '11 at 08:25
  • See my answer to avoid clobbering the user's locale. – mateor Jun 19 '13 at 00:09
0

Weird, works here (cygwin).

Try sort -d txt.

Karoly Horvath