
I was puzzled when ls listed the following files in a strange order:

Star Wars Episode II - Attack of the Clones (2002) BDRip.mkv
Star Wars Episode III - Revenge of the Sith (2005) BDRip.mkv
Star Wars Episode I - The Phantom Menace (1999) BDRip.mkv
Star Wars Episode IV - A New Hope (1977) BDRip.mkv
Star Wars Episode VI - Return of the Jedi (1983) BDRip.mkv
Star Wars Episode V - The Empire Strikes Back (1980) BDRip.mkv

From a human perspective, 'I' should come first, then 'II', and so on.

So I created a file with the following content:

$ cat 1
Star Wars Episode II - Attack
Star Wars Episode III - Revenge
Star Wars Episode I - The
Star Wars Episode IV - A
Star Wars Episode VI - Return
Star Wars Episode V - The

Sorting it gives me this:

$ sort 1
Star Wars Episode II - Attack
Star Wars Episode III - Revenge
Star Wars Episode I - The
Star Wars Episode IV - A
Star Wars Episode VI - Return
Star Wars Episode V - The

However, if I remove the '-' and everything after it, the file sorts correctly:

$ cat 1
Star Wars Episode II 
Star Wars Episode III 
Star Wars Episode I 
Star Wars Episode IV 
Star Wars Episode VI 
Star Wars Episode V 

$ sort 1
Star Wars Episode I 
Star Wars Episode II 
Star Wars Episode III 
Star Wars Episode IV 
Star Wars Episode V 
Star Wars Episode VI 

But as soon as I add any character after the space, the sort order becomes unpredictable to me:

$ cat 1
Star Wars Episode II y
Star Wars Episode III x
Star Wars Episode I z
Star Wars Episode IV w
Star Wars Episode VI v
Star Wars Episode V u

$ sort 1
Star Wars Episode III x
Star Wars Episode II y
Star Wars Episode IV w
Star Wars Episode I z
Star Wars Episode VI v
Star Wars Episode V u

Any hints on this sort behaviour?

Update: sort: using ‘en_CA.UTF-8’ sorting rules

Update #2: as per the comment below, it is because of the locale.

$ ls | LANG=C sort
Star Wars Episode I - The Phantom Menace (1999) BDRip.mkv
Star Wars Episode II - Attack of the Clones (2002) BDRip.mkv
Star Wars Episode III - Revenge of the Sith (2005) BDRip.mkv
Star Wars Episode IV - A New Hope (1977) BDRip.mkv
Star Wars Episode V - The Empire Strikes Back (1980) BDRip.mkv
Star Wars Episode VI - Return of the Jedi (1983) BDRip.mkv
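
The byte-order result can be reproduced without the actual files. The following is a minimal sketch using the shortened titles from the question; the variable names `titles` and `c_sorted` are made up for illustration. It uses `LC_ALL=C` rather than `LANG=C` because `LC_ALL` takes precedence over both `LC_COLLATE` and `LANG`, so it should still take effect when one of those is already set in the environment.

```shell
# Shortened sample titles from the question, in the order ls printed them.
titles='Star Wars Episode II - Attack
Star Wars Episode III - Revenge
Star Wars Episode I - The
Star Wars Episode IV - A
Star Wars Episode VI - Return
Star Wars Episode V - The'

# LC_ALL=C forces plain byte-by-byte comparison for this one command only.
c_sorted=$(printf '%s\n' "$titles" | LC_ALL=C sort)

printf '%s\n' "$c_sorted"
```

On this sample the lines come out in the order I, II, III, IV, V, VI.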

Why does a UTF8 locale make a difference, then? I checked with ru_RU.UTF8 (incorrect sorting) and ru_RU.KOI8-R (proper sorting).

Update #3: It is about the locale: http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

stimur
  • Prepending `LC_ALL=C` makes it work so it must have something to do with the locale. – Ramchandra Apte Dec 05 '13 at 16:39
  • http://www.unix.com/showthread.php?t=156805 scripts for sorting files with roman numerals – andrybak Dec 05 '13 at 16:40
  • Is "ii" a digraph in the ru_RU locale that sorts before "i" (when it's not being considered a Roman numeral)? A quick Google shows that there have been bugs reported against the ru_RU.UTF8 locale for collation order issues, so it's entirely possible that's part of what you're seeing... – twalberg Dec 05 '13 at 18:08
  • Please look at my answer below and the update to the original question. It is the default behaviour of UTF8 locales, at least those I have worked with: they ignore whitespace. My original issue was not related to the ru_RU.* locales, but to *.UTF8 in general and en_CA.UTF8 in particular. – stimur Dec 05 '13 at 18:18

2 Answers

I think I found the proper explanation for this:

GNU coreutils FAQ: Sort does not sort in normal order

Found it via: sort not sorting as expected (space and locale)

stimur

With a locale-based sort, it effectively ignores all non-alphanumeric characters, so only the first three letters after "Episode" decide the order:

II - Attack   -> "IIA"
III - Revenge -> "III"
I - The       -> "ITh"
IV - A        -> "IVA"
VI - Return   -> "VIR"
V - The       -> "VTh"
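
The mapping above can be approximated in a few lines of shell. This is only a rough sketch of the behaviour described here, not how the C library actually implements collation: it builds a key by deleting every non-alphanumeric character and folding case, sorts on that key in byte order, then discards the key. The names `titles`, `tab`, and `approx` are illustrative.

```shell
# Shortened sample titles from the question.
titles='Star Wars Episode II - Attack
Star Wars Episode III - Revenge
Star Wars Episode I - The
Star Wars Episode IV - A
Star Wars Episode VI - Return
Star Wars Episode V - The'

tab=$(printf '\t')

approx=$(printf '%s\n' "$titles" |
  while IFS= read -r line; do
    # Key: letters and digits only, lower-cased (an assumed stand-in
    # for the locale's first comparison pass).
    key=$(printf '%s' "$line" | LC_ALL=C tr -cd '[:alnum:]' | LC_ALL=C tr '[:upper:]' '[:lower:]')
    printf '%s\t%s\n' "$key" "$line"
  done |
  LC_ALL=C sort -t "$tab" -k1,1 |   # sort on the key column in byte order
  cut -f2-)                         # drop the key, keep the original line

printf '%s\n' "$approx"
```

On the sample data this prints the same II, III, I, IV, VI, V order that the en_CA.UTF-8 sort produced in the question.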

With LC_ALL=C, the space character is sorted in front of alphanumerics:

I - The       -> "I -"
II - Attack   -> "II "
III - Revenge -> "III"
IV - A        -> "IV "
V - The       -> "V -"
VI - Return   -> "VI "

So it is a coincidence that this works: in byte order "IX" sorts between "IV" and "V", so it only takes three more movies for it to fail.
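
How far the coincidence holds can be checked by sorting the numerals themselves in byte order. A small sketch, with `numerals_sorted` as an illustrative variable name:

```shell
# Sort the Roman numerals I..X byte-by-byte and join them on one line.
numerals_sorted=$(printf '%s\n' I II III IV V VI VII VIII IX X | LC_ALL=C sort | paste -sd ' ' -)

printf '%s\n' "$numerals_sorted"
# prints: I II III IV IX V VI VII VIII X
```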

Simon Richter