
I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:

Each one of them failed with the following errors (from the Perl side):

  • Input is not sorted: [----,] came after [($1]
  • Input is not sorted: [...] came after [&]
  • Input is not sorted: [($1] came after [1]

The only method that worked for me involved setting LC_ALL=C for sort, and using 8-bit characters in Perl. However, in this way Unicode strings are not properly ordered.

ikegami
Diomidis Spinellis
  • 2
    Are you calling `sort` properly? Unicode::Collate doesn't change the default behavior of `sort`; you have to use a custom comparison function. – cjm Sep 14 '14 at 17:55
  • I'm actually implementing merge-sort in Perl, so I'm not calling Perl's sort function. But, yes, I'm using code such as `print STDERR "Input is not sorted: [$key] came after [$prev]\n" if (defined($prev) && $Collator->cmp($key, $prev) < 0);` – Diomidis Spinellis Sep 14 '14 at 18:29
  • 2
    The actual Perl code (for 8-bit characters) is at https://github.com/dspinellis/sgsh/blob/master/sgsh-merge-sum.pl. It is designed to merge the output of multiple `sort | uniq -c` invocations. – Diomidis Spinellis Sep 14 '14 at 18:36
  • and you're using `export LANG=...; export LC_LOCALE=...`, right? Good luck. – shellter Sep 14 '14 at 19:35
  • 8
    Note that `sort` uses `LC_COLLATE`, not `LANG`. – ikegami Sep 14 '14 at 20:03
  • 2
    See also: http://stackoverflow.com/questions/20226851/how-do-locales-work-in-linux-posix-and-what-transformations-are-applied – ninjalj Sep 14 '14 at 22:54
  • @shellter: yes, I'm exporting the variables. – Diomidis Spinellis Sep 15 '14 at 06:03
  • @ikegami: sort(1) seems to take into account LANG as well. Running "(echo B ; echo a) | LANG=C sort" gives "B a", whereas running "(echo B ; echo a) | LANG=en_US.UTF-8 sort" gives "a B". I agree that LC_COLLATE is more specific. – Diomidis Spinellis Sep 15 '14 at 06:04
  • On linux, LANG has no effect as documented. Both give the same output (`B a` with `LC_COLLATE=C`, `a B` with `LC_COLLATE=en_US.UTF-8`). – ikegami Sep 15 '14 at 15:29
  • 3
    That is to be expected. The precedence is LC_ALL, then LC_COLLATE if LC_ALL is not set, then LANG if neither is set. See http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html – Diomidis Spinellis Sep 15 '14 at 17:26
  • Side note: Do you know about `sort -m`? I know Unicode sorting should be the same everywhere but different tools may yield different results due to bugs or using different revisions of Unicode (not sure if the sort order may ever change but I think so). – Palec Oct 06 '14 at 22:47
  • Thank you, yes, I know about sort -m, but it doesn't fit my purpose. I want to sum the output of multiple sort | uniq -c runs, and sort -m can't do that. – Diomidis Spinellis Oct 08 '14 at 04:45
  • Which version of Perl are you using? – b4hand Oct 29 '14 at 18:24
  • This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x86-multi-thread – Diomidis Spinellis Oct 30 '14 at 19:31
  • 1
    While the question says "Perl and the GNU/Linux sort", the previous comment says "MSWin32". There may be a mismatch if the output is generated on two systems (Linux and Windows) and then compared. Try running both perl and sort on Linux only. – Prem May 09 '15 at 18:10
  • Thank you for the suggestion. My question concerns a system that will be distributed as open source; I don't want to dictate where / how people will use it. Rather I was hoping that standards would provide a portable solution. – Diomidis Spinellis May 10 '15 at 19:10
  • Why not use `sort` through IPC::Run3? – ikegami May 28 '15 at 14:30
  • Because the Perl program's functionality (an extension to sort -m) is not covered by the functionality of sort. The examples in my question are just minimal use cases to demonstrate the problem. I was hoping that standards would offer a way for the two tools to behave in the same way. – Diomidis Spinellis Jun 24 '15 at 14:13
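
The precedence of the locale variables discussed in these comments can be checked with sort(1) itself. POSIX resolves each category from LC_ALL first, then from the specific LC_* variable, then from LANG. The following is only a minimal sketch: it assumes an `env` that accepts `-u` (GNU coreutils does), the variable names are made up for illustration, and it deliberately lets the C locale win each time so the result does not depend on which locales are installed.

```shell
#!/bin/sh
# Two lines whose order differs between byte order and most linguistic locales.
input=$(printf 'B\na\n')

# LC_ALL overrides everything: C order, whatever LC_COLLATE and LANG ask for.
by_lc_all=$(printf '%s\n' "$input" |
  LC_ALL=C LC_COLLATE=en_US.UTF-8 LANG=en_US.UTF-8 sort)

# With LC_ALL unset, LC_COLLATE takes precedence over LANG.
by_lc_collate=$(printf '%s\n' "$input" |
  env -u LC_ALL LC_COLLATE=C LANG=en_US.UTF-8 sort)

# With LC_ALL and LC_COLLATE both unset, LANG decides.
by_lang=$(printf '%s\n' "$input" |
  env -u LC_ALL -u LC_COLLATE LANG=C sort)

# Each result is "B" followed by "a", i.e. byte order.
printf '%s\n' "$by_lc_all" "$by_lc_collate" "$by_lang"
```

Swapping the `C` in any one of the three commands for an installed linguistic locale such as en_US.UTF-8 should flip that result to "a B", which matches the observations reported in the comments.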

2 Answers

4

Using Unicode::Collate or Unicode::Collate::Locale makes no sense. You're not trying to sort based on Unicode definitions, you're trying to sort based on your locale. That's what use locale; is for.

I don't know why you didn't get the desired order out of cmp under use locale;.

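You could process the decompressed files: expand each `uniq -c` line back into that many copies of its key, then let sort and uniq -c count them again.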

for q in file1.uniqc file2.uniqc ; do
   perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.


I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.

$ export LC_COLLATE=en_US.UTF-8

$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1

$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1

$ sort data
(
1
($1

Truth be told, it's the sort utility that's weird.


In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.
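
If the goal is only to verify that input is in the order the sort utility would produce, one option is to let sort do the check itself with its `-c` flag, which exits with a non-zero status when the input is out of order. This is a sketch, not a drop-in fix: `is_sorted_like_sort` is a hypothetical helper name, and the C locale is used here only because its byte order is predictable.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if stdin is already in the order that
# sort(1) would produce under the locale given as the first argument.
is_sorted_like_sort() {
  LC_ALL=$1 sort -c 2>/dev/null
}

# The three keys from the example above, in C (byte) order: ( < ($1 < 1
if printf '(\n($1\n1\n' | is_sorted_like_sort C; then
  c_order=sorted
else
  c_order=unsorted
fi

# The same check on deliberately misordered input.
if printf '1\n(\n' | is_sorted_like_sort C; then
  reversed=sorted
else
  reversed=unsorted
fi

echo "byte-ordered input: $c_order; reversed input: $reversed"
```

Passing the locale that generated the data instead of `C` makes the check use exactly the collation sort used, including any quirks around characters with undefined weights.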

ikegami
  • I'm benchmarking performance on a 20GB data set, so I can't afford a suboptimal solution. The case you describe is exactly the type of problem I'm facing. Note that I don't care a lot about the particular locale that will be used, as long as it works reasonably with Unicode strings (e.g. DUCET), and it works the same with sort(1) and Perl. – Diomidis Spinellis Sep 14 '14 at 20:56
  • Re "I'm benchmarking performance on a 20GB data set", so what was the result? – ikegami Sep 14 '14 at 21:45
  • RE "it works the same with sort(1) and Perl", Is that really true? Do you actually need to use the `sort` utility? – ikegami Sep 14 '14 at 21:47
  • 2
    Doesn't Perl use the UCA for sorting, while glibc uses ISO 14651? – ninjalj Sep 14 '14 at 22:51
  • @ninjalj, I thought locale-based sorting was defined by system files? (I heard about broken locale on machines many times.) – ikegami Sep 15 '14 at 00:03
  • I think this is due to characters without weight in the locale tables. ISO 14651 §6.2.2 says: _The ordering of characters with undefined weights with respect to other characters with undefined weights is not specified in this International Standard_, while the UCA §7.1.3 gives them implicit weights. – ninjalj Sep 15 '14 at 01:30
  • Re: "So what's the [benchmark] result?" On a single node 8-core machine I find sort(1) + Perl to be orders of magnitude faster than Hadoop. But Perl+sort work on 8-bit characters, whereas Hadoop works on Unicode. So I want to level the playing field. – Diomidis Spinellis Sep 15 '14 at 05:52
  • Regarding the difference between ISO 14651 and UCA I read that "The Common Tailorable Template (CTT) datafile of [the ISO 14651] Standard is aligned with the Default Unicode Collation Entity Table (DUCET) datafile of the Unicode Collation Algorithm (UCA) specified in Unicode Technical Standard #10." http://en.wikipedia.org/wiki/ISO_14651 Note that my input consists only of ASCII characters. – Diomidis Spinellis Sep 15 '14 at 05:59
  • No, the proposed solution. You implied it was suboptimal, so you must have tested it, right? – ikegami Sep 15 '14 at 15:26
  • I reasoned based on algorithmic complexity: O(N) of the original vs O(N log N) of the proposed solution, but you're right I should have tested it. Back soon with the benchmark results. – Diomidis Spinellis Sep 15 '14 at 17:40
  • I finished the benchmark. With Perl "uncompressing" the files and then running again "sort | uniq -c" execution time rose from 38:49 to 41:40. – Diomidis Spinellis Sep 15 '14 at 20:37
1

I can't answer directly, but I had problems getting a simple script to sort Serbian Latin text correctly. I found https://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html/, copied its setup (my actual processing is much simpler), and finally got the correct alphabetic sorting for that language and locale. The whole set of guides starting at https://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html/ covers about as much as anyone would need to know about Unicode linguistic sorting.

I assume you want to sort Greek. Here's a very simple version of what I copied and adapted from the guide, which sorts correctly.

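# min required setup for trial sort
use utf8;                              # the source itself contains UTF-8 literals
use v5.14; # for locale sorting and unicode_strings
use open qw(:std :encoding(UTF-8));    # encode output, avoiding "Wide character" warnings
use Unicode::Normalize;
use Unicode::Collate::Locale;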
my @words = qw{
        Η
        Ιθάκη
        σ'
        έδωσε
        το
        ωραίο
        ταξίδι.
        Χωρίς
        αυτήν
        δεν
        θάβγαινες
        στον
        δρόμο.
};
print "Unsorted: @words\n";
my $coll = Unicode::Collate::Locale->new( locale => "el_GR" );
my @sorted_words = $coll->sort(@words);
print "Sorted: @sorted_words\n";
Peter H