The cmp operator used with sort can't help with this; it has no notion of encodings and merely compares by codepoint, character by character, with surprises in many languages. Use Unicode::Collate.† See this post for a bit more, and this post by tchrist and this perl.com article for far more.
The other issue is reading (decoding) input and writing (encoding) output in utf8 correctly. One way to ensure that data on standard streams, and on files opened in its scope, is handled correctly is via the open pragma, with which you can set "layers" so that input is decoded and output is encoded as data is read and written.
Altogether, an example:
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";  # UTF-8 layers for standard streams and for files opened below
my $file = ...;
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);  # order by collation rules, not by codepoint
say for @sorted;
The module's cmp method can be used for individual comparisons (if data is in a complex data structure and not just a flat list of lines, for instance)
my @sorted = sort { $uc->cmp($a, $b) } @data;
where, in place of the bare $a and $b, one uses expressions that extract what to compare from the elements of @data.
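As a sketch of such a comparison on structured data, here is a sort of hash references by one field. The records and the field names (name, id) are made up for illustration; only the call to $uc->cmp inside the sort block is the point.

```perl
use strict;
use warnings;
use utf8;                 # the string literals below are UTF-8 in the source
use Unicode::Collate;

# Hypothetical records; only the 'name' field takes part in the comparison
my @data = (
    { name => 'b', id => 1 },
    { name => 'ä', id => 2 },
    { name => 'a', id => 3 },
);

my $uc = Unicode::Collate->new;

# Pull the field out of each element and let the collator compare those
my @sorted = sort { $uc->cmp($a->{name}, $b->{name}) } @data;

# The names now come out as: a, ä, b
my @names = map { $_->{name} } @sorted;
```

To print @names, the output handle still needs an encoding layer, as in the full example above.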
If you have utf8 data right in the source you need the utf8 pragma (use utf8), while if you receive utf8 via yet other channels (from @ARGV included) you may need to manually Encode::decode those strings.
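A minimal sketch of that manual decoding for command-line arguments, assuming the shell passes them as UTF-8 encoded bytes (which depends on the terminal and locale). The helper name decode_args is made up for this example; Encode::decode is the only library call used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 encoded byte strings into character strings
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

# Assumption: the arguments arrived as UTF-8 bytes
my @args = decode_args(@ARGV);
```

After this, @args holds character strings that can go to Unicode::Collate, while @ARGV itself is left as it was.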
Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this Effective Perler article on custom sorting.
† Example: by codepoint comparison ä > b, while the accepted order in German is ä < b
perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
@s = qw(ä b);
say join " ", sort { $a cmp $b } @s; #--> b ä
say join " ", Unicode::Collate->new->sort(@s); #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).