Sorting UTF-8 input

Question

I need to sort lines from a file saved as UTF-8. The lines can start with Cyrillic or Latin characters. My code sorts the Cyrillic ones incorrectly.

sub sort_by_default  {
    my @sorted_lines = sort {
        $a <=> $b
          ||
        fc( $a) cmp fc($b)
     } @_;
}

open(FILE, "$address") or die "Can't open file: $!\n"; my @file = <FILE>; close (FILE); — D.123perl456, Mar 10 '18 at 23:45
Where's the rest of the demonstration of the problem??? What problem are you having??? — ikegami, Mar 11 '18 at 01:29

zdim · Answer 1 · 2022-01-31T03:02:51.063

The cmp used with sort can't help with this; it has no notion of encodings and merely compares by codepoint, character by character, with surprises in many languages. Use Unicode::Collate.^† See this post for a bit more, and for far more see this post by tchrist and this perl.com article.

The other issue is reading (decoding) input and writing (encoding) output in UTF-8 correctly. One way to ensure that data on the standard streams is handled correctly is the open pragma, with which you can set "layers" so that input and output are decoded/encoded as data is read/written.

Altogether, an example

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

my $file = ...;

open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;

my $uc  = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);

say for @sorted;

The module's cmp method can be used for individual comparisons (if the data is in a complex data structure and not just a flat list of lines, for instance)

my @sorted = sort { $uc->cmp($a, $b) } @data;

where $a and $b need to be used suitably to extract what to compare from each element of @data.
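As a minimal sketch of sorting a nested structure with the cmp method: the records, the name field and the id field below are made up for illustration, and the ordering assumes the module's default collation table.

```perl
use strict;
use warnings;
use utf8;                       # the string literals below are UTF-8 in the source
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical records; only the 'name' field is used for ordering
my @data = (
    { name => 'Владимир', id => 1 },
    { name => 'Peter',    id => 2 },
    { name => 'Борис',    id => 3 },
    { name => 'John',     id => 4 },
);

my $uc = Unicode::Collate->new;

# Pull the field to compare out of each element inside the sort block
my @sorted = sort { $uc->cmp($a->{name}, $b->{name}) } @data;

print "$_->{name}\n" for @sorted;
```

With the default table the Latin names should come out before the Cyrillic ones, each group in its own alphabetical order.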

If you have UTF-8 data right in the source you need the utf8 pragma (use utf8;), while if you receive UTF-8 via yet other channels (@ARGV included) you may need to manually Encode::decode those strings.
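For command-line arguments, a sketch along these lines may do. It assumes the shell hands the arguments over as UTF-8 encoded bytes, and decode_args is a helper name invented here, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 encoded byte strings into character strings
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

# Assumption: @ARGV holds UTF-8 encoded bytes
my @args = decode_args(@ARGV);

binmode STDOUT, ':encoding(UTF-8)';
printf "%s has %d character(s)\n", $_, length $_ for @args;
```

After decoding, length counts characters and not bytes, which is also what Unicode::Collate expects to receive.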

Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this Effective Perler article on custom sorting.

^† Example: by codepoint comparison ä > b while the accepted order in German is ä < b

perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)"; 
    @s = qw(ä b); 
    say join " ", sort { $a cmp $b } @s;             #-->  b ä
    say join " ", Unicode::Collate->new->sort(@s);   #-->  ä b
'

so we need to use Unicode::Collate (or a custom sort routine).

*"The `cmp` used with `sort` can't help with this; it has no notion of encodings and merely compares character by character"* I would have thought that was exactly what you wanted as long as the strings aren't encoded. — Borodin, Mar 11 '18 at 01:45
@Borodin But with strings that come from an encoding other than ascii that will result in a wrong sort? The linked post is a good example of this problem, I think. — zdim, Mar 11 '18 at 01:48
The main thing to be aware of is that all the Unicode *graphemes* should be normalised to either *composed* or *decomposed* (i.e. as single characters or as basic characters followed by a combining mark). I don't see how the original encoding can affect that. — Borodin, Mar 11 '18 at 02:12

score 2 · Answer 2 · answered Mar 10 '18 at 23:56

To open a file saved as UTF-8, use the appropriate layer:

open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;

Don't forget to set the same layer for the output.

#! /usr/bin/perl
use warnings;
use strict;

binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;

__DATA__
Борис
Peter
John
Владимир

Javier Elices · Answer 3 · 2018-03-11T17:02:16.053

The key to handling UTF-8 correctly in Perl is to make sure that Perl knows that a given source or destination of data is UTF-8. How this is done depends on how you get data in or out. If the UTF-8 is coming from an input file, open the file like this:

open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";

If you are going to have UTF-8 inside the source of your script, then make sure you have:

use utf8;

at the beginning of the script.

If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:

binmode(STDIN, ':encoding(UTF-8)');

For STDOUT use:

binmode(STDOUT, ':encoding(UTF-8)');

Also, make sure you read "UTF-8 vs. utf8 vs. UTF8" to know the difference between the encoding names. utf8 or UTF8 accepts valid UTF-8 and also invalid UTF-8 (according to the first proposed UTF-8 standard) and will not complain about invalid code points. UTF-8 accepts valid UTF-8 but rejects invalid code point combinations; it is a short name for utf-8-strict. You may also read the question "How do I sanitize invalid UTF-8 in Perl?".
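To see which encoding a given name resolves to, one can ask Encode directly. This is only a small sketch; the list of names tried is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Print the canonical encoding name behind each spelling
for my $name ('UTF-8', 'utf-8', 'UTF8', 'utf8') {
    printf "%-6s => %s\n", $name, find_encoding($name)->name;
}
```

The hyphenated spellings should report utf-8-strict and the unhyphenated ones utf8, which is why :encoding(UTF-8) is the stricter choice for input layers.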

Finally, following @zdim's advice, you may use at the beginning of the script:

use open ':encoding(UTF-8)';

There are other variants, as described here. This sets the encoding layer for all open calls that do not specify a layer explicitly.

Or you can use `open pragma` for all standard streams — zdim, Mar 11 '18 at 01:39
