I'm trying to figure out how to sort an array alphabetically in Perl. Here is what I have, which works fine in English:

    # List of countries (kept like this to keep it clean, as it's re-used in other places)
    my $countries = {
        'AT' => "íAustria",
        'AU' => "Australia",
        'BE' => "Belgium",
        'BG' => "Bulgaria",
        'CA' => "Canada",
        'CY' => "Cyprus",
        'CZ' => "Czech Republic",
        'DK' => "Denmark",
        'EN' => "England",
        'EE' => "Estonia",
        'FI' => "Finland",
        'FR' => "France",
        'DE' => "Germany",
        'GB' => "Great Britain",
        'GR' => "Greece",
        'HU' => "Hungary",
        'IE' => "Ireland",
        'IT' => "Italy",
        'LV' => "Latvia",
        'LT' => "Lithuania",
        'LU' => "Luxembourg",
        'MT' => "Malta",
        'NZ' => "New Zealand",
        'NL' => "Netherlands",
        'PL' => "Poland",
        'PT' => "Portugal",
        'RO' => "Romania",
        'SK' => "Slovakia",
        'SI' => "Slovenia",
        'ES' => "Spain",
        'SE' => "Sweden",
        'CH' => "Switzerland",
        'SC' => "Scotland",
        'UK' => "United Kingdom",
        'US' => "USA",
        'TK' => "Turkey",
        'NO' => "Norway",
        'MX' => "Mexico",
        'IL' => "Israel",
        'IN' => "India",
        'IS' => "Iceland",
        'CN' => "China",
        'JP' => "Japan",
        'VN' => "áVietnamí"
    };
    # Populate the original loop with "name" and "code"
    my @country_loop_orig;
    print $IN->header;
    foreach (keys %{$countries}) {
      push @country_loop_orig, {
        name => $countries->{$_},   # the hash shown above is flat: code => name
        code => $_
      };
    }

    # Sort it alphabetically
    my @country_loop = sort { lc($a->{name}) cmp lc($b->{name}) } @country_loop_orig;

This works fine with the English versions:

Australia
Austria
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
Vietnam

...but when the names contain UTF-8 characters such as í, é, ó, etc., it doesn't work. The accented entries end up at the bottom instead of in alphabetical order:

Australia
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
áVietnam
íAustria

How do you achieve this? I found Sort::Naturally::XS, but couldn't get it to work.

Andrew Newby
  • 2
    `cmp` doesn't know anything about character sets and encodings. It does straight up character (string element) by character (string element) comparisons. (Except possibly under `use locale;`, which you shouldn't use.) – ikegami Oct 07 '17 at 07:10

1 Answer


The Unicode::Collate module should help with this.

A simple example that sorts your last list, read from a file:

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

open my $fh, '<', "country_list.txt" or die "Can't open country_list.txt: $!";
my @list = <$fh>;
chomp @list;

my $uc  = Unicode::Collate->new();
my @sorted = $uc->sort(@list);

say for @sorted;

However, in some languages non-ASCII characters have a very particular accepted placement in the alphabet, and the question doesn't provide any details. In that case Unicode::Collate::Locale may help.
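
To illustrate the difference a locale can make, here is a minimal sketch. The word list is made up for the example and Swedish ('sv') is chosen only as an illustration, since the question doesn't say which language is needed.

```perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use feature 'say';

use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample data, not from the question
my @words = ("zebra", "äpple", "apple");

# Default collation: 'ä' is treated as a variant of 'a'
my @default = Unicode::Collate->new->sort(@words);

# Swedish tailoring: 'ä' is a letter of its own, placed after 'z'
my @swedish = Unicode::Collate::Locale->new(locale => 'sv')->sort(@words);

say "default: @default";
say "swedish: @swedish";
```

The object returned by Unicode::Collate::Locale->new offers the same sort and cmp methods as a plain Unicode::Collate object, so it can be swapped in without changing the rest of the code.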

See (and study) this perl.com article, this post (T. Christiansen), and this Effective Perler article.


If the data to be sorted is in a complex data structure, use the cmp method for individual comparisons inside a sort block

my @sorted = sort { $uc->cmp($a, $b) } @list;

where in place of $a and $b you'd extract whatever needs to be compared from the complex data structure.
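
Applied to an array of hashrefs like the one in the question, that could look like the following sketch. The country hash is shortened to four entries for brevity; the test accents from the question are kept.

```perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use feature 'say';

use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Shortened copy of the question's data
my $countries = {
    AT => "íAustria",
    AU => "Australia",
    BE => "Belgium",
    VN => "áVietnam",
};

# Build the array of { name, code } hashrefs;
# the leading + tells Perl that the inner braces are a hash constructor
my @country_loop_orig =
    map { +{ name => $countries->{$_}, code => $_ } } keys %$countries;

# Compare the "name" fields with the collator instead of the built-in cmp
my $uc = Unicode::Collate->new;
my @country_loop =
    sort { $uc->cmp($a->{name}, $b->{name}) } @country_loop_orig;

say "$_->{code}  $_->{name}" for @country_loop;
```

With the default collation the accent only acts as a tie-breaker, so "áVietnam" sorts among the names starting with "a" and "íAustria" among those starting with "i", rather than both landing after "USA".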

zdim
  • Awesome, thanks. How would you go about sorting a hash inside the array? For example I'm doing `$a->{name} cmp $b->{name}`? – Andrew Newby Oct 07 '17 at 06:54
  • BTW, it works perfectly when I sort just an array of the names (without having it as a hash structure). I guess I could re-work how I do the data storage, but I'll wait to see if there is a better way before I spend ages doing that :) – Andrew Newby Oct 07 '17 at 07:02
  • 1
    @AndrewNewby Use `cmp` method, `@s = sort { $uc->cmp($a, $b) } @list;`, for individual comparisons – zdim Oct 07 '17 at 07:07
  • you legend! Works like a charm: `my $uc = Unicode::Collate->new(); my @country_loop = sort { $uc->cmp($a->{name}, $b->{name}) } @country_loop_orig;` – Andrew Newby Oct 07 '17 at 07:10
  • 2
    @AndrewNewby Cool :) Just added a note for it – zdim Oct 07 '17 at 07:12