
I am processing UTF-8 encoded strings in Perl. I need a way to determine that a word starting with a letter that carries a diacritic, such as "écrit", begins with the same letter as "elephant" and "England". I need a general solution, since I will be working across several languages. The purpose is to create letter headings for an index: each of the words I just mentioned would be stored under "E".

Is there a straightforward way to do this?

egilchri

3 Answers


Text::Unidecode can help you. It translates Unicode to ASCII.

$ perl -Mutf8 -e 'use Text::Unidecode; print unidecode("écrit")'
ecrit
mob
choroba
  • Thanks. That should get me started as I work my way across my backlog of world languages. :-) – egilchri Feb 21 '13 at 18:18
  • Although on second thought, it probably will get a bit tricky as I get into languages whose "Index Characters" are not ASCII. The expanded description of what I'm doing is sorting book indexes (indices?) and then grouping the sorted words into categories. – egilchri Feb 21 '13 at 18:25
  • I don't recommend Text::Unidecode, it is a measure of last resort! We have better tools for solving this kind of problem. – daxim Feb 24 '13 at 16:51

Equality and order of strings are determined by things called collations. The tricky part is that they depend on the language and culture (the technical term is "locale"). For example, you may consider ø and o equivalent, but to a Dane they are different letters and must be ordered differently.

The Perl module for working with collations is Unicode::Collate.
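As a rough sketch of how that could look for index headings: a collator created with `level => 1` compares only base letters and ignores diacritics and case, so the first character of a word can be matched against the plain letters A to Z. The helper name `heading_for` is made up for this illustration, and restricting the headings to A through Z is an assumption that will not hold for every language.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Level 1 compares primary weights only: accents and case are ignored.
my $collator = Unicode::Collate->new(level => 1);

# Hypothetical helper: returns the A..Z heading for a word,
# or undef when the first character matches none of those letters.
sub heading_for {
    my ($word) = @_;
    my $first = substr $word, 0, 1;
    for my $letter ('A' .. 'Z') {
        return $letter if $collator->eq($first, $letter);
    }
    return;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (qw(écrit elephant England)) {
    printf "%s => %s\n", $word, heading_for($word) // '(none)';
}
```

All three words should land under E. For a language whose index letters go beyond A to Z, the candidate list in the loop would have to be replaced with that language's own headings.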

Update: You can also use Perl's built-in locale support with use locale:

use locale; 
use POSIX qw(setlocale LC_ALL);

setlocale(LC_ALL, ''); # Set default locale from environment variables

This makes builtins such as sort and cmp use the locale's rules for ordering strings. But be careful: changing the locale of a program may have unexpected consequences, like changing the decimal point to a comma in printf output.

Update 2: The POSIX locales are apparently broken in various ways. You're better off using Unicode::Collate and Unicode::Collate::Locale.
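To illustrate the Danish point with those modules, here is a small sketch that sorts the same words under two locales. The word list is invented for the example; `en` and `da` are locale names that Unicode::Collate::Locale accepts.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;

my @words = qw(zebra øl abe);

my $english = Unicode::Collate::Locale->new(locale => 'en');
my $danish  = Unicode::Collate::Locale->new(locale => 'da');

binmode STDOUT, ':encoding(UTF-8)';
print 'en: ', join(' ', $english->sort(@words)), "\n";  # en: abe øl zebra
print 'da: ', join(' ', $danish->sort(@words)), "\n";   # da: abe zebra øl
```

Under the English rules ø is filed next to o, while the Danish rules place it after z, so the heading a word belongs under depends on the locale of the index.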

Joni
  • "use locale" is broken ([1](http://stackoverflow.com/q/14942652), [2](http://stackoverflow.com/q/14863899)), I recommend against using POSIX locales. They are obsoleted by the advent of Unicode. Only use the Unicode collation modules instead. – daxim Feb 24 '13 at 16:52
  • Thanks @daxim, I haven't used POSIX locales extensively in Perl and wasn't aware of the problems. – Joni Feb 25 '13 at 17:50

I'm making the assumption that you are sorting by English collation rules and have alphabetic text. The code below is a good start, but the real world is more complicated than that. (For example, Chinese text has different lexicographic rules depending on the context, e.g. general-purpose dictionary, karaoke song lists, electronic door bell name list, …) I cannot present a perfect solution because the question had so little information.

use 5.010;
use utf8;
use Unicode::Collate::Locale 0.96;
use Unicode::Normalize qw(normalize);

my $c = Unicode::Collate::Locale->new(locale => 'en');
say for $c->sort(qw(
    eye
    egg
    estate
    etc.
    eleven
    e.g.
    England
    ensure
    educate
    each
    equipment
    elephant
    ex-
    ending
    écrit
));
say '-' x 40;
for my $word (qw(écrit Ëmëhntëhtt-Rê Ênio ècole Ēadƿeard Ėmma Ędward Ẽfini)) {
    say sprintf '%s should be stored under the heading %s',
        $word, ucfirst substr normalize('D', $word), 0, 1;
}

__END__
each
écrit
educate
e.g.
egg
elephant
eleven
ending
England
ensure
equipment
estate
etc.
ex-
eye
----------------------------------------
écrit should be stored under the heading E
Ëmëhntëhtt-Rê should be stored under the heading E
Ênio should be stored under the heading E
ècole should be stored under the heading E
Ēadƿeard should be stored under the heading E
Ėmma should be stored under the heading E
Ędward should be stored under the heading E
Ẽfini should be stored under the heading E
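
The two steps above (sorting, then deriving a heading) could be combined to build the grouped index itself. This is only a sketch under the same assumptions as before: English collation rules and alphabetic text. The function name `group_by_heading` and the word list are made up for the example.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use Unicode::Normalize qw(NFD);

my $collator = Unicode::Collate::Locale->new(locale => 'en');

# Hypothetical helper: sorts the words, then returns a list of
# [heading, [words...]] pairs in collation order.
sub group_by_heading {
    my @words = @_;
    my @groups;
    for my $word ($collator->sort(@words)) {
        # Decompose first so that the leading character is the base letter.
        my $heading = uc substr(NFD($word), 0, 1);
        if (!@groups || $groups[-1][0] ne $heading) {
            push @groups, [$heading, []];
        }
        push @{ $groups[-1][1] }, $word;
    }
    return @groups;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $group (group_by_heading(qw(apple écrit England elephant Ångström banana))) {
    my ($heading, $words) = @$group;
    print "$heading\n";
    print "  $_\n" for @$words;
}
```

Letters that have no canonical decomposition (ø, for example) keep their own heading with this approach, which may or may not be what a given language expects.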
daxim