Capitalizing strings which contain accented characters

Question

I'm trying to find a solution for capitalising names in a Perl web app (using Perl v5.10.1). I originally planned to use Lingua::EN::NameCase, but I am seeing some problems with accented characters.

I need to be able to deal with accented characters from a variety of European languages (Irish, French, German).

I have seen some indications online that Lingua::EN::NameCase should work for my use case. For example, this page on PerlMonks: http://www.perlmonks.org/?node_id=889135

Here is my test code, based on the above link:

#!/usr/bin/perl

use strict;
use warnings;
use Lingua::EN::NameCase;
use locale;
use POSIX qw(locale_h);

my $locale = 'en_FR.utf8';

setlocale( LC_CTYPE, $locale );

binmode DATA,   ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while (my $original_name = <DATA>) {
    chomp $original_name;
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
}

sub xlc {
    my $str = shift;
    $_ = lc( $str );
    return join q{} => ( map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g ) );
};

__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh

This produces the output below. Both L::EN::NC and the custom ucfirst(lc()) solution give incorrect results (note the capital letter following each accented character). This seems to be because the Perl regex engine matches a word boundary before and after each accented character. I would have expected a word boundary to match only between a space character and a non-space character.

  ÉTIENNE DE LA BOÉTIE L::EN::NC           éTienne de la BoéTie UCFIRST           ÉTienne De La BoÉTie
    ÉMILIE DU CHÂTELET L::EN::NC             éMilie du ChâTelet UCFIRST             ÉMilie Du ChÂTelet
         HÉLÈNE CIXOUS L::EN::NC                  HéLèNe Cixous UCFIRST                  HÉLÈNe Cixous
    Seán Ó Hannracháín L::EN::NC             SeáN ó HannracháíN UCFIRST             SeÁN ó HannrachÁíN
    Máire Ó hÓgartaigh L::EN::NC             MáIre ó HóGartaigh UCFIRST             MÁIre ó HÓGartaigh

Can anybody suggest a solution?

Thanks,

Brian.

See [Uppercase accented characters in Perl](http://stackoverflow.com/questions/13261522/uppercase-accented-characters-in-perl) — hwnd, Oct 16 '13 at 06:53
That link hwnd posted is interesting, but the utf8 flag *is set* on `$original_name`: everything is properly decoded. — amon, Oct 16 '13 at 10:00
Indeed. I do not have a problem with capitalisation _per se_. uc() and lc() seem to work fine on any strings I send to them. The problem is that L::EN::NC does not seem to be able to correctly identify the start of a word in order to capitalise the first letter of that word. The relevant regex from L::EN::NC is `s{ \b (\w) }{\u$1}gox ;`, which uses `\b` to identify word boundaries. For me `\b` seems to identify any change between accented char and non-accented char as a word boundary, which seems wrong to me. — Brian Foley, Oct 16 '13 at 19:39
possible duplicate of [Perl Unicode test on OS X fails on Debian](http://stackoverflow.com/questions/18827733/perl-unicode-test-on-os-x-fails-on-debian) – but I'm not quite sure. An `en_*.*`-locale simply does not consider `é` to be in `\w`. — amon, Oct 16 '13 at 20:00
@amon, thank you for the followup. The question you pointed me to seems to be the same issue, and started me thinking why certain locales would not consider `é` to be in `\w`. Long story short, changing my locale in the original example to either `en_IE` or `fr_FR` solves the original issue. Reference to perlre in the other question leads me to believe that `use feature 'unicode_strings'` might have also solved my problem (by treating accented chars as part of \w), but I am not on a new enough perl to use that feature. — Brian Foley, Oct 16 '13 at 20:25
For pointers (a lot!) on the use of UTF-8 and Unicode in Perl see http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default — Kwebble, Oct 17 '13 at 14:04
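
The `\w` behaviour discussed in these comments can be reproduced without any locale. The following is only a minimal sketch: it assumes Perl 5.14 or later for the `/a` and `/u` regex modifiers (so it will not run on the 5.10.1 from the question), and `words_ascii` and `words_unicode` are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are written in UTF-8

# Split a string into runs of \w where only ASCII letters, digits and
# underscore count as word characters (/a). This imitates a setup in
# which accented letters are not part of \w.
sub words_ascii {
    my ($s) = @_;
    return $s =~ /(\w+)/ga;
}

# The same split under Unicode rules (/u): accented letters are word
# characters, so a name is not broken apart at each accent.
sub words_unicode {
    my ($s) = @_;
    return $s =~ /(\w+)/gu;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( '|', words_ascii('BOÉTIE') ),   "\n";    # BO|TIE
print join( '|', words_unicode('BOÉTIE') ), "\n";    # BOÉTIE
```

When the accented letter falls outside `\w`, a "word" ends just before it and a new one starts just after it, which matches the stray capitals in the output from the question.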

score 1 · Answer 1 · answered Feb 18 '14 at 19:41

Perl 5.10 is old; you should upgrade if you can.

Below is a version I use for similar situations (tested on Perl 5.14.2).

#!/usr/bin/perl

use strict;
use warnings;
use utf8::all;

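while (<DATA>) {
    chomp;
    printf "%30s ==> %30s\n", $_, xlc($_);
}

# Capitalise every word, then lower-case common name particles again.
sub xlc {
    my $str = shift;
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;    # "ÉTIENNE" becomes "Étienne"
    $str =~ s/( L[ea]s?                    # Le, La, Les, Las
               | Von
               | D[aeou]s?                 # De, Da, Du, Do, Des and similar
               )\b
              /lc($1)/xge;
    return $str;
}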

__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh

just noticed we almost gave the same answer. but you're first. so here's my upvote :) — Pierre, Jun 11 '14 at 06:57

score 0 · Answer 2 · answered Oct 16 '13 at 14:42

If your data is in UTF-8, you should decode it to Perl's internal character representation:

    utf8::decode($original_name);
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
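
As a rough illustration of decoding at the input boundary, the sketch below uses the core Encode module instead of utf8::decode. `capitalise_octets` is a hypothetical helper name, and the sketch assumes the input really is undecoded UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take raw UTF-8 octets and return a character
# string in which every word starts with a capital letter.
sub capitalise_octets {
    my ($octets) = @_;

    # Turn octets into characters exactly once, at the input boundary.
    my $chars = decode( 'UTF-8', $octets );

    # On the decoded string, \w, lc and ucfirst work on whole characters.
    $chars =~ s/(\w+)/ucfirst lc $1/ge;
    return $chars;
}

binmode STDOUT, ':encoding(UTF-8)';
# "\xC3\x89" is the two-octet UTF-8 encoding of É.
print capitalise_octets("\xC3\x89TIENNE DE LA BO\xC3\x89TIE"), "\n";
```

The design choice here is to decode once when the data enters the program and to encode again only on output, so that no string is decoded twice.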

Bohdan


Thanks Bohdan. My data is indeed UTF-8: utf8::is_utf8( $original_name ) returns true. However, utf8::decode() does not give me the desired output, although it does change the output. Instead of L::EN::NC giving "éTienne de la BoéTie" as in my original example, it now gives "ÉTienne de la BoÉTie". So the capitalisation has changed, but I still have spurious capitals mid-string. – Brian Foley Oct 16 '13 at 19:32

score 0 · Answer 3 · answered Apr 15 '14 at 01:52

OK, I just got your script to work. Here's the output I got:

      ÉTIENNE DE LA BOÉTIE L::EN::NC           Étienne de la Boétie UCFIRST           Étienne De La Boétie
        ÉMILIE DU CHÂTELET L::EN::NC             Émilie du Châtelet UCFIRST             Émilie Du Châtelet
             HÉLÈNE CIXOUS L::EN::NC                  Hélène Cixous UCFIRST                  Hélène Cixous
        Seán Ó Hannracháín L::EN::NC             Seán Ó Hannracháín UCFIRST             Seán Ó Hannracháín
        Máire Ó hÓgartaigh L::EN::NC             Máire Ó Hógartaigh UCFIRST             Máire Ó Hógartaigh

I had to change two things:

I commented out the binmode calls, since they were not needed with whatever encoding Emacs used on my system. Your mileage may vary. If you get it wrong, you'll see warnings about characters that don't map to Unicode, or about wide characters.
I changed the locale. You were telling it to use an English-speaking locale in France. I'm not sure that's a valid locale. I picked a locale which actually uses accented characters.

Unfortunately, locale names are not standardized, but the following locale worked for me:

my $locale = 'fr_FR.utf-8';

In particular, it did not work without the hyphen.
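
Because locale names vary between systems, it may help to check what setlocale() returns instead of assuming the name was accepted. The sketch below is only an illustration: `first_available_locale` is a hypothetical helper, and the candidate names are assumptions that may not exist on a given machine.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: try each candidate locale name in turn and return
# the first one that setlocale() accepts, or nothing if none is accepted.
sub first_available_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        # setlocale() returns undef when the name is not a valid locale.
        return $name if defined setlocale( LC_CTYPE, $name );
    }
    return;
}

# Spellings differ between platforms, so several variants are tried.
my $locale = first_available_locale( 'fr_FR.utf-8', 'fr_FR.UTF-8', 'fr_FR.utf8' );
if ( defined $locale ) {
    print "Using LC_CTYPE locale $locale\n";
}
else {
    print "None of the candidate locales is available\n";
}
```

On many systems, `locale -a` lists the installed locale names, which can be used to pick the candidates.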

Pierre · Answer 4 · 2014-06-11T06:52:02.730

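Actually, you just need the utf8 pragma. When the __DATA__ token is in the scope of use utf8, the DATA handle is read in UTF-8 mode, so the names arrive as decoded characters and \w matches the accented letters.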

use utf8;
binmode STDOUT, ':utf8'; 

while (my $name = <DATA>) {
    $name =~ s/(\w+)/ucfirst lc $1/eg;
    print $name;
}

__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh

I get:

Étienne De La Boétie
Émilie Du Châtelet
Hélène Cixous
Seán Ó Hannracháín
Máire Ó Hógartaigh

edited Jun 11 '14 at 06:52

answered Jun 10 '14 at 22:44

