How to match string with diacritic in perl?

Question

For example, match "Nation" in "Îñţérñåţîöñåļîžåţîöñ" without extra modules. Is this possible in newer Perl versions (5.14, 5.15, etc.)?

I found an answer! Thanks to tchrist

Right solution with a UCA match (thanks to https://stackoverflow.com/users/471272/tchrist).

# find start/end offsets of every non-overlapping occurrence of the matched substring
use 5.014;
use strict; 
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
if (@match) {
    my $found = $match[0];
    my $f_len  = length($found);
    say "match result: $found (length is $f_len)"; 
    my $offset = 0;
    while ((my $start = index($str, $found, $offset)) != -1) {
        my $end   = $start + $f_len;
        say sprintf("found at: %s,%s", $start, $end);
        $offset = $end;    # $end is exclusive, so resume the search right there
    }
}
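
The loop above looks for literal copies of the first match, so it would miss an occurrence spelled with different diacritics. A minimal sketch of an alternative, assuming the `index` method of Unicode::Collate (in list context it returns the position and length of a match); `find_all` is a hypothetical helper name, not part of the module:

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Unicode::Collate): collects [start, length]
# pairs for every non-overlapping collation match of $look in $str.
sub find_all {
    my ($collator, $str, $look) = @_;
    my @hits;
    my $pos = 0;
    while ($pos < length $str) {
        # In list context index() returns (position, length),
        # or an empty list when nothing matches at or after $pos.
        my ($start, $len) = $collator->index($str, $look, $pos) or last;
        push @hits, [$start, $len];
        $pos = $start + ($len || 1);    # always advance, even on an empty match
    }
    return @hits;
}

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);
my $str      = "Îñţérñåţîöñåļîžåţîöñ" x 2;

for my $hit (find_all($Collator, $str, "Nation")) {
    my ($start, $len) = @$hit;
    say sprintf("found at: %d,%d (%s)",
        $start, $start + $len, substr($str, $start, $len));
}
```

As in the code above, `normalization => undef` is needed because the searching methods refuse to run on a collator that normalizes its input.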

Wrong (but working) solution from http://www.perlmonks.org/?node_id=485681

The magic piece of code is:

    $str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;

Code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g; # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;

I don't know if there is any direct support, but you could canonicalize to Fully Decomposed, then strip any characters with a "joining" property (ISTR there is such a property, though not sure what it's called). — tripleee, Sep 15 '11 at 11:35
Google "perl remove all diacritics": lots of matches that look promising — Fredrik Pihl, Sep 15 '11 at 11:38
This is the wrong way to do it. You need to use a UCA match at level 1. — tchrist, Sep 15 '11 at 12:36
See also [Text::Unidecode](http://search.cpan.org/perldoc?Text::Unidecode) — ikegami, Sep 15 '11 at 17:08
@Fredrik: **Except that you can’t do it that way!** “Removing all diacritics” fails if you want/expect (for example) `smørrebrød` to be matched by `brod` or for `Óðinn` to be matched by `odin`. With a UCA level-1 match, you *can*. — tchrist, Sep 16 '11 at 01:04
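
The point of the comment above can be checked with a short sketch. It assumes the default collation table shipped with Unicode::Collate; `strip_match` and `uca_match` are hypothetical helper names. The letter "ø" has no canonical decomposition, so NFD followed by removing marks leaves it untouched, whereas a level-1 collation match compares primary weights only:

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: the "strip the marks" approach.
sub strip_match {
    my ($str, $look) = @_;
    (my $plain = NFD($str)) =~ s/\pM//g;    # decompose, then drop combining marks
    return $plain =~ /\Q$look\E/i ? 1 : 0;
}

# Hypothetical helper: UCA match at primary strength (level 1).
sub uca_match {
    my ($str, $look) = @_;
    state $collator = Unicode::Collate->new(normalization => undef, level => 1);
    # In scalar context match() returns a reference to the matched part, or undef.
    return defined $collator->match($str, $look) ? 1 : 0;
}

for my $pair (["smørrebrød", "brod"], ["Îñţérñåţîöñåļîžåţîöñ", "Nation"]) {
    my ($str, $look) = @$pair;
    say sprintf("%s / %s: strip=%d uca=%d",
        $str, $look, strip_match($str, $look), uca_match($str, $look));
}
```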

score 7 · Accepted Answer · edited May 23 '17 at 12:08

Right solution with UCA (thanks to tchrist):

# check whether $look matches anywhere in $str at UCA level 1 (diacritics and case ignored)
use 5.014;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
say "match ok!" if @match;

P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist, in "Why does modern Perl avoid UTF-8 by default?"

score 6 · Answer 2 · edited Sep 15 '11 at 22:02

What do you mean by "without extra modules"?

Here is a solution that uses Unicode::Normalize (see its perldoc page).

I removed the "ţ" and the "ļ" from your string because my Eclipse did not want to save the script with them.

use strict;
use warnings;
use utf8;   # the pragma name is lowercase; "use UTF8" does not enable it
use Unicode::Normalize;

my $str = "Îñtérñåtîöñålîžåtîöñ";

for ( $str ) {  # the variable we work on
   ##  convert to Unicode first
   ##  if your data comes in Latin-1, then uncomment:
   #$_ = Encode::decode( 'iso-8859-1', $_ );  
   $_ = NFD( $_ );   ##  decompose
   s/\pM//g;         ##  strip combining characters
   s/[^\0-\x80]//g;  ##  clear everything else
 }

if ($str =~ /nation/) {
  print $str . "\n";
}

The output is

Internationaliation

In my environment the "ž" is removed from the string; it seems not to be treated as a composed character.

The code for the for loop is from this site: How to remove diacritic marks from characters

Another interesting read is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.

Update:

As @tchrist pointed out, there is a better-suited algorithm, the UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.

The algorithm is described in Unicode Technical Standard #10, Unicode Collation Algorithm.

The Perl module is described on perldoc.perl.org.
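
To illustrate what "level 1" means: UCA compares strings at several strengths, and a Unicode::Collate object can be capped at one of them with the `level` parameter. A minimal sketch, assuming the default collation table; the sample words are illustrative only:

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# One collator per strength: 1 = base letters only,
# 2 = also accents, 3 = also case.
my %collator = map { $_ => Unicode::Collate->new(level => $_) } 1 .. 3;

for my $pair (["resume", "résumé"], ["résumé", "Résumé"]) {
    my ($x, $y) = @$pair;
    for my $level (1 .. 3) {
        my $same = $collator{$level}->eq($x, $y) ? "equal" : "different";
        say "level $level: $x vs $y: $same";
    }
}
```

A level-1 collator treats both pairs as equal, which is why it can find "Nation" inside the accented string.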

Thanks! In my environment "ž" wasn't removed and all works fine. (vim + Mac OS X + perl 5.14.0) — nordicdyno, Sep 15 '11 at 12:37
This is not the way to do this. You want a level-1 UCA match, which is at the primary strength only and therefore ignores diacritics. — tchrist, Sep 15 '11 at 12:37
@tchrist I already learned a lot from you regarding unicode from your answers and comments (thanks for that), but I think not enough, yet. To be honest I have no idea what you mean with your comment. (What is UCA standing for?) — stema, Sep 15 '11 at 14:00
@tchrist: I will be thankful to you if you could look into this, and suggest a solution or a workaround. Our friends from PerlMonks haven't solved it yet, either: http://stackoverflow.com/questions/13209474/why-is-my-perl-program-failing-with-tiefile-and-unicode-utf-8-encoding — Helen Craigman, Nov 04 '12 at 09:28
