Perl: comparing two accented strings with different encodings (one read from a UTF-8 file)

Question

I have been fighting with this problem for more than a day and have tried a lot of Google searches without any result. :(

I have the following code, which reads a UTF-8 encoded text file containing a list of names; my Perl script should stop when it finds a specific name. The names are French and often contain accents. That is when the script starts behaving unexpectedly:

So here is the code:

#!/usr/bin/perl
$ErrorWordFile = "./myFile.txt";
open FILEcorpus, $ErrorWordFile or die $!;

 while (<FILEcorpus>) 
 {
    chomp;
    $_=~  s/\r|\n//g;
    $normWord=$_;       
        $string="stéphane";

        if( $normWord eq  $string )
        {
          print"\nYES!! does work";

        }
        else
        {
          print"\nNO does NOT work";
        }
}

close(FILEcorpus)

The corpus file (./myFile.txt) contains only "stéphane\n".

The problem obviously comes from the UTF-8 encoding of the file and the accents, but apparently it is not that easy to fix. I tried a lot of things, including

use utf8

or

utf8::decode($normWord);

without any success. :(

Any ideas?

Many thanks for your precious help!

Simon

Please read http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129 — innaM, Jul 11 '13 at 16:58

score 3 · Answer 1 · answered Jul 11 '13 at 17:03

Try this.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;  # This is needed because of the literal "stéphane" in the below code

my $ErrorWordFile = "./myFile.txt";
open my $FILEcorpus, '<:utf8', $ErrorWordFile or die $!;

while ( my $normWord = <$FILEcorpus> ) {
    chomp $normWord;
    $normWord =~ s/\r|\n//g;
    my $string = "stéphane";

    if ( $normWord eq $string ) {
        print "YES!! does work\n";
    }
    else {
        print "NO does NOT work\n";
    }
}

close $FILEcorpus;

You need to tell Perl that the file you are reading from is UTF-8 (the '<:utf8' layer in open) and that the string literal you are comparing it to is UTF-8 (use utf8).
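To illustrate why both declarations matter, here is a minimal sketch of the mismatch. It is not part of the original answer: the variable names are made up for the example, and the accented letter is written with escapes so the script behaves the same whatever encoding it is saved in. It uses `decode` from the core Encode module to mimic what the '<:utf8' layer does while reading.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "stéphane": what a read returns when the file
# is opened without any decoding layer (é is the two bytes C3 A9).
my $bytes = "st\xC3\xA9phane";

# The same word as a character string: what the literal becomes
# under `use utf8` (é is the single code point U+00E9).
my $chars = "st\x{E9}phane";

# Decoding the bytes by hand mimics the '<:utf8' layer.
my $decoded = decode('UTF-8', $bytes);

printf "bytes:   length %d, equal to chars: %s\n",
    length($bytes), ($bytes eq $chars ? 'yes' : 'no');
printf "decoded: length %d, equal to chars: %s\n",
    length($decoded), ($decoded eq $chars ? 'yes' : 'no');
```

The undecoded string is one element longer and compares unequal; once decoded, the two strings match. This only covers the case where one side is decoded and the other is not, which is the situation the fix above addresses.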

answered Jul 11 '13 at 17:03

innaM


He may also need to normalize the two strings; `Unicode::Normalize` would help. – tjd Jul 11 '13 at 17:06

I also suspect `use utf8` will solve your problem. If further debugging is needed, look at the individual code points of your strings -- compare `@cp1 = map { ord } split //,$string` with `@cp2 = map { ord } split //, $normWord` ... – mob Jul 11 '13 at 17:06
Actually I suspect that the script is written in latin1. Otherwise, if both script and data file were utf8, then the original script should work, accidentally. So the only required change is probably `binmode FILEcorpus, ':utf8';`. – Slaven Rezic Jul 11 '13 at 20:01
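The code-point comparison suggested in the comments above could look roughly like this. It is only a sketch: `code_points` is a hypothetical helper name, and the two sample strings are assumptions standing in for a decoded literal and an undecoded line from the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: render each element of a string as U+XXXX.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $s;
}

# Assumed sample data: a decoded character string ...
my $string   = "st\x{E9}phane";
# ... and the same word as undecoded UTF-8 bytes.
my $normWord = "st\xC3\xA9phane";

print 'string:   ', code_points($string),   "\n";
print 'normWord: ', code_points($normWord), "\n";
```

If the two lines of output differ, the position where they diverge shows whether the cause is missing decoding (two values in the 0x80 to 0xFF range where one was expected) or a different normalization form (a base letter followed by a combining mark).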

tjd · Answer 2 · 2013-07-11T17:16:46.777

3

You're currently trying to compare two byte strings that may not be normalized.

1: use utf8 changes the string literal in your program from a byte string to a Unicode string.

2: Open the file as Unicode with <:utf8, so that the input is understood (decoded) as Unicode.

3: Use Unicode::Normalize to convert both strings to the same normalization form.
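Step 3 matters because the same accented letter can be stored in two ways. The sketch below is an illustration under assumed sample data, not the asker's actual file contents: it compares a precomposed "é" with an "e" followed by a combining acute accent, first directly and then after `NFC` from the core Unicode::Normalize module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Assumed sample data: the same word in two canonically equivalent forms.
my $precomposed = "st\x{E9}phane";    # é as the single code point U+00E9
my $decomposed  = "ste\x{301}phane";  # e followed by U+0301 COMBINING ACUTE ACCENT

# A plain eq compares code point by code point, so the forms differ.
print 'raw: ', ($precomposed eq $decomposed ? 'equal' : 'different'), "\n";

# After normalizing both sides to the same form, they compare equal.
print 'NFC: ', (NFC($precomposed) eq NFC($decomposed) ? 'equal' : 'different'), "\n";
print 'NFD: ', (NFD($precomposed) eq NFD($decomposed) ? 'equal' : 'different'), "\n";
```

Either form works for comparison as long as both strings go through the same one.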

edited Jul 11 '13 at 17:16

answered Jul 11 '13 at 17:10

tjd


user2573552 · Answer 3 · 2013-07-19T10:24:52.953

Many thanks for your explanation. The answer provided by tjd works fine and helped me a lot (I had been fighting with this problem for days already!).

So here is the modified code according to your comments:

#!/usr/bin/perl

use utf8; #ADDED
use Unicode::Normalize; #ADDED

$ErrorWordFile = "./myFile.txt";
open FILEcorpus,'<:utf8',$ErrorWordFile or die $!; #CHANGED

 while (<FILEcorpus>) 
 {
    chomp;
    $_=~  s/\r|\n//g;
    $normWord=$_;       
        $string="stéphane";

    $NFD_string = Unicode::Normalize::NFD($string); #ADDED: canonical decomposition
    $NFD_normWord = Unicode::Normalize::NFD($normWord); #ADDED

        if( $NFD_normWord eq  $NFD_string )
        {
          print"\nYES!! does work";

        }
        else
        {
          print"\nNO does NOT work";
        }
}

close(FILEcorpus)

so THANKS a lot!!

Sb
