
Why doesn't "\w" match Unicode word characters (for example, "ğ,İ,ş,ç,ö,ü") in a Perl regular expression?

I tried to match these characters with the regular expression m{\w+}g, but it does not match "ğ,İ,ş,ç,ö,ü".

How can I make this work?

use strict;
use warnings;
use v5.12;
use utf8;

open(MYINPUTFILE, "< $ARGV[0]");

my @strings;
my $delimiter;
my $extensions;
my $id;

while(<MYINPUTFILE>)
{
    my($line) = $_;
    chomp($line);
    print $line."\n";
    unshift(@strings,$line =~ /\w+/g);
    $delimiter = /[._\s]/;
    $extensions = /pdf$|doc$|docx$/;
    $id = /^200|^201/;
}

foreach(@strings){
    print $_."\n";
}

The input file is like:

Çidem_Şener
Hüsnü Tağlip
...

The output goes like:

H�

sn�

Ta�

lip

�

idem_�

ener

In the code, I try to read the file and split each line into strings that are stored in the array. (The delimiter can be _, ., or \s.)

Peter Mortensen
erogol
  • Related but *not* duplicates: http://stackoverflow.com/questions/5555613/does-w-match-all-alphanumeric-characters-defined-in-the-unicode-standard , http://stackoverflow.com/questions/1796573/regex-word-breaker-in-unicode –  Mar 15 '12 at 17:38
  • You need to say something like `open MYINPUTFILE, '<:encoding(UTF-8)', $ARGV[0] ...`. Otherwise your input is raw (octets) and not interpreted as you expect. – mob Mar 15 '12 at 18:04
  • Then it gives this error, and it does not recognize the characters "ğü...": Unrecognized character \xC3; marked by <-- HERE after Muft<-- HERE near column 5 at C:/Users/erogol/Documents/Aptana Studio 3 Workspace/Automata/file.txt line 1. – erogol Mar 15 '12 at 18:13
  • Your bug is simply that you are reading the file in binary mode instead of as UTF-8 text. Add this to your code: `use open qw(:std :utf8); use warnings qw(FATAL utf8);`, and things should work a lot better. See [The Perl Unicode Cookbook](http://training.perl.com/scripts/perlunicook.html). – tchrist Mar 15 '12 at 19:52

3 Answers


Make sure that Perl is treating the data as UTF-8.

For example, if the data is embedded in the script itself:

#!/usr/bin/perl

use strict;
use warnings; 
use v5.12;
use utf8;   # States that the Perl program itself is saved using utf8 encoding

say "matched" if "ğİşçöü" =~ /^\w+$/;

That outputs matched. If I remove the use utf8; line, it does not.
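The same effect can be shown for data that does not come from the script source. The following is a minimal sketch, not the asker's code: it assumes the input arrives as UTF-8 octets (written here as hex escapes, so the example does not depend on how the file is saved) and compares \w+ on the undecoded octets with \w+ on the decoded characters. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets for "Hüsnü": each "ü" is the two octets 0xC3 0xBC.
my $octets = "H\xC3\xBCsn\xC3\xBC";

# Undecoded: every octet is treated as a separate character, so the
# two-octet sequences break the word into several pieces.
my @raw_words = $octets =~ /\w+/g;

# Decoded: the octets become the five characters H, ü, s, n, ü.
my $chars = decode('UTF-8', $octets);
my @decoded_words = $chars =~ /\w+/g;

printf "undecoded: %d pieces, decoded: %d piece(s)\n",
    scalar(@raw_words), scalar(@decoded_words);
```

The undecoded string is split into more than one piece, much like the broken output in the question, while the decoded string matches as a single word.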

Quentin
  • But my input is not exactly "ğİşçöü". These characters might appear in a user name together with all the other characters. – erogol Mar 15 '12 at 17:42
  • I used that in the answer for the sake of example. The point is that `\w` will match the characters you want if the string's internal "is UTF8" flag is on. – Quentin Mar 15 '12 at 17:43

\w matches any of ğ İ ş ç ö ü just fine.

'ğİşçöü' =~ /\A \w+ \z/msx;     # true

You probably forgot to decode the input from octets into Perl characters. I suspect your regex examines the data at the byte level instead of the character level, which is what one would expect it to do.

Read http://p3rl.org/UNI and http://training.perl.com/scripts/perlunicook.html to learn about the topic of encoding in Perl.


Edit:

The problem is likely here (I cannot tell for sure without the content of the file):

open(MYINPUTFILE, "< $ARGV[0]");

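Find out the encoding of the file; perhaps it is UTF-8 or Windows-1254. Then rewrite the open call, e.g.:

# Use the one line that matches the file's actual encoding.
open my $in, '<:encoding(UTF-8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
# open my $in, '<:encoding(Windows-1254)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";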

Printing characters to STDOUT (near the end of your program) is broken in the same way, because the output is never encoded. ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding shows one way to do it properly.
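Putting both points together, here is a minimal sketch of how the reading loop from the question might look with a decoding layer on input and an encoding layer on STDOUT. It assumes the file is UTF-8. The helper name read_words and the temporary sample file are made up so that the example runs on its own; in the real script the path would come from $ARGV[0].

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in the given encoding and return
# every run of word characters found in it.
sub read_words {
    my ($path, $encoding) = @_;
    open my $in, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line =~ /\w+/g;
    }
    close $in;
    return @words;
}

# Write a small UTF-8 sample file containing the two lines from the question.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} "\x{C7}idem_\x{15E}ener\nH\x{FC}sn\x{FC} Ta\x{11F}lip\n";
close $fh;

my @words = read_words($path, 'UTF-8');

# Encode on the way out so the terminal receives UTF-8 octets.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @words;
```

Note that \w also matches the underscore, so "Çidem_Şener" stays one token here. To treat _ as a delimiter, as the question intends, something like split /[._\s]+/ on the decoded line would be needed instead.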

daxim

Unicode can be a challenge, and Perl has its own peculiarities. Basically, Perl puts up a firewall around all avenues of input/output with regard to Unicode. You have to tell Perl whether the I/O path has an encoding. If it does, the rule is: DECODE any input and ENCODE any output.

Decoding in converts the data from the given encoding to the internal representation Perl uses, which you can treat as a sequence of Unicode code points (characters) rather than bytes.

Encoding out does just the opposite.

So, it is actually possible to "decode in" and "encode out" using two different encodings. You just have to tell Perl which encoding applies on each side. The encoding/decoding is usually done via the file I/O layer, but you can use the Encode module (part of the distribution) to manually convert back and forth between encodings.
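As a small illustration of using two different encodings, the sketch below decodes Windows-1254 octets and encodes the result as UTF-8 with the Encode module. The input octets are an assumption chosen for the example (0xDE is Ş in Windows-1254), and the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Şener" as Windows-1254 octets: 0xDE is Ş in that code page.
my $cp1254_octets = "\xDEener";

# Decode in: external octets -> Perl characters.
my $chars = decode('cp1254', $cp1254_octets);

# Encode out: Perl characters -> UTF-8 octets (Ş becomes 0xC5 0x9E).
my $utf8_octets = encode('UTF-8', $chars);

printf "characters: %d, cp1254 octets: %d, UTF-8 octets: %d\n",
    length($chars), length($cp1254_octets), length($utf8_octets);
```

The character count stays the same on both sides; only the number of octets differs, because Ş takes one octet in Windows-1254 and two in UTF-8.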

The perldocs on Unicode are not a light read, though.

Here is a sample that might help visualize it (there are many other ways too).

use strict;
use warnings;
use Encode;


# This is an internal (character) string built from these Unicode code points
# ----------------------------------------------
my $internal_string_1 = "\x{C7}\x{69}\x{64}\x{65}\x{6D}\x{5F}\x{15E}\x{65}\x{6E}\x{65}\x{72}\x{20}\x{48}\x{FC}\x{73}\x{6E}\x{FC}\x{20}\x{54}\x{61}\x{11F}\x{6C}\x{69}\x{70}";


# Open a temp file for writing as UTF-8.
# Output to this file will be automatically encoded from Perl internal to UTF-8 octets.
# Write the internal string.
# Check the file with a UTF-8 editor.
# ----------------------------------------------
open (my $out, '>:utf8', 'temp.txt') or die "can't open temp.txt for writing $!";
print $out $internal_string_1;
close $out;


# Open the temp file for reading as UTF-8.
# All input from this file will be automatically decoded as UTF-8 octets to Perl internal.
# Read/decode to a different internal string.
# ----------------------------------------------
open (my $in, '<:utf8', 'temp.txt') or die "can't open temp.txt for reading $!";
$/ = undef;
my $internal_string_2 = <$in>;
close $in;


# Change the binmode of STDOUT to UTF-8.
# Output to STDOUT will now be automatically encoded from Perl internal to UTF-8 octets.
# Capture STDOUT to a file then check with a UTF-8 editor.
# ----------------------------------------------
binmode STDOUT, ':utf8';
print $internal_string_2, "\n\n";


# Use encode() to convert an internal string to UTF-8 octets
# Format the UTF-8 octets to hex values
# Print to STDOUT
# ----------------------------------------------
my $octets = encode ("utf8", $internal_string_2);
print "Encoded (out) string -> UTF-8 (octets):\n";
print "   length  =  ".length($octets)."\n";
print "   octets  =  $octets\n";
print "   HEX val =  ";
for (split //, $octets) {
    printf ("0x%X ", ord($_));
}
print "\n\n";


# Use decode() to convert external UTF-8 octets to an internal string.
# Format the internal string to codepoints (hex values).
# Print to STDOUT.
# ----------------------------------------------
my $internal_string_3 = decode ("utf8", $octets);
print "Decoded (in) string <- UTF-8 (octets):\n";
print "   length      =  ".length($internal_string_3)."\n";
print "   string      =  $internal_string_3\n";
print "   code points =  ";
for (split //, $internal_string_3) {
    printf ("\\x{%X} ", ord($_));
}

Output

Çidem_Şener Hüsnü Tağlip

Encoded (out) string -> UTF-8 (octets):
   length  =  29
   octets  =  Ãidem_Åener Hüsnü TaÄlip
   HEX val =  0xC3 0x87 0x69 0x64 0x65 0x6D 0x5F 0xC5 0x9E 0x65 0x6E 0x65 0x72 0x20 0x48 0xC3 0xBC 0x73 0x6E 0xC3 0xBC 0x20 0x54 0x61 0xC4 0x9F 0x6C 0x69 0x70

Decoded (in) string <- UTF-8 (octets):
   length      =  24
   string      =  Çidem_Şener Hüsnü Tağlip
   code points =  \x{C7} \x{69} \x{64} \x{65} \x{6D} \x{5F} \x{15E} \x{65} \x{6E} \x{65} \x{72} \x{20} \x{48} \x{FC} \x{73} \x{6E} \x{FC} \x{20} \x{54} \x{61} \x{11F} \x{6C} \x{69} \x{70}
Peter Mortensen