I'm new to Perl and having trouble with regex quantifiers on multibyte Unicode characters (UTF-8) in Perl 5. I expect a quantifier to count each such character once, but it counts every byte that makes up the character.

For example, I expect .{1} to match é and .{2} not to match, but this is what I see:

$ echo 'begin é end' | perl -wnl -e '/begin .{1} end/s and print'

$ echo 'begin é end' | perl -wnl -e '/begin .{2} end/s and print'
begin é end

This is clearly because "é" is a multibyte character: when I replace it with a plain "e", I get what I expect:

$ echo 'begin e end' | perl -wnl -e '/begin .{1} end/s and print'
begin e end

$ echo 'begin e end' | perl -wnl -e '/begin .{2} end/s and print'

Using a character set modifier (/d, /u, /a or /l) does not change anything.

When I use another regex tool (PCRE, via PHP), it works:

$ echo 'begin é end' | php7 -r 'var_dump(preg_match("/begin .{1} end/su", file_get_contents("php://stdin")));'
Command line code:1:
int(1)

My TTY uses the UTF-8 charset, and "é" is encoded as the two bytes c3 a9:

$ echo 'begin é end' | xxd
00000000: 6265 6769 6e20 c3a9 2065 6e64 0a         begin .. end.

$ echo 'begin é end' | base64
YmVnaW4gw6kgZW5kCg==

I have tested on several operating systems and Perl versions and I see the same behavior everywhere:

This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-msys-thread-multi-64int   (Windows 7)
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-msys-thread-multi       (Windows 10)
This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi  (Ubuntu 16.04)

How can I make Perl regex quantifiers count each Unicode character as one?

Vince

1 Answer

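You need to tell Perl that the input is encoded in UTF-8. Without that, it treats the input as a sequence of bytes, so the two bytes of "é" count as two characters. The -CI switch makes Perl decode STDIN as UTF-8. Add O to encode STDOUT as UTF-8, too: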

$ echo 'begin é end' | perl -CIO -wnl -e '/begin .{1} end/s and print'
begin é end
choroba