I'm new to Perl and am having trouble with regex quantifiers on multibyte Unicode characters (UTF-8) in Perl 5. I expect a quantifier to count such a character once, but it counts each of the bytes that make up the character.
For example, I expect .{1} to match é and .{2} not to match, but this is what I see:
$ echo 'begin é end' | perl -wnl -e '/begin .{1} end/s and print'
$ echo 'begin é end' | perl -wnl -e '/begin .{2} end/s and print'
begin é end
It is clearly due to "é" being a multibyte character, because when I replace it with a plain "e" I get what I expect:
$ echo 'begin e end' | perl -wnl -e '/begin .{1} end/s and print'
begin e end
$ echo 'begin e end' | perl -wnl -e '/begin .{2} end/s and print'
Using a character set modifier (/d, /u, /a or /l) does not change anything.
When I use other PCRE-style regex tools, it works as expected:
- regex101 : https://regex101.com/r/a1Lb9g/1/
- PHP 7 (with the u modifier to enable Unicode support):
$ echo 'begin é end' | php7 -r 'var_dump(preg_match("/begin .{1} end/su", file_get_contents("php://stdin")));'
Command line code:1:
int(1)
My TTY uses the UTF-8 charset, so "é" is encoded as the two bytes c3a9:
$ echo 'begin é end' | xxd
00000000: 6265 6769 6e20 c3a9 2065 6e64 0a begin .. end.
$ echo 'begin é end' | base64
YmVnaW4gw6kgZW5kCg==
I have tested on several operating systems and Perl versions, and I see the same behavior everywhere:
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-msys-thread-multi-64int (Windows 7)
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-msys-thread-multi (Windows 10)
This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi (Ubuntu 16.04)
How can I make Perl regex quantifiers count a multibyte Unicode character as a single character?