Unicode characters stuffing regex quantifiers in perl 5

Question

I'm new to Perl and having trouble with regex quantifiers on multibyte Unicode characters (UTF-8) in Perl 5. I expect a quantifier to count each such character once, but it counts once per byte of the character's encoding.

For example, I expect .{1} to match é and .{2} not to match, but I see the opposite:

$ echo 'begin é end' | perl -wnl -e '/begin .{1} end/s and print'

$ echo 'begin é end' | perl -wnl -e '/begin .{2} end/s and print'
begin é end

This is clearly because "é" is a multibyte character: when I replace it with a plain "e", I get what I expect:

$ echo 'begin e end' | perl -wnl -e '/begin .{1} end/s and print'
begin e end

$ echo 'begin e end' | perl -wnl -e '/begin .{2} end/s and print'

Using a character set modifier (/d, /u, /a or /l) does not change anything.

With other PCRE-based regex tools it works as I expect:

regex101: https://regex101.com/r/a1Lb9g/1/
PHP 7 (with the u modifier to enable Unicode support):

$ echo 'begin é end' | php7 -r 'var_dump(preg_match("/begin .{1} end/su", file_get_contents("php://stdin")));'
Command line code:1:
int(1)

My TTY uses the UTF-8 charset, and "é" is encoded as the two bytes c3 a9:

$ echo 'begin é end' | xxd
00000000: 6265 6769 6e20 c3a9 2065 6e64 0a         begin .. end.

$ echo 'begin é end' | base64
YmVnaW4gw6kgZW5kCg==

I have tested on several operating systems and Perl versions and see the same behavior everywhere:

This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-msys-thread-multi-64int   (Windows 7)
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-msys-thread-multi       (Windows 10)
This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi  (Ubuntu 16.04)

How can I make Perl regex quantifiers count each Unicode character as one?

Check [How do I match only fully-composed characters in a Unicode string in Perl?](https://stackoverflow.com/questions/203605/) — Wiktor Stribiżew, Sep 25 '20 at 11:37

score 1 · Accepted Answer · answered Sep 25 '20 at 11:39


You need to tell Perl that the input is encoded in UTF-8. That's done by -CI. Add O to encode the output, too:

$ echo 'begin é end' | perl -CIO -wnl -e '/begin .{1} end/s and print'
begin é end


choroba
