How does Perl multiple-line regex matching interact with Unicode character properties?

Question

I am processing a multiple-line string, with Unix (\n) line breaks.

Some of its lines have the form "A, a" (i.e. upper-case letter, comma, space, lower-case letter), and I want to delete those from the string.

I can accomplish this with a regex replacement, but there is a mystery that I don't understand:

A regex that uses "[A-Z]" and "[a-z]" works in both normal mode and multiple-line mode.

A regex that uses "\p{Lu}" and "\p{Ll}" works, but only in normal mode, NOT in multiple-line mode.

EACH OF THESE SUCCEEDS:

$all =~ s/\n\K *[A-Z], [a-z]\n//g;    # 1

$all =~ s/^ *[A-Z], [a-z]\n//mg;      # 2

$all =~ s/\n\K *\p{Lu}, \p{Ll}\n//g;  # 3

BUT THIS FAILS:

$all =~ s/^ *\p{Lu}, \p{Ll}\n//mg;    # 4

I expected the /m switch to change the meaning of "^" in the regex, but nothing else. So, I expected statement 4 to work, just like statements 1, 2, and 3. Statement 2 seems to show that the multiple-line syntax is OK, and Statement 3 seems to show that the Unicode character properties match as expected, so, when I combine these, I expect statement 4 to work.

I have looked at Tom Christiansen's answer to "Why does modern Perl avoid UTF-8 by default?", but I don't see anything there about multiple-line regex matching, nor have I found an answer elsewhere.

Give an example where the outcome of #2 and #4 is different. My test, `$all = "foo\n A, x\nmeow";`, has the same outcome for both. — ikegami, Aug 02 '12 at 21:57

ikegami · Answer 1 · 2012-08-03T00:36:50.087

I cannot replicate your problem.

$ perl -wle'
   $all = "foo\n  A, x\nmeow";
   $all =~ s/^ *[A-Z], [a-z]\n//mg;
   print $all;
'
foo
meow

$ perl -wle'
   $all = "foo\n  A, x\nmeow";
   $all =~ s/^ *\p{Lu}, \p{Ll}\n//mg;
   print $all;
'
foo
meow

Tested with 5.8.8, 5.10.1, 5.12.4 (threaded) and 5.16.0 on Linux.

Best guess: pos($all) isn't zero. Perhaps you did something silly like if ($all =~ /.../g).
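
A minimal sketch of how a leftover pos can arise, assuming the same sample string as in the tests above. It only illustrates scalar-context //g matches. It does not show that s///g is affected, which this thread does not establish. The variable names are illustrative.

```perl
use strict;
use warnings;

my $all = "foo\nA, x\nmeow";

# A successful //g match in scalar context records where it ended.
my $first           = ($all =~ /foo/g) ? 1 : 0;
my $pos_after_first = pos($all);               # offset just past "foo"

# The next scalar //g match resumes from that offset, so a pattern that can
# only match at the very start of the string now fails (and clears pos).
my $second         = ($all =~ /\Afoo/g) ? 1 : 0;
my $pos_after_fail = pos($all);

# Resetting pos explicitly makes the anchored match succeed again.
$all =~ /foo/g;
pos($all) = 0;
my $third = ($all =~ /\Afoo/g) ? 1 : 0;

printf "first=%d pos=%s\n", $first, $pos_after_first;
printf "second=%d pos=%s\n", $second,
    defined $pos_after_fail ? $pos_after_fail : 'undef';
printf "third=%d\n", $third;
```

If pos were the culprit, printing pos($all) just before the substitution would show a defined value.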

At first, I couldn't reproduce it with the spaces removed either.

$ perl -wle'
   $all = "foo\nA, x\nmeow";
   $all =~ s/^ *[A-Z], [a-z]\n//mg;
   print $all;
'
foo
meow

$ perl -wle'
   $all = "foo\n  A, x\nmeow";
   $all =~ s/^ *\p{Lu}, \p{Ll}\n//mg;
   print $all;
'
foo
meow

Tested with 5.10.1 (threaded) on cygwin.

>perl -wle"$all = qq{foo\nA, x\nmeow}; $all =~ s/^ *[A-Z], [a-z]\n//mg; print $all;"
foo
meow

>perl -wle"$all = qq{foo\nA, x\nmeow}; $all =~ s/^ *\p{Lu}, \p{Ll}\n//mg; print $all;"
foo
meow

Tested with 5.14.0 (threaded) and 5.14.2 (threaded) on Windows (ActivePerl).

BUT, AHA!!!!

>perl -wle"$all = qq{foo\nA, x\nmeow}; $all =~ s/^ *[A-Z], [a-z]\n//mg; print $all;"
foo
meow

>perl -wle"$all = qq{foo\nA, x\nmeow}; $all =~ s/^ *\p{Lu}, \p{Ll}\n//mg; print $all;"
foo
A, x
meow

Tested with 5.10.1 (threaded), 5.12.1 (threaded) and 5.12.4 (threaded) on Windows (ActivePerl).

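There seems to be a bug in older versions of Perl, and it appears to have been fixed in 5.14. The bug appears to be in the optimiser (as seen with -Mre=debug), so it can be bypassed by "disabling" the optimiser for that part of the pattern, for example by adding a redundant {1} quantifier to \p{Lu}, as in the second command below.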

>perl -wle"$all = qq{foo\nA, x\nmeow}; $all =~ s/^ *\p{Lu}, \p{Ll}\n//mg; print $all;"
foo
A, x
meow

>perl -wle"$all = qq{foo\nA, x\nmeow}; $all =~ s/^ *\p{Lu}{1}, \p{Ll}\n//mg; print $all;"
foo
meow
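
For checking a particular Perl installation, the four statements from the question and the {1} workaround can be run side by side. This is a sketch assuming the sample string used in the tests above. The labels and the @variants structure are illustrative and not part of the original tests. Note that \K requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Sample input from the tests above: the middle line should be deleted.
my $input = "foo\nA, x\nmeow";

# Each entry pairs a label with a sub that applies one substitution to a copy.
my @variants = (
    [ '1: [A-Z] with \n\K'      => sub { my $s = shift; $s =~ s/\n\K *[A-Z], [a-z]\n//g;    $s } ],
    [ '2: [A-Z] with ^ and /m'  => sub { my $s = shift; $s =~ s/^ *[A-Z], [a-z]\n//mg;      $s } ],
    [ '3: \p{Lu} with \n\K'     => sub { my $s = shift; $s =~ s/\n\K *\p{Lu}, \p{Ll}\n//g;  $s } ],
    [ '4: \p{Lu} with ^ and /m' => sub { my $s = shift; $s =~ s/^ *\p{Lu}, \p{Ll}\n//mg;    $s } ],
    [ '4 with {1} workaround'   => sub { my $s = shift; $s =~ s/^ *\p{Lu}{1}, \p{Ll}\n//mg; $s } ],
);

my %result;
for my $variant (@variants) {
    my ($label, $code) = @$variant;
    $result{$label} = $code->($input);
    my $status = $result{$label} eq "foo\nmeow" ? 'removed' : 'NOT removed';
    print "${label}: line $status\n";
}
```

Going by the tests above, every variant should report the line as removed on 5.14 and later, while variant 4 may report "NOT removed" on an affected 5.10 or 5.12 build. Running a one-liner under perl -Mre=debug shows the compiled pattern and what the optimiser did with it.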

Thank you for your rapid comment. I have replicated your result (with 5.12.4). However, if I remove the 2 spaces to the left of "A, x" in both scripts, then they behave differently: The first succeeds, and the second fails. — Jonathan Pool, Aug 02 '12 at 23:25
@user1572601: I've tried all 4 regexes on `"foo\nA, x\nmeow"` and all gave the same (expected) result. (Perl 5.14.2) — MRAB, Aug 02 '12 at 23:51
@MRAB, See update. Includes a workaround for this bug apparently fixed in newer versions of Perl. — ikegami, Aug 03 '12 at 00:37
