It turns out that both of these patterns, which previously worked:

"`([\n\A;]+)\/\*(.+?)\*\/`ism" => "$1",     // error
"`([\n\A;\s]+)//(.+?)[\n\r]`ism" =>"$1\n",  // error

now throw an error in PHP 7.3:

Warning: preg_replace(): Compilation failed: escape sequence is invalid in character class offset 4

CONTEXT: consider this snippet, which removes CSS comments from a string:

$buffer = ".selector {color:#fff; } /* some comment to remove*/";
$regex = array(
"`^([\t\s]+)`ism"=>'',
"`^\/\*(.+?)\*\/`ism"=>"",
"`([\n\A;]+)\/\*(.+?)\*\/`ism"=>"$1",     // 7.3 error
"`([\n\A;\s]+)//(.+?)[\n\r]`ism"=>"$1\n", // 7.3 error
"`(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+`ism"=>"\n"
);
$buffer = preg_replace(array_keys($regex),$regex,$buffer);
//returns cleaned up $buffer value with pure css and no comments

Refer to: https://stackoverflow.com/a/1581063/1293658

Q1 - Any ideas what's wrong with the regex in this case? This thread seems to suggest it's simply a misplaced backslash: https://github.com/thujohn/twitter/issues/250

Q2 - Is this a PHP 7.3 bug or a problem with the regex in this code?

Ulrich Eckhardt
Christian Žagarskas
  • What are you trying to match with `\A`? If you check your regex with regex101.com you'll see that it doesn't even match the first character class! The regex would match with `\w\s` but I don't really know if it's this what you wanted to match! – csabinho Sep 07 '19 at 01:52
  • You might wanna place the regex in single quotes. To avoid PHP Escape sequence interpretation. – slepic Sep 07 '19 at 04:06
  • Can you please extract a [mcve]? Also, if it works with 7.2 but fails with 7.3, check the release notes. Maybe the code relies on a bug that was fixed. – Ulrich Eckhardt Sep 07 '19 at 07:02
  • @slepic, using single quotes would require additional steps in order to get newlines and carriage returns in there. In particular, just replacing double quotes with single quotes changes the string content in this case. – Ulrich Eckhardt Sep 07 '19 at 07:04
  • What if you add `(*NO_JIT)` at the start of the pattern? – Wiktor Stribiżew Sep 07 '19 at 07:21
  • All I am "really" trying to do here is take CSS (read into the buffer) and strip out all "/*CSS comments*/" leaving behind pure css. The example given here "works" it strips out all comments starting/ending with /*and*/ – Christian Žagarskas Sep 07 '19 at 20:19

1 Answer

Do not use zero-width assertions inside character classes.

  • ^, $, \A, \b, \B, \Z, \z, \G are anchors or (non-)word boundaries. They make no sense inside a character class because they do not match any character. ^ and \b mean something different there: ^ negates the class when it directly follows the opening [ and is a literal ^ anywhere else, and \b is a backspace character.

  • You can't use \R (any line break sequence) there either.
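
The points above can be checked with a few one-line matches. This is only a sketch: the variable names are made up for illustration, and the last line assumes PHP 7.3 or later (PCRE2), where preg_match() returns false when the pattern fails to compile.

```php
<?php
// Inside a class, \b is the backspace character (0x08), not a word boundary.
$backspaceInClass = preg_match('/[\b]/', "\x08");      // 1
$boundaryInClass  = preg_match('/[\b]/', 'two words'); // 0: the subject has no backspace

// A leading ^ negates the class; anywhere else it is a literal caret.
$negated = preg_match('/^[^a]$/', 'b'); // 1
$literal = preg_match('/^[a^]$/', '^'); // 1

// Assumption: PCRE2 (PHP 7.3+). \A inside a class is rejected at compile time,
// so preg_match() returns false; @ only hides the warning.
$invalid = @preg_match('/[\n\A;]+/', ';');

var_dump($backspaceInClass, $boundaryInClass, $negated, $literal, $invalid);
```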

The two patterns with \A inside a character class must be re-written as a grouping construct, (...), with an alternation operator |:

"`(\A|[\n;]+)/\*.+?\*/`s"=>"$1", 
"`(\A|[;\s]+)//.+\R`"=>"$1\n", 

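As a quick check of the two rewritten patterns, the sketch below runs them over a made-up stylesheet string ($css, $rules and $sample are assumptions for illustration, not taken from the question). Note that the block-comment pattern only fires when the comment directly follows the start of the string, a newline, or a semicolon, so a comment that follows any other character is left in place.

```php
<?php
// Hypothetical input: each comment sits right after the string start, a semicolon, or a newline.
$css = "/* header */\nbody { margin: 0;/* reset */ }\n// legacy note\np { color: #fff; }\n";

// Same patterns as above, in single quotes so PHP hands the backslashes to PCRE unchanged.
$rules = [
    '`(\A|[\n;]+)/\*.+?\*/`s' => '$1',    // block comments
    '`(\A|[;\s]+)//.+\R`'     => "\$1\n", // line comments, leaving one newline behind
];

$clean = preg_replace(array_keys($rules), array_values($rules), $css);

// A block comment preceded by another character (here a space) is not matched.
$sample    = "a { } /* stays */";
$untouched = preg_replace('`(\A|[\n;]+)/\*.+?\*/`s', '$1', $sample);

var_dump($clean);
var_dump($untouched === $sample); // bool(true)
```
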
I removed the redundant modifiers and capturing groups you are not using, and replaced [\r\n] with \R. The "`(\A|[\n;]+)/\*.+?\*/`s"=>"$1" can also be re-written in a more efficient way:

"`(\A|[\n;]+)/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`"=>"$1"

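To see that the unrolled pattern removes the same text as the lazy one, a minimal comparison on a made-up multi-line comment might look like this ($css and the other variable names are hypothetical). The unrolled form needs no s modifier because negated classes such as [^*] already match newlines.

```php
<?php
// Hypothetical multi-line comment that contains stray asterisks.
$css = "/* line one\n * line two **\n */\na { b: c; }";

$lazy     = '`(\A|[\n;]+)/\*.+?\*/`s';
$unrolled = '`(\A|[\n;]+)/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`';

$fromLazy     = preg_replace($lazy, '$1', $css);
$fromUnrolled = preg_replace($unrolled, '$1', $css);

var_dump($fromLazy === $fromUnrolled); // bool(true)
var_dump($fromUnrolled);               // the rule that followed the comment, with its leading newline
```
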
Note that in PHP 7.3, according to the "Upgrade history of the bundled PCRE library" table, the regex library is PCRE 10.32. See "PCRE to PCRE2 migration":

Until PHP 7.2, PHP used the 8.x versions of the legacy PCRE library, and from PHP 7.3, PHP will use PCRE2. Note that PCRE2 is considered to be a new library although it's based on and largely compatible with PCRE (8.x).

According to this resource, the updated library is stricter about regex patterns and now treats user errors that were formerly tolerated as real errors:

  • Modifier S is now on by default. PCRE does some extra optimization.
  • Option X is disabled by default. It makes PCRE do more syntax validation than before.
  • Unicode 10 is used, while it was Unicode 7. This means more emojis, more characters, and more sets. Unicode regex may be impacted.
  • Some invalid patterns may be impacted.

In simple words, PCRE2 is more strict in the pattern validations, so after the upgrade, some of your existing patterns could not compile anymore.
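
One way to find patterns affected by the stricter validation is to try compiling each one and look at the return value. The helper below is a hypothetical sketch (patternCompiles is not a PHP built-in); it relies only on preg_match() returning false on failure and on the PCRE_VERSION constant, a string that starts with the library version number.

```php
<?php
// Hypothetical helper: true if PCRE can compile $pattern.
// preg_match() returns false on failure, which includes compile errors;
// @ only suppresses the warning.
function patternCompiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// Rough check: the cast keeps the leading major version number of PCRE_VERSION.
$usesPcre2 = (int) PCRE_VERSION >= 10;

$old = '`([\n\A;]+)/\*(.+?)\*/`s'; // \A inside a character class
$new = '`(\A|[\n;]+)/\*.+?\*/`s';  // \A moved into an alternation

var_dump($usesPcre2);
var_dump(patternCompiles($old)); // expected to be false under PCRE2
var_dump(patternCompiles($new));
```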

Wiktor Stribiżew
  • I see. So, this regex will need to be carefully rewritten. I am not good at regular expressions, any suggestions for stripping out /*CSS comments*/ ? Other than that I would say your answer here is "technically correct" - From what I gather here "\A" (which I assume is "beginning of string") is the problem. I then am not sure how to also target "\n \r new lines" within /*CSS comments*/ IF the segment began with a "new line" – Christian Žagarskas Sep 07 '19 at 20:21
  • @ChristianŽagarskas I added the fixed patterns to the answer. – Wiktor Stribiżew Sep 07 '19 at 21:34
  • outstanding. worked perfectly. After playing with this for a few hours and studying what you have written here I can see I was quite a way off on what I thought needed to change... Thank you for this, my understanding of regular expression has increased. Cheers. (I will be ordering a copy of "Mastering Regular Expressions" based on your other linked comment.) – Christian Žagarskas Sep 08 '19 at 07:34
  • I don't see this working - see https://onlinephp.io/c/5c564 .... what am I doing wrong? – user1432181 Nov 25 '22 at 15:17