I have a recursive regular expression to get text between brackets []:

    preg_match_all("#\[(([^\[\]]*|(?R))*)\]#", $string, $matches);

It works fine and I've been using it in PHP 5.6 and 7.0 without any trouble. I've upgraded my server to PHP 7.3 and it stopped working for long texts (more than 500,000 characters long).

On a long text containing brackets, the expression returns all results with PHP 5.6 and 7.0, as it should.

With PHP 7.3, it returns an empty $matches array without sending any error or warning message.

I don't know why that is. PCRE is configured the same in all my versions of PHP. The problem only occurs for long texts. I couldn't find any mention of that problem in PHP migration guides.
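
A hedged sketch of how a silent failure like this can be surfaced: `preg_match_all()` returns `false` on error, and `preg_last_error()` reports the reason. The helper name `match_all_or_fail` is hypothetical, and whether PHP 7.3 reports `PREG_JIT_STACKLIMIT_ERROR` for this particular text is an assumption based on the linked duplicate, not something confirmed in this thread.

```php
<?php
// Hypothetical helper: runs preg_match_all() and throws instead of failing silently.
function match_all_or_fail(string $pattern, string $subject): array
{
    $result = preg_match_all($pattern, $subject, $matches);
    $code   = preg_last_error();

    if ($result === false || $code !== PREG_NO_ERROR) {
        // Map the numeric PCRE error code to its constant name for readability.
        $names = [
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
            PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
            PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR', // PHP 7.0+
        ];
        throw new RuntimeException(
            'preg_match_all failed: ' . ($names[$code] ?? "error code $code")
        );
    }

    return $matches;
}
```

Calling `match_all_or_fail($regex, $string)` in place of the bare `preg_match_all()` call would turn an empty result caused by a PCRE limit into an exception that names the limit.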

Jeremie
  • Can you find a way to minimize the failing text, or provide it in full as a link? With regex, it's up to you to decide what "works" and doesn't. Without guidelines and examples about what it should and shouldn't match and where this is failing, it's pretty much guesswork here. – ggorlen Jun 14 '19 at 16:01
  • Possible duplicate of [RegEx not working for long pattern PCRE's JIT compiler stack limit - PHP7](https://stackoverflow.com/questions/34849485/regex-not-working-for-long-pattern-pcres-jit-compiler-stack-limit-php7) – meyi Jun 14 '19 at 17:01
  • I understand you may increase the JIT compiler stack limit, but you will still have an inefficient and slow regex. If you use a proper pattern, that limit might not need updating, try `"#\[([^][]*(?:(?R)[^][]*)*)]#"`. Check [this demo](https://regex101.com/r/PjkLMY/2). Or, `"#\[((?:[^][]++|(?R))*)]#"` is also good. – Wiktor Stribiżew Jun 14 '19 at 17:07
  • @Wiktor Your answer gives the best result. Your suggested regexes work for all texts, except for a very long one that still resists. Please turn your comment into an answer so I can mark it as the accepted answer. – Jeremie Jun 16 '19 at 05:27

1 Answer

You can do two things: 1) increase the JIT compiler stack limit and 2) rewrite the regex to follow the unroll-the-loop principle.

The pattern will look like

    $regex = "#\[([^][]*(?:(?R)[^][]*)*)]#";

It matches like this:

  • `\[` - an open bracket
  • `([^][]*(?:(?R)[^][]*)*)` - a capturing group matching
    • `[^][]*` - zero or more chars other than square brackets
    • `(?:(?R)[^][]*)*` - zero or more repetitions of the whole regex pattern (`(?R)`) followed by zero or more chars other than square brackets
  • `]` - a close bracket.
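
As a small illustration of the breakdown above, here is the pattern applied to a made-up sample string (the sample text is an assumption, not taken from the question):

```php
<?php
// The unrolled recursive pattern from the answer, in single quotes
// so that PHP leaves the backslash untouched.
$regex = '#\[([^][]*(?:(?R)[^][]*)*)]#';

// Made-up input with one nested pair and one flat pair of brackets.
$string = 'a [b [c] d] e [f]';

$count = preg_match_all($regex, $string, $matches);

// $matches[0] holds the full bracketed matches, $matches[1] the text inside them.
print_r($matches[0]); // [b [c] d] and [f]
print_r($matches[1]); // b [c] d   and f
```

Only top-level bracket pairs are reported: the nested `[c]` is consumed by the recursion as part of the outer match and does not appear as a separate result.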

See the regex demo.
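
For the first option, as far as I am aware PHP has no ini setting for the JIT stack size itself. The workaround given in the linked duplicate is to turn the PCRE JIT off through the `pcre.jit` ini setting. The sketch below assumes that disabling the JIT for the script is acceptable; the input is made up, and non-JIT matching may still run into `pcre.backtrack_limit` or `pcre.recursion_limit` on very large texts.

```php
<?php
// Assumption: disabling the JIT for this script is acceptable.
// Do this before the pattern is first used, so it is not JIT-compiled.
// ini_set() returns false if the setting is unavailable in this build.
$previous = ini_set('pcre.jit', '0');

$regex = '#\[([^][]*(?:(?R)[^][]*)*)]#';

// Made-up input: 200 nested bracket pairs around a single character.
$text = str_repeat('[', 200) . 'x' . str_repeat(']', 200);

$count = preg_match_all($regex, $text, $matches);
if ($count === false) {
    echo 'PCRE error code: ', preg_last_error(), PHP_EOL;
} else {
    echo 'Matches found: ', $count, PHP_EOL;
}

// Restore the earlier value if the setting was changed.
if ($previous !== false) {
    ini_set('pcre.jit', $previous);
}
```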

Wiktor Stribiżew