I am getting a PREG_JIT_STACKLIMIT_ERROR in preg_replace_callback() when working with a somewhat longer string. It stops working above roughly 2000 characters (that is, 2000 characters matched by the regex, not a 2000-character subject string).
I've already read that it's caused by an inefficient regex, but I can't make my regex any simpler. Here's my regex:

/\{@([a-z0-9_]+)-((%?[a-z0-9_]+(:[a-z0-9_]+)*)+)\|(((?R)|.)*)@\}/Us

It should match strings like these:

1) {@if-statement|echo this|echo otherwise@}

2) {@if-statement:sub|echo this|echo otherwise@}

3) {@if-statement%statament2:sub|echo this@}

and also nested like this:

4) {@if-statement|echo this| {@if-statement2|echo this|echo otherwise@} @}

I've tried to simplify it to:

/\{@([a-z0-9_]+)-([a-z0-9_]+)\|(((?R)|.)*)@\}/Us

But it looks like the error is caused by the (((?R)|.)*) part. Any advice?

Code for testing:

$string = '{@if-is_not_logged_homepage|
<header id="header_home">
    <div class="in">
        <div class="top">
            <h1 class="logo"><a href="/"><img src="/img/logo-home.png" alt=""></a></h1>
            <div class="login_outer_wrapper">
                <button id="login"><div class="a"><i class="stripe"><i></i></i>Log in</div></button>
                <div id="login_wrapper">
                    <form method="post" action="{^login^}" id="form_login_global">
                        <div class="form_field no_description">
                            <label>{!auth:login_email!}</label>
                            <div class="input"><input type="text" name="form[login]"></div>
                        </div>
                        <div class="form_field no_description password">
                            <label>{!auth:password!}</label>
                            <div class="input"><input type="password" name="form[password]"></div>
                        </div>
                        <div class="remember">
                            <input type="checkbox" name="remember" id="remember_me_check" checked>
                            <label for="remember_me_check"><i class="fa fa-check" aria-hidden="true"></i>Remember</label>
                        </div>
                        <div class="submit_box">
                            <button class="btn btn_check">Log in</button>
                        </div>
                    </form>
                </div>
            </div>
        </div>
        <div class="content clr">
            <div class="main_menu">
                <a href="">
                    <i class="ico a"><i class="fa fa-lightbulb-o" aria-hidden="true"></i></i>
                    <span>Idea</span>
                    <div>&nbsp;</div>
                </a>
                <a href="">
                    <i class="ico b"><i class="fa fa-user" aria-hidden="true"></i></i>
                    <span>FFa</span>
                </a>
                <a href="">
                    <i class="ico c"><i class="fa fa-briefcase" aria-hidden="true"></i></i>
                    <span>Buss</span>
                </a>
            </div>
            <div class="text_wrapper">

                <div>
                    <div class="register_wrapper">
                        <a id="main_register" class="btn register">Załóż konto</a>
                        <form method="post" action="{^login^}" id="form_register_home">
                            <div class="form_field no_description">
                                <label>{!auth:email!}</label>
                                <div class="input"><input type="text" name="form2[email]"></div>
                            </div>
                            <div class="form_field no_description password">
                                <label>{!auth:password!}</label>
                                <div class="input tooltip"><input type="password" name="form2[password]"><i class="fa fa-info-circle tooltip_open" aria-hidden="true" title="{!auth:password_format!}"></i></div>

                            </div>
                            <div class="form_field terms no_description">
                                <div class="input">
                                    <input type="checkbox" name="form2[terms]" id="terms_check">
                                    <label for="terms_check"><i class="fa fa-check" aria-hidden="true"></i>Agree</label>
                                </div>
                            </div>
                            <div class="form_field no_description">
                                <div class="input captcha_wrapper">
                                    <div class="g-recaptcha" data-sitekey="{%captcha_public_key%}"></div>
                                </div>
                            </div>
                            <div class="submit_box">
                                <button class="btn btn_check">{!auth:register_btn!}</button>
                            </div>
                        </form>
                    </div>
                </div>
            </div>
        </div>
    </div>
</header>
@}';

$if_counter = 0;

$parsed_view = preg_replace_callback( '/\{@([a-z0-9_]+)-((%?[a-z0-9_]+(:[a-z0-9_]+)*)+)\|(((?R)|.)*)@\}/Us',
        function( $match ) use( &$if_counter ){
            return '<-{'. ( $if_counter ++ ) .'}->';
        }, $string );


var_dump($parsed_view); // NULL
instead
  • Check http://stackoverflow.com/questions/34849485/regex-not-working-for-long-pattern-pcres-jit-compiler-stack-limit-php7 – Wiktor Stribiżew Sep 25 '16 at 10:43
  • http://php.net/manual/en/pcre.configuration.php – Deep Sep 25 '16 at 10:43
  • @WiktorStribiżew I've seen it already, but I'm not sure if using `ini_set('pcre.jit', false);` is the way to go... It's like using `@` when you want to hide an error. – instead Sep 25 '16 at 10:48
  • Could you provide a live demo with input that makes this error happen? – revo Sep 25 '16 at 10:52
  • I see a counter inside `preg_replace_callback`. Is it intended to count all occurrences of that pattern? If yes then nested blocks aren't taken into account. Is it right? – revo Sep 25 '16 at 11:54
  • @revo no no... it's not for this. It's just simplified code inside anonymous function. – instead Sep 25 '16 at 12:04
  • Then I believe you can make things better: [check this.](https://regex101.com/r/aL6rN0/2) – revo Sep 25 '16 at 12:11

2 Answers

8

What is PCRE JIT?

Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching. However, it comes at the cost of extra processing before the match is performed. Therefore, it is of most benefit when the same pattern is going to be matched many times.

and how does it work basically?

PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where the local data of the current node is pushed before checking its child nodes... When the compiled JIT code runs, it needs a block of memory to use as a stack. By default, it uses 32K on the machine stack. However, some large or complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.

As the first quote shows, JIT is an optional feature. It is on by default in the PCRE of PHP [v7.*], and you can easily turn it off with pcre.jit = 0 (that's not recommended, though).

However, when a preg_* function fails with error code 6 (PREG_JIT_STACKLIMIT_ERROR), it most likely means that JIT has hit its stack size limit.
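To confirm which error a failed call produced, you can check preg_last_error() whenever a preg_* function returns NULL. The following is only a minimal sketch: preg_error_name() is a hypothetical helper (not a PHP built-in), and the demo pattern and subject are made up for illustration.

```php
<?php
// Hypothetical helper: translate a preg_last_error() code into its constant name.
function preg_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? 'unknown (' . $code . ')';
}

// Made-up call; replace it with the preg_* call you want to diagnose.
$result = preg_replace('/a+/', 'x', 'aaa b');

if ($result === null) {
    // NULL means the engine gave up; the code says why.
    echo 'preg failed: ', preg_error_name(preg_last_error()), PHP_EOL;
} else {
    echo $result, PHP_EOL;
}
```

With the pattern and input from the question, the failing branch would be expected to report PREG_JIT_STACKLIMIT_ERROR on a build where JIT is enabled.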

Capturing groups consume more memory than non-capturing groups, and even more memory may be needed depending on the type of quantifier applied to a group:

  1. Capturing group OP_CBRA (pcre_jit_compile.c:#1138) - (real memory is more than this):
     case OP_CBRA:
     case OP_SCBRA:
       bracketlen = 1 + LINK_SIZE + IMM2_SIZE;
       break;
  2. Non-capturing group OP_BRA (pcre_jit_compile.c:#1134) - (real memory is more than this):
     case OP_BRA:
       bracketlen = 1 + LINK_SIZE;
       break;

Therefore, changing the capturing groups in your own RegEx to non-capturing groups makes it produce the proper output (I don't know exactly how much memory that saves).
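As a rough sketch of that change, here is the question's pattern with the groups that only serve for grouping rewritten as (?:...). Which captures the callback really needs is an assumption: this version keeps group 1 (keyword), group 2 (condition list) and group 3 (body). The short nested input is taken from the question; whether the same change is enough for the long input depends on the JIT stack size of your build.

```php
<?php
// Same pattern as in the question, but inner groups are non-capturing.
// Assumed captures: 1 = keyword, 2 = condition list, 3 = body.
$pattern = '/\{@([a-z0-9_]+)-((?:%?[a-z0-9_]+(?::[a-z0-9_]+)*)+)\|((?:(?R)|.)*)@\}/Us';

$if_counter = 0;
$input = '{@if-statement|echo this| {@if-statement2|echo this|echo otherwise@} @}';

$output = preg_replace_callback(
    $pattern,
    function ($match) use (&$if_counter) {
        // Same placeholder replacement as in the question's test code.
        return '<-{' . ($if_counter++) . '}->';
    },
    $input
);

var_dump($output);
```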

But it seems you do need the capturing groups. In that case you should rewrite your RegEx for the sake of performance. Backtracking is almost always the main thing to consider when optimizing a RegEx.

Update #1

Solution:

(?(DEFINE)
  (?<recurs>
    (?! {@|@} ) [^|] [^{@|\\]* ( \\.[^{@|\\]* )* | (?R)
  )
)
{@
(?<If> \w+)-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^{@|]*+ (?&recurs)* )
(?<False> [|] (?&recurs)* )?
\s*@}

Live demo

PHP code (watch backslash escaping):

preg_match_all('/(?(DEFINE)
  (?<recurs>
    (?! {@|@} ) [^|] [^{@|\\\\]* ( \\\\.[^{@|\\\\]* )* | (?R)
  )
)
{@
(?<If> \w+ )-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^{@|]*+ (?&recurs)* )
(?<False> [|] (?&recurs)* )?
\s*@}/x', $string, $matches);

This is your own RegEx, optimized to take as few backtracking steps as possible. Whatever your original was supposed to match is matched by this one too.
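For illustration, this is how the named groups can be read back for the nested sample from the question. It is only a usage sketch: the pattern is the one given above, while the variable names $input and $count are assumptions of this example.

```php
<?php
// Recursive pattern from above (x flag: whitespace in the pattern is ignored).
$pattern = '/(?(DEFINE)
  (?<recurs>
    (?! {@|@} ) [^|] [^{@|\\\\]* ( \\\\.[^{@|\\\\]* )* | (?R)
  )
)
{@
(?<If> \w+ )-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^{@|]*+ (?&recurs)* )
(?<False> [|] (?&recurs)* )?
\s*@}/x';

// Nested sample 4) from the question.
$input = '{@if-statement|echo this| {@if-statement2|echo this|echo otherwise@} @}';

$count = preg_match_all($pattern, $input, $matches);

// Only the outer block is reported; the nested block stays inside "False".
echo $count, PHP_EOL;
echo $matches['If'][0], ' / ', $matches['Condition'][0], PHP_EOL;
echo $matches['True'][0], PHP_EOL;
echo $matches['False'][0], PHP_EOL;
```

Because the nested block is returned as part of the outer block's False branch, a callback would have to process that branch again to resolve the inner statement.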

RegEx that does not follow nested if blocks:

{@
(?<If> \w+)-
(?<Condition> (%?\w++ (:\w+)*)* )
(?<True> [|] [^|\\]* (?: \\.[^|\\]* )* )
(?<False> [|] \X*)?
@}

Live demo

Most of the quantifiers are written possessively, by appending + to them, which prevents backtracking into them.
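A tiny, made-up example shows what "possessive" means in practice: a greedy quantifier gives characters back when the rest of the pattern needs them, while a possessive one never does.

```php
<?php
// Greedy: a+ first takes "aaa", then gives one "a" back so the final "a" can match.
$greedy = preg_match('/a+a/', 'aaa');

// Possessive: a++ keeps everything it consumed, so the final "a" has nothing left to match.
$possessive = preg_match('/a++a/', 'aaa');

var_dump($greedy, $possessive);
```

The second call fails quickly instead of trying shorter and shorter runs of "a", which is the saving the patterns above rely on.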

revo
  • I'm not sure why, but when I paste Your regex into my PHP script it returns NULL. I added / at the beginning and the end, and also placed x flag. Expression that You provide in a comment works... – instead Sep 25 '16 at 19:21
  • This one works. Thank You. I'll check if that's what I need – instead Sep 25 '16 at 19:39
  • Also I made an update for matching infinite occurrences of pattern `:sub%statement2`. Please check. – revo Sep 25 '16 at 19:48
  • And if you don't need to match nested `if` blocks it can get shorter. – revo Sep 25 '16 at 19:55
  • Non-nested ones are simple, and I have a working and efficient example, at least efficient enough to work with my code. I need them nested so they are easier to work with. But I didn't expect that the regex would have to be so advanced... – instead Sep 25 '16 at 20:01
  • Ohh, now after testing I see what you meant by matching nested ones. I didn't understand before. Anyway, this one is cool, but could you please add to your answer an option without matching nested ones? Then we're good to go. – instead Sep 25 '16 at 23:20
  • I added a RegEx that doesn't follow nested `if` blocks. Please check. – revo Sep 26 '16 at 10:34
3

The problem, as you can see, is that your pattern is inefficient. The main reasons are:

  • You use subpatterns of the form (a+)+b, which is the surest way to cause catastrophic backtracking.
  • You also use subpatterns of the form (a|b)+, which may be a sound design in general but is costly for a backtracking regex engine like PCRE.
  • You use the U modifier for no apparent reason; it makes all your quantifiers non-greedy and generates a lot of useless tests.

As an aside, there are too many useless capture groups, which consume memory for nothing. When you don't need a capture group, don't write it. If you really need to group elements, use a non-capturing group, but don't use non-capturing groups just to make a pattern "more readable" (there are other ways to do that, such as named groups, free-spacing and comments).


If I understand correctly, you are trying to build a regex for preg_replace_callback to handle the control statements of your template system. Since these control statements can be nested and a regex engine can't match the same substring several times, you have to choose between several strategies:

  1. You can write a recursive pattern that describes a conditional statement which may itself contain other conditional statements.

  2. You can write a pattern that matches only the innermost conditional statements. (In other words, it forbids nested conditional statements.)

In both cases, you need to parse the string several times until there's nothing left to replace. (Note that you can also use a recursive function with the first strategy, but it makes things more complicated.)

Let's see the second way:

$pattern = '~
{@ (?<cond> \w+ ) - (?<stat> \w+ (?: % \w+ )* ) (?: : (?<sub> \w+ ) )? \|

# a "THEN" part that doesn\'t have nested conditional statements
(?<then> [^{|@]*+ (?: { (?!@) [^{|@]* | @ (?!}) [^{|@]* )*+ )

# optional "ELSE" part (the content is similar to the "THEN" part)
(?: \| (?<else> \g<then> ) )? (*SKIP) @}~x';

$parsed_view = $string;
$count = 0;

do {
    $parsed_view = preg_replace_callback($pattern, function ($m) {
        // Do what you need here. The different captures can be
        // easily accessed by name: $m['cond'], $m['stat']...
        // as defined in the pattern.
        // $result is a placeholder for the replacement string you build.
        return $result;
    }, $parsed_view, -1, $count);
} while ($count);

pattern demo

As you can see, the problem of nested statements is solved with the do..while loop and the count parameter of preg_replace_callback, which tells whether anything was replaced.

This code isn't tested, but I'm sure you can complete it and adapt it to your needs.
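To make the loop concrete, here is one possible completion. It is only a sketch under assumptions: the $flags table and the rule "keep THEN when the flag is true, otherwise ELSE" are invented for the example, and the input string is made up. The pattern is the one shown above with its comments removed.

```php
<?php
// Pattern from above: matches only innermost conditional statements.
$pattern = '~
{@ (?<cond> \w+ ) - (?<stat> \w+ (?: % \w+ )* ) (?: : (?<sub> \w+ ) )? \|
(?<then> [^{|@]*+ (?: { (?!@) [^{|@]* | @ (?!}) [^{|@]* )*+ )
(?: \| (?<else> \g<then> ) )? (*SKIP) @}~x';

// Hypothetical condition table; a real template system would get these elsewhere.
$flags = ['a' => true, 'b' => false];

// Made-up input with one statement nested inside another.
$parsed_view = '{@if-a|X {@if-b|yes|no@} Y|Z@}';
$count = 0;

do {
    $parsed_view = preg_replace_callback($pattern, function ($m) use ($flags) {
        // Assumed rule: keep THEN when the flag is truthy, otherwise ELSE (if any).
        $isTrue = !empty($flags[$m['stat']]);
        return $isTrue ? $m['then'] : ($m['else'] ?? '');
    }, $parsed_view, -1, $count);
    // Stop when nothing was replaced, or when the engine reported an error (NULL).
} while ($parsed_view !== null && $count);

echo $parsed_view, PHP_EOL;
```

The first pass resolves the inner statement, the second pass resolves the outer one, and the third pass finds nothing and ends the loop.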


As an aside, plenty of template engines already exist (and PHP itself is already a template engine). You can use one of them and avoid creating your own syntax. You can also take a look at their code.

Casimir et Hippolyte
  • FWIW, i followed the same approach once with a similar template engine (embedded in LaTeX code) and failed hard. It works, as it mitigates the PCRE backtrack limit. But it will get painfully slow with larger inputs and a lot of nesting. I solved that by transforming the input to some XML representation, using the DOM parser to do the actual logic, and then converting the XML back to my standard representation (LaTeX). Processing time dropped from ~30s to ~200ms. Totally worth the complexity when you're not willing to write your own lexer/parser. (which would become even more complicated) – Kaii Sep 25 '16 at 13:48
  • @Kaii: Note that with the pattern I suggest, you can't reach the backtrack limit, since quantifiers are possessive or subpattern are enclosed in atomic groups (or groups acting as atomic groups). Other thing since the pattern starts with literal characters, a fast algorithm is used to select positions where the pattern will be tested. The `(*SKIP)` verb avoids to retry already tested substrings. Whatever the size of the string, the search is very fast. When you reach the backtrack limit (or when the pattern takes too much time) only the pattern design is in cause not the tool itself. – Casimir et Hippolyte Sep 25 '16 at 14:48
  • @Kaii: Your idea (converting to XML) is interesting if you want to edit the tree, but it adds one more step if you only want to produce a document. – Casimir et Hippolyte Sep 25 '16 at 14:48
  • just remembered that i posted problem and solution here on stackoverflow, see http://stackoverflow.com/questions/20903722 for details – Kaii Sep 25 '16 at 18:33
  • When I test this regex against `{@if-statement:sub%statement2| ...` it says no match. – instead Sep 25 '16 at 19:24
  • @instead: indeed, but you didn't specify that the sub part can come after each statement; currently the "sub" part is only allowed after the last statement. You can easily correct that: replace `(?<stat> \w+ (?: % \w+ )* ) (?: : (?<sub> \w+ ) )?` with `(?<stat> \w+ (?: : \w+ )? (?: % \w+ (?: : \w+ )? )* )` – Casimir et Hippolyte Sep 25 '16 at 19:40
  • Yes, I didn't specify that, sorry. Now it does the job. I'll check in a while whether it works the way I need. – instead Sep 25 '16 at 19:49
  • OK, I tested this. It's a very nice solution, and thank you for the advice, but @revo's solution is what I really need - not so deep matching. Appreciate it, +1 – instead Sep 25 '16 at 23:42