
I have run up against an odd problem. It appears I am reaching some sort of limit with preg_match when a pattern contains two capture groups, using php-5.3.3.

// works fine
$pattern_1 = '?START(.*)STOP?';
$string = 'START' . str_repeat('x', 9999999) . 'STOP';
preg_match($pattern_1, $string, $matchedArray);

$pattern_2 = '?START-ONE(.*)STOP-ONE.*START-TWO(.*)STOP-TWO.*?';

// works fine
$string = 'START-ONE this is head stuff STOP-ONE  START-TWO' . str_repeat('x', 49970) . 'STOP-TWO';
preg_match($pattern_2, $string, $matchedArray_2);

// does not work
$string = 'START-ONE this is head stuff STOP-ONE  START-TWO' . str_repeat('x', 49971) . 'STOP-TWO';
preg_match($pattern_2, $string, $matchedArray_3);

The first case, with only one capture group, uses a very large string and has no problems.

The second case has a string length of 50,026 and works fine. The last case has a string length of 50,027 (one more) and the match no longer works. Since the exact threshold (49971 here) can vary, it can be changed to something larger to reproduce the problem.

Any ideas or thoughts? Is this perhaps a PHP version issue? A possible workaround might be to use only one capture group per pattern and run preg_match twice.

edwardsmarkf

1 Answer


OK, PHP is not very talkative about regex errors. It just returns false for the last case, which simply tells you that an error occurred, per the PHP docs.

I've reproduced the problem using PCRE (the regex engine used by preg_match) in C# (but with a much higher character count), and the error I'm getting is PCRE_ERROR_MATCHLIMIT.

This means you're hitting the backtracking limit set in PCRE. It's just a safety measure to prevent the engine from looping indefinitely, and I think your PHP configuration sets it to a low value.
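A minimal sketch of how the failure could be confirmed from PHP itself: after preg_match() returns false, preg_last_error() reports PREG_BACKTRACK_LIMIT_ERROR when the limit described above was hit. The helper name match_or_explain() is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: run preg_match() and describe the outcome.
function match_or_explain($pattern, $subject)
{
    $matches = array();
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            return 'backtrack limit exceeded';
        }
        return 'other PCRE error, code ' . $code;
    }

    return $result === 1 ? 'matched' : 'no match';
}
```

Calling it with $pattern_2 and the failing $string from the question should return the backtrack-limit message on a setup like the one described. Newer PHP versions enable PCRE's JIT by default, which counts the limit differently, so the threshold may not reproduce there unless pcre.jit is turned off.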

To fix the issue, you can set a higher value for the pcre.backtrack_limit PHP option which controls this limit:

ini_set("pcre.backtrack_limit", "10000000"); // Actually, this is PCRE's default
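PHP ships with its own default for pcre.backtrack_limit, which is lower than PCRE's: 100000 before PHP 5.3.7 and 1000000 since. If raising the limit for the whole script feels too broad, one option is to raise it around a single call and then restore it. The sketch below assumes a hypothetical wrapper named preg_match_with_limit(); it relies only on ini_set() returning the previous value.

```php
<?php
// Hypothetical wrapper: run one preg_match() with a temporary backtrack limit.
function preg_match_with_limit($pattern, $subject, &$matches, $limit)
{
    // ini_set() returns the old value as a string, or false on failure.
    $previous = ini_set('pcre.backtrack_limit', (string) $limit);

    $result = preg_match($pattern, $subject, $matches);

    // Put the old limit back so the rest of the script is unaffected.
    if ($previous !== false) {
        ini_set('pcre.backtrack_limit', $previous);
    }

    return $result;
}
```

Either way the change lasts only for the current request; php.ini itself is not modified.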

On a side note:

  • You should probably replace (.*) with (.*?) to get less useless backtracking and for correctness (otherwise the regex engine will get past the STOP string and will have to backtrack to reach it).
  • Using ? as a pattern delimiter is a bad idea since it prevents you from using the ? metacharacter and therefore applying the above advice. Really, you should never use regex metacharacters as pattern delimiters.
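Putting both points together, the second pattern from the question might be rewritten roughly as follows. This is a sketch, not a drop-in fix: the "~" delimiter is an arbitrary choice, and the trailing .* from the original pattern is dropped because it captures nothing.

```php
<?php
// Same structure as $pattern_2, but with a non-metacharacter delimiter
// and lazy quantifiers, so each group stops at the first STOP marker.
$pattern = '~START-ONE(.*?)STOP-ONE.*?START-TWO(.*?)STOP-TWO~';

$string = 'START-ONE this is head stuff STOP-ONE  START-TWO'
        . str_repeat('x', 49971) . 'STOP-TWO';

$matches = array();
$result = preg_match($pattern, $string, $matches);

// $matches[1] holds the text between the first pair of markers,
// $matches[2] the text between the second pair.
```

A lazy quantifier still advances one character at a time, so a very long capture can still reach the limit. What it avoids is the greedy version's habit of running to the end of the string and backtracking all the way to the first marker.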

If you're interested in more low-level details, here's the relevant bit of the PCRE docs (emphasis mine):

The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats.

Internally, pcre_exec() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.

When pcre_exec() is called with a pattern that was successfully studied with a JIT option, the way that the matching is executed is entirely different. However, there is still the possibility of runaway matching that goes on for a very long time, and so the match_limit value is also used in this case (but in a different way) to limit how long the matching can continue.

The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by supplying pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

A value for the match limit may also be supplied by an item at the start of a pattern of the form

 (*LIMIT_MATCH=d)

where d is a decimal number. However, such a setting is ignored unless d is less than the limit set by the caller of pcre_exec() or, if no such limit is set, less than the default.

Lucas Trzesniewski
  • Your point about avoiding metacharacters is **excellent**, thank you for pointing that out to me. I am actually reading in webpages and trying to parse out the body section, hence the need for the parentheses (but I bet you are going to tell me that PHP has a function for just that). It always feels a little strange to change the php.ini just for one program. I am now just using two preg_match calls instead but will try your suggestion soon. Thanks again. – edwardsmarkf Jan 09 '15 at 22:29
  • You're welcome. And yes you should probably use [better tools](http://stackoverflow.com/a/3577662/3764814) since parsing HTML with regex [is not for everyone](http://stackoverflow.com/a/1732454/3764814) :) – Lucas Trzesniewski Jan 09 '15 at 23:34
  • Also, `ini_set` changes the value only for the current request, it's not permanent (it doesn't change php.ini). So you can go ahead and use it. – Lucas Trzesniewski Jan 09 '15 at 23:39
  • Sorry, I misunderstood you about using ini_set rather than php.ini. Good idea (once again). – edwardsmarkf Jan 10 '15 at 23:02