
I'm trying to parse some web pages with preg_match_all(), and some of them are quite large, several MB in size. One of my regular expressions is supposed to match text strings that are so long that it doesn't seem to be able to match and capture them; it simply returns an empty string.

One of the strings is 1.32 MB (1,393,557 bytes) when I manually select it and save it as a .txt file.

When the string is much shorter, just tens of thousands of bytes, the same regular expression matches and captures it successfully.

So my question is: it occurs to me that there's a limit on the maximum length of string preg_match_all() can match. What is it, and how can I set it larger?
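Here's a minimal sketch of the kind of thing I'm doing, with a placeholder pattern and file name for illustration only:

```php
<?php
// Placeholder pattern and file name, for illustration only.
$html    = file_get_contents('large_page.html');
$pattern = '/<div class="content">(.*?)<\/div>/s';

$count = preg_match_all($pattern, $html, $matches);

// On the multi-MB pages the call comes back empty; $count is false
// and preg_last_error() can reveal why.
var_dump($count, preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR);
```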

datasn.io
  • Why not use a DOM parser to parse the web page? – anubhava Aug 18 '13 at 06:46
  • @anubhava, because web pages can be non-compliant? – datasn.io Aug 18 '13 at 06:48
  • I just tested matching \d in a 27.6 MB subject and it worked fine. –  Aug 18 '13 at 06:49
  • That is more of a reason to use DOM since there are more chances that regex will break and give unexpected results. – anubhava Aug 18 '13 at 06:50
  • @kavoir.com: You think a regex is going to handle broken HTML any better? Regexes suck with languages like HTML as it is. PCRE's only saving grace is that it allows recursion...but even with that, it can get hairy if the HTML is wacky or the regex isn't very carefully written to avoid matching nothing over and over again. – cHao Aug 18 '13 at 06:52
  • @cHao, thanks for the reasons, I think I'll take a look into a DOM parser for the job. Are there any pages on stackoverflow.com that deal with this issue? So I can get more input on which to use. Do you have any working examples where a DOM parser is preferable over regex? – datasn.io Aug 18 '13 at 07:11
  • @kavoir.com: Any time you can't guarantee the prettiness of the HTML you're being handed, an honest-to-goodness HTML parser will generally make more sense of it than a regex will. For a pathological example, let's take ` – cHao Aug 18 '13 at 08:27
  • @cHao, Thank you! I'll use a DOM parser for the job from now on, probably combined with regex where handy. – datasn.io Aug 19 '13 at 06:21
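For reference, here's a minimal sketch of the DOM approach suggested in the comments above (the element and class names are made up for illustration):

```php
<?php
// Element and class names here are hypothetical.
$html = file_get_contents('large_page.html');

libxml_use_internal_errors(true);   // don't warn on broken markup
$doc = new DOMDocument();
$doc->loadHTML($html);              // parses even non-compliant HTML
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="content"]') as $node) {
    echo $node->textContent, "\n";
}
```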

2 Answers


Set pcre.backtrack_limit to whatever you need, either with ini_set('pcre.backtrack_limit', '1048576'); in your script or in php.ini for global use (this example is 1 MB).

Credit to: http://www.karlrixon.co.uk/writing/php-regular-expression-fails-silently-on-long-strings/
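As a rough sketch (the pattern, file name, and the 10 MB value are just examples; pick a limit that fits your data):

```php
<?php
// Raise the backtracking limit before running the large match.
// 10485760 (10 MB) is an arbitrary margin for a ~1.4 MB subject.
ini_set('pcre.backtrack_limit', '10485760');

// Placeholder pattern and subject.
$html    = file_get_contents('large_page.html');
$pattern = '/<pre>(.*?)<\/pre>/s';

if (preg_match_all($pattern, $html, $matches) === false
        && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
    // Still too low: raise the limit further or rewrite the pattern.
}
```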

probablyup
  • I actually increased it to 10485760 (10 times your value) to solve the problem. Would this impose any memory problems? The larger, the slower or something? – datasn.io Aug 18 '13 at 07:07
  • @kavoir.com - Memory problems? Yes! Be very careful here. Applying certain common regex patterns to long subject strings can easily _crash_ PHP and/or the Apache webserver executable. (Not just fail to match, but actually generate a seg-fault due to a stack overflow.) This happens because PHP sets `pcre.recursion_limit` way above the recommended limit for the PCRE library. See my answer to [RegExp in preg_match function returning browser error](http://stackoverflow.com/a/7627962/433790). Good luck. – ridgerunner Aug 18 '13 at 14:25
  • Unless you have very little memory available (<256 MB), I wouldn't worry about it. Most defaults in PHP are intentionally set low to be conservative with memory. – probablyup Aug 19 '13 at 15:16

I tried increasing pcre.backtrack_limit and pcre.recursion_limit, but neither solved my problem. In my case, the solution was to avoid backtracking altogether by using a possessive quantifier (appending + to the quantifier).
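A contrived sketch of the difference; the pattern here is an illustration, not my actual one:

```php
<?php
// Contrived subject: one large, well-formed block.
$subject = '<pre>' . str_repeat('x', 2000000) . '</pre>';

// Greedy: if a later part of the pattern fails, [^<]* hands characters
// back one at a time, and each step counts toward pcre.backtrack_limit.
preg_match('/<pre>[^<]*<\/pre>/', $subject, $m);

// Possessive: [^<]*+ never gives anything back, so no backtracking
// state accumulates no matter how long the subject is.
preg_match('/<pre>[^<]*+<\/pre>/', $subject, $m);
```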

The section on Quantifiers in The Stack Overflow Regular Expressions FAQ may be helpful if anybody else stumbles onto this.

Paul