1

I'm trying to scrape a series of webpages using PHP, grabbing all of the content between the tag and the earliest tag. This is the regex that I'm using:

|(?<=div id="body">).*?</div>|s

This seems to be working perfectly fine for most of the pages I'm looking at. However, it's not returning anything for a few others. I plugged the regex into the regex101.com tester, and it told me that the problem was with catastrophic backtracking. I tried removing the lookbehind language, and even playing around with things like:

|id="body">.*?</div>|s

However, the problem is still persisting. I've looked at some other questions about catastrophic backtracking, as well as the http://www.regular-expressions.info/catastrophic.html article, but I can't figure out how to apply their fixes to this particular case.

EAP
  • 13
  • 3
  • 1
    I don't see that falling into catastrophic backtracking. Can you link us to the regex101 example you tried? – Mariano Nov 03 '15 at 19:48
  • Sure. https://regex101.com/r/kY8qK0/1 – EAP Nov 03 '15 at 19:51
  • Why do you need a lookbehind? `div id="body">.*?` works just as well. –  Nov 03 '15 at 19:53
  • Since that is a fixed width lookbehind and the end is a literal, the `.*?` should not cause backtrack problems. –  Nov 03 '15 at 19:57
  • Even without the lookbehind (i.e., using your version), the regex throws a catastrophic backtracking error. – EAP Nov 03 '15 at 19:57
  • I think its the length of the match text. It matches `len 105695` when I run it from [regexformat7 app](http://www.regexformat.com) –  Nov 03 '15 at 20:04
  • This isn't catastrophic backtracking. It's only a timeout. It takes too long to process a match from char 3,962 to char 108,334. I guess regex101 assumes it's due to a catastrophic backtracking (logically, they didn't expect a user entering a *catastrophically* long text :-) ... That said, don't use regex to parse HTML – Mariano Nov 03 '15 at 20:05
  • Ah; thanks! If not regex, then, how should I parse HTML? – EAP Nov 03 '15 at 20:05
  • It's a catastrophically long match. –  Nov 03 '15 at 20:06
  • 1
    [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/q/3577641/5290909) – Mariano Nov 03 '15 at 20:07

1 Answers1

0

Regular expressions are known to cause catastrophic backtracking with large HTML contents. In this case, the problem is surely with the look-behind and lazy dot matching, when each time the regex engine advances one symbol to the right, it must check if the symbol is preceded with the specified substring, and if it reached enough characters to yield a match.

A good idea of how this regex works is looking at the regex101 regex debugger section.

As to how to parse your HTML, PHP DOMDocument and DOMXPath are your best friends:

$html = "<<YOUR_HTML_STRING_HERE>>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Above is the DOM initialization from string example, below is parsing
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[@id="body"]'); // Get all DIV tags with id=body

foreach($divs as $div) { 
  echo $dom->saveHTML($div); // Echo the HTML, can be added to array
}

See IDEONE demo

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Thank you very much! This is exactly what I needed. – EAP Nov 03 '15 at 20:48
  • @stribizhev Are you positive this is *catastrophic backtracking*, as the problem isn't really with exponential backtracking steps as it would occur with nested quantifiers? -Notice it raises the same error without the lookbehind, which I believe it's a plain old *O(n)* timeout. – Mariano Nov 03 '15 at 20:53
  • @Mariano: I use the term catastrophic backtracking widely. Timeout is caused by a huge amount of backtracking steps. – Wiktor Stribiżew Nov 03 '15 at 21:04
  • I see. And I don't know if there's a formal definition either. I'll look into it tomorrow – Mariano Nov 03 '15 at 21:07
  • I wouldn't say it's catastrophic backtracking. For an average HTML document, this is at most a linear operation. However, in a long HTML document, the backtracking on `.*?` to advance to the next character exhausts the backtracking limit, leading to the match being interrupted. – nhahtdh Nov 04 '15 at 05:03