Escape catastrophic backtracking in HTML markup

Question

Like I said in the title, my data set is markup, and it looks somewhat like this:

<!DOCTYPE html>
<html>
<head>
    <title>page</title>
</head>
<body>
<main>

<div class="menu">
    <img src=mmayboy.jpg>
    <p> stackoverflow is good </p>
</div>

<div class="combine">
    <p> i have suffered <span>7</span></p>
</div>
</main>
</body>
</html>

My regex tries to match each of the node blocks above separately, i.e. I can attempt to match combine or menu. This is what the regex looks like in one piece; I break down its internals just below it.

/(<div class="menu">(\s+.*)+<\/div>(?:(?=(\s+<div))))/

It attempts to dive into that markup and grab the desired node block. That is all. As for the internals, here we go:

/
(
 <div class="menu"> // match text that begins with these literals
  (
   \s+.*
  )+ /* match any white space or character after previous. But the problem is that this matches up till the closing tag of other DIVs i.e greedy. */
  <\/div> // stop at the next closing DIV (this catches the last DIV)
  (?: // begin non-capturing group 
   (?=
    (
     \s+<div
     ) /* I'm using the positive lookahead to make sure the previous match is followed by white space and a new DIV tag. This is where the catastrophic backtracking is raised. */
   )
  )
 )
/

I've indented it and added comments to aid anyone willing to help. I have also looked for a solution in blogs and the manual. They say it's caused by an expression having too many possible ways to match, and that it can be remedied by reducing the number of possible outcomes, e.g. +? instead of *, but as hard as I've tried, I'm unable to apply any of it to my current dilemma.
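
To make "too many possible ways to match" concrete, here is a rough sketch that counts how many different ways the group (\s+.*)+ could consume a run of n spaces on a single line. The helper name countWays is made up for illustration, and the count only approximates the work a backtracking engine may face when the rest of the pattern fails; it is not a measurement of any particular engine.

```javascript
// countWays is a hypothetical helper: it counts the distinct ways
// (\s+.*)+ can consume a run of exactly n spaces.
// Each iteration of the group takes m >= 1 spaces, and inside that
// iteration \s+ may take 1..m of them (the rest go to .*): m choices.
function countWays(n, memo = new Map()) {
  if (n === 0) return 1; // nothing left to consume: one way to finish
  if (memo.has(n)) return memo.get(n);
  let total = 0;
  for (let m = 1; m <= n; m++) {
    total += m * countWays(n - m, memo);
  }
  memo.set(n, total);
  return total;
}

for (const n of [4, 10, 20]) {
  console.log(n, countWays(n));
}
// prints:
// 4 21
// 10 6765
// 20 102334155
```

The count grows exponentially with the amount of white space, which is why indented markup is enough to stall the pattern.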

[*Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms*](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la?s=1|3.4660). For balance: [*Oh Yes You Can Use Regexes to Parse HTML!*](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491): *… optimal for small HTML parsing problems, pessimal for large ones*. — RobG, Mar 29 '17 at 01:03

Ry- · Accepted Answer · 2017-03-29T01:43:47.343

(\s+.*)+

can probably be simplified to just

[^]*?

which should prevent catastrophic backtracking. Overall simplification:

/<div class="menu">[^]*?<\/div>/
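
As a quick check, the simplified pattern can be run against the two blocks from the question. This is only a sketch: extractBlock is a made-up helper name, and it assumes the class name is a plain word with no regex metacharacters in it.

```javascript
// Sample markup from the question, trimmed to the two <div> blocks.
const data = `<main>

<div class="menu">
    <img src=mmayboy.jpg>
    <p> stackoverflow is good </p>
</div>

<div class="combine">
    <p> i have suffered <span>7</span></p>
</div>
</main>`;

// extractBlock is a hypothetical helper, not part of the answer above.
// [^]*? lazily matches any character, newlines included, so the match
// stops at the first </div> after the opening tag.
// Assumption: className contains no regex metacharacters.
function extractBlock(html, className) {
  const pattern = new RegExp(`<div class="${className}">[^]*?</div>`);
  const match = html.match(pattern);
  return match ? match[0] : null;
}

console.log(extractBlock(data, 'menu'));
console.log(extractBlock(data, 'combine'));
```

Because the quantifier is lazy and there is only one way to advance (one character at a time), the engine has nothing to backtrack over.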

Have you considered using an HTML parser instead, though?

var parser = new DOMParser();
var doc = parser.parseFromString(data, 'text/html');
var menu = doc.getElementsByClassName('menu')[0];

console.log(menu.innerHTML);

edited Mar 29 '17 at 01:43

answered Mar 29 '17 at 00:24

Yes, use an HTML parser. – RobG Mar 29 '17 at 01:05
The "simplified" regex cannot work because there are plenty of space characters before alphanumerics and symbols. Secondly, DOMParser is still experimental and as such isn't yet available on nodejs (which happens to be where I need it) – I Want Answers Mar 29 '17 at 01:32
@Mmayboy: Ah, sorry, regex updated. But use [an HTML parser for Node](https://www.npmjs.com/package/cheerio), then. (And that’s not why it isn’t available on Node.) – Ry- Mar 29 '17 at 01:43
You should update it again from [^]*? To [^.]*? :p Do I still need a node parser module when this tiny Dom element is all I want? Better still, do you mind explaining how this manages to work? – I Want Answers Mar 29 '17 at 01:55
@Mmayboy: Er, which part needs to be updated again? And yes, if you want reliable HTML parsing, you should use an HTML parser. If another `<div>` ever ends up inside `<div class="menu">`, you’ll find it’s impossible to use regular expressions to extract the single element, for example. – Ry- Mar 29 '17 at 01:56
@Ryan I just discovered full stops are taken literally in a character class. So the regex will fail woefully if a dot ever strays into the desired markup. Even though it was smart on your part to notice I had no dot in the dataset, it's still nothing short of a miracle that this works seamlessly – I Want Answers Mar 29 '17 at 02:27
@Mmayboy: Eh? `[^]` matches a dot as well. It’s not a typo of `[^.]`. – Ry- Mar 29 '17 at 02:28
The error it gives is of an incomplete character class. It doesn't parse unless I add something to it. The dot worked and seemed logical since I had no dots in the string – I Want Answers Mar 29 '17 at 09:44
@Mmayboy: The error what gives? Something other than JavaScript? `/[^]/` is a valid JavaScript regular expression. If you’re using it somewhere else, `[\s\S]` instead, or `.` and pass a DOTALL flag. – Ry- Mar 29 '17 at 16:19
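
The spellings discussed in these comments can be compared side by side in JavaScript. The following is a minimal sketch with made-up sample strings; the `s` (dotAll) flag assumes an ES2018-or-later engine, and the last case illustrates the nested-element limitation mentioned above.

```javascript
// Made-up sample: the block spans several lines and contains a full stop.
const html =
  '<div class="menu">\n    <p> stackoverflow is good. </p>\n</div>\n' +
  '<div class="combine"></div>';

const anyChar = /<div class="menu">[^]*?<\/div>/;      // JavaScript-only spelling
const portable = /<div class="menu">[\s\S]*?<\/div>/;  // accepted by most engines
const dotAll = /<div class="menu">.*?<\/div>/s;        // `.` also matches newlines
const notDot = /<div class="menu">[^.]*?<\/div>/;      // anything except a full stop

const expected = '<div class="menu">\n    <p> stackoverflow is good. </p>\n</div>';

console.log(anyChar.exec(html)[0] === expected);  // true
console.log(portable.exec(html)[0] === expected); // true
console.log(dotAll.exec(html)[0] === expected);   // true
console.log(notDot.test(html));                   // false: the "." blocks the match

// Nested case: the lazy match stops at the inner </div>,
// so the extracted element is cut short.
const nested = '<div class="menu"><div>inner</div><p>tail</p></div>';
console.log(anyChar.exec(nested)[0]);
// prints: <div class="menu"><div>inner</div>
```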
