1

I'm using the node.js dom-parser which (unideally) pulls tags out of the DOM using regular expressions.

You can find dom-parser at: https://github.com/ershov-konst/dom-parser

Occasionally, the HTML of some web pages (eg https://www.ecosia.org/ ) causes the node.js app to hang. I've tested using a plain vanilla matching script and found that tagRegExp causes the script to hang (perhaps because of catastrophic backtracking?)

I'm actually using it to find link rel="canonical" and a href="xyz" (if any, ecosia has no canonical).

tagRegExp:

/(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi

Pure JS test script:

<script type="text/javascript">
var text = '... html source ...';
var text_esc = text
text_esc = text_esc.replace(/\</g, "&lt;");
text_esc = text_esc.replace(/\>/g, "&gt;");
var regex = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;
var found = text.match(regex);
var found_len = found.length;

document.write("Text: " + text_esc + "<br /><br />" + "Regex pattern: " + regex + "<br /><br />");

document.write("Matches: " + found_len + "<br /><br />");

for (var i=0;i<found_len;i++)
{
    found[i] = found[i].replace(/\</g, "&lt;");
    found[i] = found[i].replace(/\>/g, "&gt;");

    document.write("[" + i + "]: " + found[i] + "<br /><br />");
}
</script>

Any ideas most welcome. Thanks in advance.

Eric Twose
  • 173
  • 2
  • 9

1 Answers1

1

The problem is caused by the [\s\S]*? patterns and an inefficient (x|[^x])* like pattern.

You may use

/(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi

The '[\s\S]*?' is turned into '[^']*' where [^']* is a greedily quantified negated character class matching any char other than ' and "[\s\S]*?" is dealt with in the same way. A negated character class is better than .*? lazy counterpart as it matches all chars other than the specified one(s) in one go and the regex engine does not have to try out all subsequent subpatterns after this pattern and then expand each time upon a failure.

The ([^<]|<(?![a-z\/]))* can be unrolled as [^<]*(?:<(?![a-z\/])[^<]*)*, it will match the same texts but quicker (same as before, the negated character class patterns with greedy quantifiers go through the text faster).

Note I also removed a couple of redundant non-capturing groups.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563