I know there are tons of questions similar to this, but this one is specific to my regular expression. I'm trying to check whether a string has any HTML in it.

Regex tagRegex = new Regex(@"<\s*([^ >]+)[^>]*>.*?<\s*/\s*\1\s*>");
if (tagRegex.IsMatch(body))
{
  // do something
}

But it fails at the IsMatch call because of catastrophic backtracking. Can anyone tell me what the issue with the regular expression is?

Thank you

Hans Passant
David
  • Yes, this segment `( [^ >]+ )` blows past tags like `` or `` then proceeds to go to the end trying to fit `\1` into it. It's a slow process. –  Dec 19 '19 at 17:55
  • @x15: So what is the best way to fix this? – David Dec 19 '19 at 17:56
  • Are you looking to match _invisible content_ or just open/close tags? –  Dec 19 '19 at 17:56
  • Relevant: https://stackoverflow.com/questions/15458876/check-if-a-string-is-html-or-not/15458987 – Joel Wiklund Dec 19 '19 at 17:56
  • I would recommend using this for invisible content: `<(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*>` –  Dec 19 '19 at 18:02
  • @X15: I just want to check whether a string has any HTML content in it. If not, I'll have to manually convert the text to HTML. If it does, I don't have to do anything. – David Dec 19 '19 at 18:03
  • You can use this `<(?:([\w:]+)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*>` but it will match like ` to `. I would suggest using the next, tag-only regex to see if it has HTML in it. See next. –  Dec 19 '19 at 18:05
  • All html/xml tag parsing, one at a time: `<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` –  Dec 19 '19 at 18:06
  • @X15: Thank you! I'll try that – David Dec 19 '19 at 18:09
  • Deny of service with backtracking: https://www.meziantou.net/regex-deny-of-service-redos.htm – Damian Dec 30 '19 at 00:52

1 Answer

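Use of the * quantifier in regular expressions is where most backtracking occurs. It's like saying, "Well, there might be a thing there, but there might not, so keep looking." When the rest of the pattern fails to match, the engine goes back and retries every other length that quantifier could have taken, and that indecision multiplies when several such quantifiers sit in one pattern.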

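The issue with your pattern is that it tries to do everything and ends up getting nowhere because of backtracking. Its `[^ >]+`, `[^>]*` and `.*?` parts are all open-ended, and the `\1` backreference forces the engine to retry them whenever no matching closing tag is found. Keep patterns tight by naming the specific things you want to find, and use * sparingly, if at all.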

Shorten your pattern to one rule, then add more rules to it as you need them. It becomes a tradeoff between full compliance and speed; you need to make that call.
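As a rough sketch of what "one rule" could look like here, the pattern below only asks "is there anything shaped like an opening or closing tag?" and drops both the backreference and the lazy `.*?`. The class name `HtmlSniffer`, the method `LooksLikeHtml`, and the pattern itself are my own illustration, not something from the question, and this is a heuristic rather than HTML validation: text such as `<y and z>` would still count as a tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names and pattern are assumptions.
public static class HtmlSniffer
{
    // One rule only: "<", an optional "/", a tag name that starts with a letter,
    // optional attribute text that contains no ">", an optional "/", then ">".
    // There is no backreference and no ".*?", so a failed attempt never has to
    // rescan the rest of the string looking for a matching closing tag.
    private static readonly Regex AnyTag =
        new Regex(@"</?[A-Za-z][A-Za-z0-9:-]*(?:\s[^>]*)?/?>");

    public static bool LooksLikeHtml(string body)
    {
        // Null or empty input cannot contain a tag.
        return !string.IsNullOrEmpty(body) && AnyTag.IsMatch(body);
    }
}
```

In the question's code this would replace the `tagRegex.IsMatch(body)` check with `HtmlSniffer.LooksLikeHtml(body)`. If you later need stricter rules (comments, DOCTYPE, and so on), add them one alternative at a time and re-test the speed after each.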


See "Take Charge of Backtracking" in the MS Docs for more information.
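One way to take charge when you cannot fully trust a pattern or its input is to give the match a time limit. .NET (4.5 and later) has a `Regex(string, RegexOptions, TimeSpan)` constructor, and a match that runs past the limit throws `RegexMatchTimeoutException`. The wrapper below is only a sketch: the `GuardedMatch` class, the `TryIsMatch` name, and the choice to return `null` on timeout are assumptions of mine.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper for illustration; the names are assumptions.
public static class GuardedMatch
{
    // Returns true or false when the match finishes in time,
    // or null when the engine gives up after the timeout.
    public static bool? TryIsMatch(string pattern, string input, TimeSpan timeout)
    {
        try
        {
            // The third constructor argument caps how long one match may run.
            var regex = new Regex(pattern, RegexOptions.None, timeout);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // Runaway backtracking: let the caller decide how to treat the input.
            return null;
        }
    }
}
```

A call such as `GuardedMatch.TryIsMatch(pattern, body, TimeSpan.FromSeconds(1))` keeps a bad input from hanging the request. The timeout is a safety net, not a fix: the pattern still needs tightening.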

ΩmegaMan