2

I have following code:

void Main()
{
  string template = @"
aaa 
{begin iteration items} 
  bbbbbb 
  {begin iteration subitems} 
    ccccccc 
  {end iteration subitems} 
  ddddddddd 
  {begin iteration items} 
    hhhhhhhhhhhhhhhhh
  {end iteration items} 
  iiiiiiiiiiiiiiiiiiiiiiiiiiii
{end iteration items} 
eeeeeeeeeeeeeeee
{begin iteration items} 
  ffffff
{end iteration items} 
gggggggggggg
  ";

  string re = @"
\{\s*begin\s+iteration\s+items\s*}
(?<template>
  (
    (?<iteration>\{\s*begin\s+iteration\s+items\s*})
    |(?<-iteration>\{\s*end\s+iteration\s+items\s*})
    |((?!(\{\s*begin\s+iteration\s+items\s*})|(\{\s*end\s+iteration\s+items\s*})).*?)
  )*(?(iteration)(?!))
)
\{\s*end\s+iteration\s+items\s*}
  ";

  Regex r = new Regex(re, RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
  var matches = r.Matches(template);
  matches.Dump();
}

When template is balanced then matches are returned and all is ok. But when I change {end iteration items} to {end1 iteration items} after iiiiiiiiiiiiiii line in template, then code stops to respond on matches.Dump() line (Dump() is an extension method to read/enumerate in LinQPad)

What is wrong? Is it possible to rewrite Regex so that it always respond?

EDIT My goal is to capture all top level <template> groups if syntax is valid, or capture nothing if not. I tried non-backtracking groups as Lucas adviced, but there no captures now when syntax is valid.

shibormot
  • 1,638
  • 2
  • 12
  • 23
  • The most common reason for a regex to stop responding is catastrophic backtracking. I am not sure how to fix it, but it looks like you've got a language with some complexity going on, so you may benefit from building a parser. It is easier than it sounds if you use [a proper tool](http://stackoverflow.com/q/1194584/335858). – Sergey Kalinichenko Jul 31 '15 at 14:03
  • One fixes catastrophic backtracking by avoiding the use of `*` where possible. The zero, in zero or more, is the primary culprit. If one knows there will be items, don't use zero, use one `+` or more. – ΩmegaMan Jul 31 '15 at 14:08
  • @dasblinkenlight thank you, i'll look at it, but i want to be convinced that it not possible with regexes – shibormot Aug 03 '15 at 07:23

1 Answers1

2

You're experiencing catastrophic backtracking here.

In short: a pattern in the form of ((something)*)* with nested quantifiers will trigger it, because the engine has to try all the possible combinations if a match can't be found right away.

You can use an atomic group to guard against it. The following should do the trick:

\{\s*begin\s+iteration\s+items\s*}
(?<template>
  (?>
    (?<iteration>\{\s*begin\s+iteration\s+items\s*})
    |(?<-iteration>\{\s*end\s+iteration\s+items\s*})
    |[^{]+
    |\{
  )*(?(iteration)(?!))
)
\{\s*end\s+iteration\s+items\s*}

Or use ((?>...)) instead of (?>...) if you need capturing.

I simplified the expression - there's no need for the lookahead anymore when using an atomic group, since these cases would be handled by the iteration groups. The last part of the alternative (\{) is here to account for lone opening braces, which are not part of the begin/end sequence. Most of the text is consumed by [^{]+ inside the atomic group, so backtracking cannot happen.

Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
  • Thank you, i'll look a bit later, need to run) – shibormot Jul 31 '15 at 16:00
  • I tried this, it always respond now, but when input string is valid there are no matches (. In both cases, when using `(?>...)` and when using `((?>...))` on `(? – shibormot Aug 03 '15 at 07:18
  • @shibormot I should have tested this, sorry about that. The `*?` didn't play very well inside of `(?>)`. I've fixed (and simplified) the pattern. – Lucas Trzesniewski Aug 03 '15 at 17:18
  • @Lucacs thank you for your work. it looks like new RegEx accepts valid strings, but it also accepts invalid strings too. My bad, i didn't clearly explain, that valid syntax is when there are no unmatched begin/end tags :) But it is another question. I will post it later. Because of time I start writing this piece of code manually. Thank you again for showing this thechnique ) – shibormot Aug 04 '15 at 09:52