I sometimes get a RegexMatchTimeoutException when parsing a short string (under 100 characters). The parsing happens inside a function called from list.Select(..) over a collection of about 30 elements.

I suspect it may be due to a sub-optimal regex. Here's the definition in C#:

internal override Regex Regex => new(
    @$"((.|\s)*\S(.|\s)*)(\[{this.Type}\])",             // Type = "input"
    RegexOptions.Multiline | RegexOptions.Compiled, 
    TimeSpan.FromMilliseconds(Constants.RegexTimeout));  // RegexTimeout = 100

It should capture Sample text in the following string:

Sample text
[input]

Full exception message:

System.Text.RegularExpressions.RegexMatchTimeoutException: 'The Regex engine has timed out while trying to match a pattern to an input string. This can occur for many reasons, including very large inputs or excessive backtracking caused by nested quantifiers, back-references and other factors.'

Line in which the exception occurs:

var label = this.Regex.Match(sectionContent).Groups[1].Value.Trim();

The exception is rather hard to reproduce - with the same input it can happen on the first run or on the 100th. But the bigger the collection of lines to run the Regex against, the bigger the chance of it occurring.

Lemur
  • The regex can be simplified a bit: `.|\s` is the same as `.` because `.` includes space characters. – Good Night Nerd Pride Jan 16 '23 at 16:32
  • Could you show us some example strings that you are matching against? – Good Night Nerd Pride Jan 16 '23 at 16:35
  • The pattern is too messy, it can be simply re-written as `@$"(?s)\s*\S.*?(\[{this.Type}])"` - and remove the unnecessary `RegexOptions.Multiline`. – Wiktor Stribiżew Jan 16 '23 at 16:36
  • @GoodNightNerdPride - it's in the 2nd block of code - below Regex one – Lemur Jan 16 '23 at 17:33
  • Your regex clearly has problems given the `(.|\s)*` clauses that match to the end on every failure/backtrack attempt. It's not appropriate to guess your intent in this case as @WiktorStribiżew does. You should take the advice and give a full sample that is causing the problem to get a real solution. – sln Jan 16 '23 at 20:05
  • We also need to know what you are trying to extract. To me it seems like you want to extract all the text in a line that is followed by the line `$"[{this.Type}]"`. But this is easier and more efficiently done with `string.Split()`. – Good Night Nerd Pride Jan 17 '23 at 08:39
  • Generally you should include more details on the nature of the problem you are trying to solve. Maybe your solution is the wrong way to go anyway. – Good Night Nerd Pride Jan 17 '23 at 08:41
  • Did the answer below address your issue? Please let me know if you need more assistance. – Wiktor Stribiżew Jan 22 '23 at 09:41

1 Answer

Your ((.|\s)*\S(.|\s)*)(\[input\]) regex matches:

  • ((.|\s)*\S(.|\s)*) - Group 1:
    • (.|\s)* - zero or more occurrences of any char other than a newline (.) or (|) any whitespace char (\s)
    • \S - a single non-whitespace char
    • (.|\s)* - zero or more occurrences of any char other than a newline (.) or (|) any whitespace char (\s)
  • (\[input\]) - Group 2: [input].

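Notice that the two (.|\s)* sub-patterns in Group 1 can each match the same characters, and inside each of them . and \s overlap (both match a space or a tab). That gives the engine many equivalent ways to split the same text between them, which can lead to excessive backtracking whenever an attempt fails. \S is the only "anchoring" pattern here: it requires a single non-whitespace char. Since both patterns before and after \S are meant to match any text, the most efficient logic is: match any amount of whitespace, then a non-whitespace char, and then any amount of chars (as few as possible but as many as necessary) up to [input].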

Here is the fix:

internal override Regex Regex => new(
    @$"(?s)(\s*\S.*?)(\[{Regex.Escape(this.Type)}])",             // Type = "input"
    RegexOptions.Compiled, 
    TimeSpan.FromMilliseconds(Constants.RegexTimeout));  // RegexTimeout = 100

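Note that this.Type is passed through Regex.Escape in case it contains any special regex chars. (?s) is the inline modifier version of the RegexOptions.Singleline option (they can be used interchangeably); it makes . match newline chars as well, which is why the (.|\s) alternation is no longer needed.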
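To check the fixed pattern against the sample from the question, here is a minimal, self-contained sketch. The names type, regexTimeoutMs and ExtractLabel are hypothetical stand-ins for this.Type, Constants.RegexTimeout and the extraction line in the question; they are assumptions made only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-ins for this.Type and Constants.RegexTimeout.
const string type = "input";
const int regexTimeoutMs = 100;

// Same pattern as the fix above; built once here to keep the sketch short.
var regex = new Regex(
    @$"(?s)(\s*\S.*?)(\[{Regex.Escape(type)}])",
    RegexOptions.Compiled,
    TimeSpan.FromMilliseconds(regexTimeoutMs));

// Hypothetical helper mirroring the extraction line from the question.
// Returns an empty string when there is no match.
string ExtractLabel(string sectionContent)
{
    var match = regex.Match(sectionContent);
    return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
}

Console.WriteLine(ExtractLabel("Sample text\n[input]")); // prints "Sample text"
```

Group 1 includes the trailing newline before [input], so the Trim() call from the question is still needed.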

Wiktor Stribiżew