2

I've once answered a question about matching a quoted string with escaped quotes.

It seems that there are cases that hang on .NET and crash on Mono (with OutOfMemoryException), for example:

var reg = new Regex(@"^""([^\\""]*(\\"")*(\\[^""])*)*""");
reg.Match("\"                               ");

Two questions:

1) why does this happen?

2) how to improve this regex? I want it to preserve all the "features".

Community
  • 1
  • 1
Piotr Zierhoffer
  • 5,005
  • 1
  • 38
  • 59
  • 2
    This is probably a case of [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html). The question you linked has a better answer. – Tim S. Nov 25 '13 at 15:32
  • FYI, Your expression works for me using Javascript's Regex engine: http://www.rexfiddle.net/slfowcR – qJake Nov 25 '13 at 15:33
  • Just an idea - why do you actually have to do this using a regular expression? Wouldn't it be easier to write a simple state machine to parse the string? You would get around all that nasty backtracking that makes this exponentially expensive for cases like the one you've supplied. – Luaan Nov 25 '13 at 16:01
  • @Luaan the point is regexes are ugly, but quite easy to maintain if you have many of them. Creating several state machines to handle many cases would be tiresome. Maybe it'd be worth doing, when the grammar of my input will stabilize and I have time. – Piotr Zierhoffer Nov 25 '13 at 16:06
  • @TimS. I've tried to avoid using ?: because I don't understand it very well and wanted to create my own solution. Apparently, not a very good idea :) Actually, your comment with this link makes a good answer, thank you. – Piotr Zierhoffer Nov 25 '13 at 16:08

1 Answers1

0

It is indeed (cf Tim S.) a catastrophic backtracking with this part of your pattern: (\\[^""])*)* which allows all that is possible in the world and cause the regex engine to try too many possibilities. (Better illustrations can be found if you follow Tim S. link)

An other pattern to do this:

var reg = new Regex(@"(?s)""(?>[^\\""]+|\\{2}|\\.)*""");

(The idea is to match all even number of backslashes (second part of the alternation) before to be sure to allow real escaped characters (third part of the alternation).)

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • I don't think there is anything about even number of backslashes in my regex, it just has to catch everything between unescaped quotes. It's a bit excessive, to be sure. As Tim S. provided only a comment, I'll just accept your answer, which is correct as well :) Thanks – Piotr Zierhoffer Nov 25 '13 at 18:19
  • @PiotrZierhoffer: note that you must add capturing parenthesis to extract the content you want (or use a couple lookbehind/lookahead to enclose double quotes). About the even number of backslashes, consider a string like this: `abc "def" "g\"hi" "jkl\\" "mn\\\"o"`. The result must be: `def`, `g"hi`, `jkl\\`, `mn\"o`. – Casimir et Hippolyte Nov 25 '13 at 18:28