Well, Alan Moore's answer is good, but I would modify it slightly to make it more compact. As the regex engine sees it (before any string-literal escaping):
"([^"\\]*(\\.)*)*"
Compare with Alan Moore's expression:
"[^"\\]*(\\.[^"\\]*)*"
The explanation is very similar to Alan Moore's:
The first part "
matches a quotation mark.
The second part [^"\\]*
matches zero or more of any characters other than quotation marks or backslashes.
And the last part (\\.)*
matches zero or more escape sequences, each a backslash plus whatever single character follows it. Note the *, which makes this group optional.
The parts described, along with the final "
(i.e. "[^"\\]*(\\.)*"
), will match: "Some Text" and "Even more Text\"", but will not match: "Even more text about \"this text\"".
To make that possible, the part [^"\\]*(\\.)*
needs to be repeated as many times as necessary until an unescaped quotation mark turns up (or the end of the string is reached and the match attempt fails). So I wrapped that part in parentheses and added an asterisk. Now it matches: "Some Text", "Even more Text\"", "Even more text about \"this text\"" and "Hello\\".
In C# code it will look like:
var r = new Regex("\"([^\"\\\\]*(\\\\.)*)*\"");
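To check the claims above, here is a minimal sketch that runs both forms of the expression against the four sample strings. The ^ and $ anchors, the variable names, and the use of top-level statements (C# 9 or later) are my additions for the demonstration, not part of the expression itself.

```csharp
using System;
using System.Text.RegularExpressions;

// Anchored with ^ and $ so that each sample has to match as a whole.
var single = new Regex("^\"[^\"\\\\]*(\\\\.)*\"$");       // ^"[^"\\]*(\\.)*"$
var repeated = new Regex("^\"([^\"\\\\]*(\\\\.)*)*\"$");  // ^"([^"\\]*(\\.)*)*"$

string[] samples =
{
    "\"Some Text\"",                               // "Some Text"
    "\"Even more Text\\\"\"",                      // "Even more Text\""
    "\"Even more text about \\\"this text\\\"\"",  // "Even more text about \"this text\""
    "\"Hello\\\\\"",                               // "Hello\\"
};

// Only the third sample needs the repeated group: text follows its first escape.
foreach (string s in samples)
    Console.WriteLine($"{s}: single={single.IsMatch(s)}, repeated={repeated.IsMatch(s)}");
```

The non-repeated form rejects only the third sample, because it allows escape sequences at the end of the string but not ordinary text after them.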
BTW, the order of the two main parts: [^"\\]*
and (\\.)*
does not matter. You can write:
"([^"\\]*(\\.)*)*"
or
"((\\.)*[^"\\]*)*"
The result will be the same.
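A quick way to convince yourself is to run both orderings over the same input and compare the matches. This is only a sketch: the sample input and the variable names are made up for the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var plainFirst = new Regex("\"([^\"\\\\]*(\\\\.)*)*\"");   // "([^"\\]*(\\.)*)*"
var escapeFirst = new Regex("\"((\\\\.)*[^\"\\\\]*)*\"");  // "((\\.)*[^"\\]*)*"

// Human readable form: "Some Text" "\"starts escaped" "ends escaped\"" "Hello\\"
string input = "\"Some Text\" \"\\\"starts escaped\" \"ends escaped\\\"\" \"Hello\\\\\"";

// Cast<Match>() keeps this working on .NET Framework as well as .NET Core.
string[] a = plainFirst.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
string[] b = escapeFirst.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(a.Length);            // 4
Console.WriteLine(a.SequenceEqual(b));  // True
```

Both groups describe the same thing, any mix of ordinary characters and escape sequences, so the closing quotation mark is found at the same place either way.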
Now we need to solve another problem: \"foo\"-"bar"
. The current expression will match "foo\"-"
, but we want it to match "bar"
. I don't know
why would there be escaped quotes outside of quoted strings
but we can handle it easily by adding the following part to the beginning: (\G|[^\\])
. It says that the match must start either at the point where the previous match ended or after any character other than a backslash. Why do we need \G
? Because [^\\] has to consume a character. A quoted string at the very start of the input has no preceding character, and one that immediately follows the previous match has its preceding character (the closing quotation mark) already consumed. \G covers both cases, for example: "a""b"
.
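To see what \G contributes, the sketch below compares the prefix [^\\] on its own with (?:\G|[^\\]) on the input "a""b". The group name q, the helper Quoted and the variable names are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

const string body = "(?<q>\"(?:[^\"\\\\]*(?:\\\\.)*)*\")";  // (?<q>"(?:[^"\\]*(?:\\.)*)*")
var withoutG = new Regex("[^\\\\]" + body);                  // [^\\] only
var withG = new Regex("(?:\\G|[^\\\\])" + body);             // \G or [^\\]

string input = "\"a\"\"b\"";  // "a""b"

// Collects the quoted part (group q) of every match.
string[] Quoted(Regex r, string s) =>
    r.Matches(s).Cast<Match>().Select(m => m.Groups["q"].Value).ToArray();

Console.WriteLine(string.Join(" ", Quoted(withoutG, input)));  // ""
Console.WriteLine(string.Join(" ", Quoted(withG, input)));     // "a" "b"
```

Without \G the first quotation mark cannot start a match, because nothing precedes it, so the engine picks up the a and reports the empty string between the two quoted values instead.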
Note that (\G|[^\\])"([^"\\]*(\\.)*)*"
matches -"bar"
in \"foo\"-"bar"
. So, to get only "bar"
, we need to capture the quoted part in a group and, optionally, give it a name, for example "MyGroup". The C# code then looks like:
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
//           using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestMethod]
public void RegExTest()
{
    // Regex compiler: (?:\G|[^\\])(?<MyGroup>"(?:[^"\\]*(?:\\.)*)*")
    string pattern = "(?:\\G|[^\\\\])(?<MyGroup>\"(?:[^\"\\\\]*(?:\\\\.)*)*\")";
    var r = new Regex(pattern);

    // Human readable form: "Some Text" and "Even more Text\"" "Even more text about \"this text\"" "Hello\\" \"foo\"-"bar" "a""b"c"d"
    string inputWithQuotedText = "\"Some Text\" and \"Even more Text\\\"\" \"Even more text about \\\"this text\\\"\" \"Hello\\\\\" \\\"foo\\\"-\"bar\" \"a\"\"b\"c\"d\"";

    // Collect only the named group, so the prefix character is left out.
    var quotedList = new List<string>();
    for (Match m = r.Match(inputWithQuotedText); m.Success; m = m.NextMatch())
        quotedList.Add(m.Groups["MyGroup"].Value);

    Assert.AreEqual(8, quotedList.Count);
    Assert.AreEqual("\"Some Text\"", quotedList[0]);
    Assert.AreEqual("\"Even more Text\\\"\"", quotedList[1]);
    Assert.AreEqual("\"Even more text about \\\"this text\\\"\"", quotedList[2]);
    Assert.AreEqual("\"Hello\\\\\"", quotedList[3]);
    Assert.AreEqual("\"bar\"", quotedList[4]);
    Assert.AreEqual("\"a\"", quotedList[5]);
    Assert.AreEqual("\"b\"", quotedList[6]);
    Assert.AreEqual("\"d\"", quotedList[7]);
}