45

I'm trying to find all of the quoted text on a single line.

Example:

"Some Text"
"Some more Text"
"Even more text about \"this text\""

I need to get:

  • "Some Text"
  • "Some more Text"
  • "Even more text about \"this text\""

\"[^\"\r]*\" gives me everything except for the last one, because of the escaped quotes.

I have read about \"[^\"\\]*(?:\\.[^\"\\]*)*\" working, but I get an error at run time:

parsing ""[^"\]*(?:\.[^"\]*)*"" - Unterminated [] set.

How do I fix this?

Alan Moore
Joshua Lowry

11 Answers

85

What you've got there is an example of Friedl's "unrolled loop" technique, but you seem to have some confusion about how to express it as a string literal. Here's how it should look to the regex compiler:

"[^"\\]*(?:\\.[^"\\]*)*"

The initial "[^"\\]* matches a quotation mark followed by zero or more of any characters other than quotation marks or backslashes. That part alone, along with the final ", will match a simple quoted string with no embedded escape sequences, like "this" or "".

If it does encounter a backslash, \\. consumes the backslash and whatever follows it, and [^"\\]* (again) consumes everything up to the next backslash or quotation mark. That part gets repeated as many times as necessary until an unescaped quotation mark turns up (or it reaches the end of the string and the match attempt fails).

Note that this will match "foo\"- in \"foo\"-"bar". That may seem to expose a flaw in the regex, but it doesn't; it's the input that's invalid. The goal was to match quoted strings, optionally containing backslash-escaped quotes, embedded in other text--why would there be escaped quotes outside of quoted strings? If you really need to support that, you have a much more complex problem, requiring a very different approach.

As I said, the above is how the regex should look to the regex compiler. But you're writing it in the form of a string literal, and those tend to treat certain characters specially--i.e., backslashes and quotation marks. Fortunately, C#'s verbatim strings save you the hassle of having to double-escape backslashes; you just have to escape each quotation mark with another quotation mark:

Regex r = new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

So the rule is double quotation marks for the C# compiler and double backslashes for the regex compiler--nice and easy. This particular regex may look a little awkward, with the three quotation marks at either end, but consider the alternative:

Regex r = new Regex("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

In Java, you always have to write them that way. :-(
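To see the verbatim-string version at work, here is a minimal sketch that runs it against the three sample strings from the question, joined on one line. The class and method names (`QuotedStringDemo`, `FindQuoted`) are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class QuotedStringDemo
{
    // Verbatim string: "" is one quotation mark for the C# compiler,
    // \\ is one literal backslash for the regex compiler.
    private static readonly Regex Quoted =
        new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

    // Returns every quoted string found in the input, quotes included.
    public static List<string> FindQuoted(string input)
    {
        var results = new List<string>();
        foreach (Match m in Quoted.Matches(input))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // The three sample strings from the question, on a single line.
        string line = @"""Some Text"" ""Some more Text"" ""Even more text about \""this text\""""";
        foreach (string s in FindQuoted(line))
            Console.WriteLine(s);
    }
}
```

Run as written, this should print the three quoted strings, one per line, with the escaped quotes in the last one left intact.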

Alan Moore
  • I like this explanation best. – Joshua Lowry Jan 28 '10 at 15:36
  • Cruising through some of the answers that made you famous... Upvoting this one for making such a clear explanation out of the worst backslash soup! :) – zx81 May 06 '14 at 06:19
  • Is it possible to do the same thing with single quotes (')? – Kalpesh Rajai Feb 05 '16 at 07:43
  • @KalpeshRajai: Sure, just replace the double-quotes with single-quotes in my first regex. You don't even need to escape them (unless you're using a single-quoted string literal, which C# doesn't support). – Alan Moore Feb 05 '16 at 12:31
  • @AlanMoore: Thank You – Kalpesh Rajai Feb 05 '16 at 13:22
  • I would upvote this 100 times if I could! Your explanation finally helped me accomplish what I was trying to do. Your verbatim regex works as advertised for traditional escape sequences, i.e., a backslash followed by a character (including a quotation mark). However, it does not handle the escape sequence for quotation marks used in verbatim strings, i.e., a double quotation mark. But again, your explanation helped me see the answer using alternation, e.g., `@"""[^""\\]*(?:(?:\\.|"""")[^""\\]*)*"""`. Thank you so much! – Matt Davis Jul 10 '21 at 20:51
12

Regex for capturing strings (with \ for character escaping), for the .NET engine:

(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))|(?(STR)"(?<-STR>)|"(?<STR>))|(?(STR).|(?!)))+   

Here, a "friendly" version:

(?>                            | specify nonbacktracking (atomic group)
   (?(STR)                     | if (STRING MODE) then
         (?(ESC)               |     if (ESCAPE MODE) then
               .(?<-ESC>)      |          match any char and exits escape mode (pop ESC)
               |               |     else
               \\(?<ESC>)      |          match '\' and enters escape mode (push ESC)
         )                     |     endif
         |                     | else
         (?!)                  |     do nothing (NOP)
   )                           | endif
   |                           | -- OR
   (?(STR)                     | if (STRING MODE) then
         "(?<-STR>)            |     match '"' and exits string mode (pop STR)
         |                     | else
         "(?<STR>)             |     match '"' and enters string mode (push STR)
   )                           | endif
   |                           | -- OR
   (?(STR)                     | if (STRING MODE) then
         .                     |     matches any character
         |                     | else
         (?!)                  |     do nothing (NOP)  
   )                           | endif
)+                             | REPEATS FOR EVERY CHARACTER

Based on the examples at http://tomkaminski.com/conditional-constructs-net-regular-expressions. It relies on quote balancing (.NET balancing groups). I use it with great success. Use it with the Singleline flag.
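As a rough illustration of how this pattern might be called, the sketch below compiles it with RegexOptions.Singleline and collects the matches. The class and method names (`BalancedQuoteDemo`, `FindQuoted`) and the sample input are assumptions made for the example; the pattern is the one above, split across three lines only for readability.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from this answer.
public static class BalancedQuoteDemo
{
    // STR and ESC act as stacks: a push enters a mode, a pop leaves it.
    private const string Pattern =
        @"(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))" +   // inside a string: handle backslash escapes
        @"|(?(STR)""(?<-STR>)|""(?<STR>))" +                 // a quote closes or opens a string
        @"|(?(STR).|(?!)))+";                                // any other character, only inside a string

    private static readonly Regex Quoted =
        new Regex(Pattern, RegexOptions.Singleline);

    // Returns each run of quoted text found in the input.
    public static List<string> FindQuoted(string input)
    {
        var results = new List<string>();
        foreach (Match m in Quoted.Matches(input))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // Sample input: say "a\"b" then "c"
        foreach (string s in FindQuoted(@"say ""a\""b"" then ""c"""))
            Console.WriteLine(s);
    }
}
```

Note that the pattern has no separator between strings, so two quoted strings with nothing between them (such as "a""b") come back as a single match.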

To play around with regexes, I recommend Rad Software Regular Expression Designer, which has a nice "Language Elements" tab with quick access to some basic instructions. It's based on .NET's regex engine.

Ricardo Nolde
4
"(\\"|\\\\|[^"\\])*"

should work. Match either an escaped quote, an escaped backslash, or any other character except a quote or backslash character. Repeat.

In C#:

StringCollection resultList = new StringCollection();
Regex regexObj = new Regex(@"""(\\""|\\\\|[^""\\])*""");
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Value);
    matchResult = matchResult.NextMatch();
} 

Edit: Added escaped backslash to the list to correctly handle "This is a test\\".

Explanation:

First match a quote character.

Then the alternatives are evaluated from left to right. The engine first tries to match an escaped quote. If that doesn't match, it tries an escaped backslash. That way, it can distinguish between "Hello \" string continues" and "String ends here \\".

If neither matches, then anything else is allowed except for a quote or backslash character. Then repeat.

Finally, match the closing quote.
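The two cases mentioned above can be checked with a small sketch like the following. The class and method names (`AlternationDemo`, `FirstMatch`) and the trailing text in the second input are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for the alternation-based pattern.
public static class AlternationDemo
{
    // "(\\"|\\\\|[^"\\])*" written as a C# verbatim string.
    private static readonly Regex Quoted =
        new Regex(@"""(\\""|\\\\|[^""\\])*""");

    // Returns the first quoted string, or an empty string if there is none.
    public static string FirstMatch(string input)
    {
        return Quoted.Match(input).Value;
    }

    public static void Main()
    {
        // Escaped quote: the string continues past \"
        Console.WriteLine(FirstMatch(@"""Hello \"" string continues"""));
        // Escaped backslash: the quote after \\ closes the string
        Console.WriteLine(FirstMatch(@"""String ends here \\"" trailing text"));
    }
}
```

The first call should return the whole input, while the second should stop at the quote that follows the escaped backslash and leave the trailing text out.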

Tim Pietzcker
  • Sorry for editing this post so much. But now I think I've got it elegant enough. And correct, too. I hope. – Tim Pietzcker Jan 27 '10 at 20:05
  • This regex does not work with this text: \"Some Text\" Some Text "Some Text", and "Some more Text" an""d "Even more text about \"this text\"" – Kamarey Jan 27 '10 at 20:31
  • This is excellent! I think part of the issue was that I was not using the @ which added more complexity with having to slash all over the place. – Joshua Lowry Jan 27 '10 at 20:38
  • Well, texts that are enclosed in escaped quotes weren't part of the question; neither was doubling as another way of escaping quotes. – Tim Pietzcker Jan 28 '10 at 07:10
  • Sorry Tim, but `"(\\"|\\\\|[^"])*"` is no good. Yes, it matches valid quoted strings very well, but it strays off into the land of [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html) when presented with a non-match string like: `"\\\\\\\\\\\\\\\\\\\\\\\ ` (The options within an alternation group should be mutually exclusive if you apply a `*` or `+` to it) This regex can match a backslash in more than one way. – ridgerunner Apr 09 '11 at 14:59
  • @ridgerunner: You're right, thanks. I have fixed the regex (by including the backslash in the negated character class). Now your "pathological string" fails in 85 instead of 750.000 steps. – Tim Pietzcker Apr 10 '11 at 09:42
  • Sorry again, but: `"(\\"|\\\\|[^"\\])*"` does not match: `"\n"` or `"\t"`. The pattern needed here is: `"([^"\\]|\\.)*"` which matches correctly (or better yet: `"([^"\\]++|\\.)*"` if the possessive quantifier is available). But Friedl's unrolled version of this expression is _much_ faster. See Alan's answer. Have you read [MRE3](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124 "Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl") yet? If not, I know you would enjoy it very much (if you are into regex - which I think you are). – ridgerunner Apr 10 '11 at 15:03
  • Still, I tend to say ([^"\\]|\\.)* is the best answer here. It's the most natural and fully working pattern, while Friedl's unrolled version is about the same, with optimization (but redundancy) – 131 Sep 24 '12 at 22:08
3

I recommend getting RegexBuddy. It lets you play around with it until you make sure everything in your test set matches.

As for your problem, I would try four \'s instead of two:

\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"
Peter Mortensen
Jason
  • One of RegexBuddy's selling points is that it can automatically convert the regex to source code in whatever language you specify. In this case, it converts the "raw" regex `"[^"\\]*(?:\\.[^"\\]*)*"` to `@"""[^""\\]*(?:\\.[^""\\]*)*"""`. – Alan Moore Jan 28 '10 at 01:43
2

Well, Alan Moore's answer is good, but I would modify it a bit to make it more compact. For the regex compiler:

"([^"\\]*(\\.)*)*"

Compare with Alan Moore's expression:

"[^"\\]*(\\.[^"\\]*)*"

The explanation is very similar to Alan Moore's one:

The first part " matches a quotation mark.

The second part [^"\\]* matches zero or more of any characters other than quotation marks or backslashes.

And the last part (\\.)* matches a backslash and whatever single character follows it. Note the *, which makes this group optional.

The parts described, along with the final " (i.e. "[^"\\]*(\\.)*"), will match: "Some Text" and "Even more Text\"", but will not match: "Even more text about \"this text\"".

To make that possible, the part [^"\\]*(\\.)* needs to be repeated as many times as necessary until an unescaped quotation mark turns up (or it reaches the end of the string and the match attempt fails). So I wrapped that part in parentheses and added an asterisk. Now it matches: "Some Text", "Even more Text\"", "Even more text about \"this text\"" and "Hello\\".

In C# code it will look like:

var r = new Regex("\"([^\"\\\\]*(\\\\.)*)*\"");

BTW, the order of the two main parts: [^"\\]* and (\\.)* does not matter. You can write:

"([^"\\]*(\\.)*)*"

or

"((\\.)*[^"\\]*)*"

The result will be the same.

Now we need to solve another problem: \"foo\"-"bar". The current expression will match "foo\"-", but we want it to match "bar". I don't know "why would there be escaped quotes outside of quoted strings", but we can implement it easily by adding the following part to the beginning: (\G|[^\\]). It says that we want the match to start at the point where the previous match ended or after any character except a backslash. Why do we need \G? It covers cases such as "a""b", where the second string starts exactly where the previous match ended.

Note that (\G|[^\\])"([^"\\]*(\\.)*)*" matches -"bar" in \"foo\"-"bar". So, to get only "bar", we need to specify the group and optionally give it a name, for example "MyGroup". Then C# code will look like:

[TestMethod]
public void RegExTest()
{
    //Regex compiler: (?:\G|[^\\])(?<MyGroup>"(?:[^"\\]*(?:\\.)*)*")
    string pattern = "(?:\\G|[^\\\\])(?<MyGroup>\"(?:[^\"\\\\]*(?:\\\\.)*)*\")";
    var r = new Regex(pattern, RegexOptions.IgnoreCase);

    //Human readable form:       "Some Text"  and  "Even more Text\""     "Even more text about  \"this text\""      "Hello\\"      \"foo\"  - "bar"  "a"   "b" c "d"
    string inputWithQuotedText = "\"Some Text\" and \"Even more Text\\\"\" \"Even more text about \\\"this text\\\"\" \"Hello\\\\\" \\\"foo\\\"-\"bar\" \"a\"\"b\"c\"d\"";
    var quotedList = new List<string>();
    for (Match m = r.Match(inputWithQuotedText); m.Success; m = m.NextMatch())
        quotedList.Add(m.Groups["MyGroup"].Value);

    Assert.AreEqual(8, quotedList.Count);
    Assert.AreEqual("\"Some Text\"", quotedList[0]);
    Assert.AreEqual("\"Even more Text\\\"\"", quotedList[1]);
    Assert.AreEqual("\"Even more text about \\\"this text\\\"\"", quotedList[2]);
    Assert.AreEqual("\"Hello\\\\\"", quotedList[3]);
    Assert.AreEqual("\"bar\"", quotedList[4]);
    Assert.AreEqual("\"a\"", quotedList[5]);
    Assert.AreEqual("\"b\"", quotedList[6]);
    Assert.AreEqual("\"d\"", quotedList[7]);
}
Alex
2

I know this isn't the cleanest method, but with your example I would check the character before the " to see if it's a \. If it is, I would ignore the quote.
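A minimal sketch of that idea, without any regex, might look like the following. The class and method names (`ManualQuoteScanner`, `Scan`) are hypothetical. It only looks at the single character before each quote, so it treats the quote in \\" as escaped too; that is a known limitation of this simple check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical hand-written scanner illustrating the "look at the
// previous character" approach.
public static class ManualQuoteScanner
{
    public static List<string> Scan(string input)
    {
        var results = new List<string>();
        int start = -1; // index of the opening quote, or -1 when outside a string

        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] != '"')
                continue;

            // Ignore a quote that directly follows a backslash.
            if (i > 0 && input[i - 1] == '\\')
                continue;

            if (start < 0)
            {
                start = i; // opening quote
            }
            else
            {
                // Closing quote: keep the text including both quotes.
                results.Add(input.Substring(start, i - start + 1));
                start = -1;
            }
        }
        return results;
    }

    public static void Main()
    {
        string line = @"""Some Text"" ""Some more Text"" ""Even more text about \""this text\""""";
        foreach (string s in Scan(line))
            Console.WriteLine(s);
    }
}
```

On the question's sample line this should yield the same three strings as the regex-based answers.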

Peter Mortensen
Krill
2

The regular expression

(?<!\\)".*?(?<!\\)"

will also handle text that starts with an escaped quote:

\"Some Text\" Some Text "Some Text", and "Some more Text" an""d "Even more text about \"this text\""
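For illustration, the lookbehind pattern can be run against that sample text roughly as follows. The class and method names (`LookbehindDemo`, `FindQuoted`) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class for the lookbehind-based pattern.
public static class LookbehindDemo
{
    // (?<!\\)".*?(?<!\\)" written as a C# verbatim string.
    private static readonly Regex Quoted =
        new Regex(@"(?<!\\)"".*?(?<!\\)""");

    public static List<string> FindQuoted(string input)
    {
        var results = new List<string>();
        foreach (Match m in Quoted.Matches(input))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        // The sample text from this answer.
        string text = @"\""Some Text\"" Some Text ""Some Text"", and ""Some more Text"" an""""d ""Even more text about \""this text\""""";
        foreach (string s in FindQuoted(text))
            Console.WriteLine(s);
    }
}
```

On this input the pattern should produce four matches: "Some Text", "Some more Text", the empty pair "" inside an""d, and the final string with its escaped quotes. Because the lookbehind only inspects one preceding character, a string that ends in an escaped backslash (\\") is not closed at that quote.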
Peter Mortensen
Kamarey
1

Similar to RegexBuddy posted by @Blankasaurus, RegexMagic helps too.

Peter Mortensen
Emre
0

A simple answer, without the use of ?, is

"([^\\"]*(\\")*)*\"

or, as a verbatim string

@"^""([^\\""]*(\\"")*(\\[^""])*)*"""

It just means:

  • find the first "
  • find any number of characters that are not \ or "
  • find any number of escaped quotes \"
  • find any number of escaped characters, that are not quotes
  • repeat the last three commands until you find "

I believe it works as well as @Alan Moore's answer, but, for me, it is easier to understand. It accepts unmatched ("unbalanced") quotes as well.

Piotr Zierhoffer
  • I can see that this answer is a bit buggy, for some reason. Please refer to http://stackoverflow.com/questions/20196740/regex-matching-doesnt-finish – Piotr Zierhoffer Nov 25 '13 at 15:27
0

Any chance you need to do: \"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"

Fried Hoeben
0

If you can define start and end, the following should work:

new Regex(@"^(""(.*)*"")$")
Babu James
  • 50