17

I want to build a simple regex that covers quoted strings, including any escaped quotes within them. For instance,

"This is valid"
"This is \" also \" valid"

Obviously, something like

"([^"]*)"

does not work, because it matches up to the first escaped quote.
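
To see the failure concretely, here is a minimal Java sketch. The class and method names (NaiveQuoteDemo, firstMatch) are made up for illustration; it simply runs the naive pattern against the second example string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveQuoteDemo {
    // The naive pattern: a quote, any run of non-quote characters, a quote.
    static final Pattern NAIVE = Pattern.compile("\"([^\"]*)\"");

    // Returns the first match of the naive pattern, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = NAIVE.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Java literal for the text: "This is \" also \" valid"
        String input = "\"This is \\\" also \\\" valid\"";
        // The match ends at the first escaped quote, not at the closing one.
        System.out.println(firstMatch(input)); // prints "This is \"
    }
}
```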

What is the correct version?

I suppose the answer would be the same for other escaped characters (by just replacing the respective character).

By the way, I am aware of the "catch-all" regex

"(.*?)"

but I try to avoid it whenever possible, because, not surprisingly, it runs somewhat slower than a more specific one.

arcain
PNS
  • possible duplicate of [How can I match double-quoted strings with escaped double-quote characters?](http://stackoverflow.com/questions/481282/how-can-i-match-double-quoted-strings-with-escaped-double-quote-characters) – Boann Jan 11 '14 at 12:36

6 Answers

21

Here is one that I've used in the past:

("[^"\\]*(?:\\.[^"\\]*)*")

This will capture quoted strings, along with any escaped quote characters, and exclude anything that doesn't appear in enclosing quotes.

For example, the pattern will capture "This is valid" and "This is \" also \" valid" from this string:

"This is valid" this won't be captured "This is \" also \" valid"

This pattern will not match the string "I don't \"have\" a closing quote, and will allow for additional escape codes in the string (e.g., it will match "hello world!\n").
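
As a rough illustration of the behaviour described above, the following Java sketch collects every match from an input string. QuotedStringFinder and findAll are hypothetical names, and the pattern appears in its Java string-literal form, so its backslashes and quotes are escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringFinder {
    // Regex ("[^"\\]*(?:\\.[^"\\]*)*") written as a Java string literal.
    static final Pattern QUOTED =
            Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");

    // Collects every quoted string found in the input, quotes included.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String mixed =
                "\"This is valid\" this won't be captured \"This is \\\" also \\\" valid\"";
        // Prints the two quoted strings, one per line.
        for (String s : findAll(mixed)) {
            System.out.println(s);
        }
        // There is no closing quote, so nothing is reported.
        String unclosed = "\"I don't \\\"have\\\" a closing quote";
        System.out.println(findAll(unclosed).isEmpty()); // prints true
    }
}
```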

Of course, you'll have to escape the pattern to use it in your code, like so:

"(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")"
arcain
  • your regex will not work on `\"This is \" not supposed to be \" valid"` – maraaaaaaaa Dec 28 '16 at 18:01
  • @maksymiuk Yes, you're correct. I expect the string to start with a quote, and I don't check to see if that first quote is escaped. If that's something you need to do, the pattern can be adjusted to account for this by using negative look-behind: `((?<!\\)"[^"\\]*(?:\\.[^"\\]*)*")` – arcain Jan 04 '17 at 00:31
  • 1
    username checks out – Dan Gravell May 02 '23 at 15:42
8

The problem with all the other answers is that they pass the obvious initial tests but fall short under further scrutiny. For example, all of the answers expect that the very first quote will not be escaped. More importantly, escaping is more complex than a single backslash, because the backslash itself can be escaped. Imagine trying to match a string that ends with a backslash. How would that be possible?

This would be the pattern you are looking for. It doesn't assume that the first quote is the working one, and it will allow for backslashes to be escaped.

(?<!\\)(?:\\{2})*"(?:(?<!\\)(?:\\{2})*\\"|[^"])+(?<!\\)(?:\\{2})*"
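
A small Java sketch of how this pattern treats the tricky cases. EscapeAwareMatcher and firstMatch are names invented for the example, and the pattern is the one above written as a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeAwareMatcher {
    // The pattern above, with every backslash and quote escaped for Java.
    static final Pattern QUOTED = Pattern.compile(
            "(?<!\\\\)(?:\\\\{2})*\"(?:(?<!\\\\)(?:\\\\{2})*\\\\\"|[^\"])+(?<!\\\\)(?:\\\\{2})*\"");

    // Returns the first match in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Text: "This is \" also \" valid"   (matched in full)
        System.out.println(firstMatch("\"This is \\\" also \\\" valid\""));
        // Text: "this\\"   (ends with an escaped backslash, matched in full)
        System.out.println(firstMatch("\"this\\\\\""));
        // Text: "this\"    (the last quote is escaped, so the string is unclosed)
        System.out.println(firstMatch("\"this\\\"")); // prints null
    }
}
```

Note that the + quantifier requires at least one character between the quotes, so the pattern as written does not match an empty string "".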

Explanation:

(?<!\\) No backslashes behind (to make sure we start matching from first one)

(?:\\{2})* Any number of doubled backslashes (they nullify each other)

" Quote char

(?: Open group

(?<!\\) No backslashes behind (to make sure we start matching from first one)

(?:\\{2})* Any number of doubled backslashes (they nullify each other)

\\" Escaped quote char (because these are allowed inside the quotes)

| Or

[^"] Anything other than a quote char

) Close group

+ 1 or more of what the group matched

(?<!\\) No backslashes behind (to make sure we start matching from first one)

(?:\\{2})* Any number of doubled backslashes (they nullify each other)

" Quote char

maraaaaaaaa
4

Try this one. The alternation tries \" first; if that matches, it is consumed as part of the string, otherwise any single character other than " is consumed.

"((?:\\"|[^"])*)"

Once you have matched the string, you'll need to take the first captured group's value and replace \" with ".
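
The capture-then-unescape step could look roughly like this in Java. QuoteUnescaper and unquote are hypothetical names, and only \" is unescaped; any other backslash sequence is left as it is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteUnescaper {
    // Regex "((?:\\"|[^"])*)" written as a Java string literal.
    static final Pattern QUOTED = Pattern.compile("\"((?:\\\\\"|[^\"])*)\"");

    // Returns the contents of the first quoted string with \" turned into ",
    // or null when the input contains no match.
    static String unquote(String input) {
        Matcher m = QUOTED.matcher(input);
        if (!m.find()) {
            return null;
        }
        // String.replace works on literal text, so no regex escaping is needed here.
        return m.group(1).replace("\\\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"This is \\\" also \\\" valid\""));
        // prints: This is " also " valid
    }
}
```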

Edit: Fixed grouping logic.

agent-j
  • That doesn't work. When I try it on the string `"Lorem \"ipsum\" tritani impedit civibus ei pri`, RegexBuddy tells me it takes 215 steps to (incorrectly) match `"Lorem \"ipsum\"`. Compare that to @arcain's solution, which takes only 15 steps to (correctly) report an unsuccessful match attempt. – Alan Moore Jun 29 '11 at 21:46
  • @Alan, that's really interesting that mine matches arguably invalid data. I am glad you shared that with me -- it's like getting a (good) code review on my regexes. Sometime I'm going to have to invest in that RegexBuddy tool. – agent-j Jun 29 '11 at 22:22
  • 2
    Yeah, it's very handy, but if you haven't invested in [MRE](http://www.oreilly.com/catalog/regex3/index.html) yet, do that first. – Alan Moore Jun 29 '11 at 22:52
  • Actually, the pattern I provided evolved from one in Mastering Regular Expressions. I think I've been using it (the pattern) for almost ten years now. – arcain Jun 30 '11 at 00:08
2

The code below uses String.matches to validate comma-separated lists of single-quoted strings, decimals, and integers.

public static void commaSeparatedStrings() {        
    String value = "'It\\'s my world', 'Hello World', 'What\\'s up', 'It\\'s just what I expected.'";

    // Each item is '...' with backslash escapes allowed; items are separated by "," or ", ".
    if (value.matches("'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'(?:,\\s?'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

/**
 * Validates a comma-separated list of signed decimals such as "-111.00, 22111.00".
 */
public static void commaSeparatedDecimals() {
    String value = "-111.00, 22111.00, -1.00";
    // "\\d+([,]|[,\\s]\\d+)*"
    if (value.matches(
            "^([-]?)\\d+\\.\\d{1,10}?(((,)|(,\\s))([-]?)\\d+\\.\\d{1,10}?)*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

/**
 * Validates a comma-separated list of signed integers such as "-11, 22, -31".
 */
public static void commaSeparatedNumbers() {
    String value = "-11, 22, -31";      
    if (value.matches("^([-]?)\\d+(((,)|(,\\s))([-]?)\\d+)*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}
Dinesh Lomte
2

This

("((?:[^"\\])*(?:\\\")*(?:\\\\)*)*")

will capture all strings (within double quotes), including \" and \\ escape sequences. (Note that this answer assumes that the only escape sequences in your string are \" or \\ sequences -- no other backslash characters or escape sequences will be captured.)

("(?:         # begin with a quote and capture...
  (?:[^"\\])* # any non-\, non-" characters
  (?:\\\")*   # any combined \" sequences
  (?:\\\\)*   # and any combined \\ sequences
  )*          # any number of times
")            # then, close the string with a quote


Also, note that maksymiuk's accepted answer contains an "edge case" ("Imagine trying to actually match a string which ends with a backslash") which is actually just a malformed string. Something like

"this\"

...is not a "string ending on a backslash", but an unclosed string ending on an escaped quotation mark. A string which truly ends on a backslash would look like

"this\\"

...and the above solution handles this case.


If you want to expand a bit, this...

(\\(?:b|t|n|f|r|\"|\\)|\\(?:(?:[0-2][0-9]{1,2}|3[0-6][0-9]|37[0-7]|[0-9]{1,2}))|\\(?:u(?:[0-9a-fA-F]{4})))

...captures all common escape sequences (including escaped quotes):

(\\                       # match the leading backslash (for each alternative)
  (?:b|t|n|f|r|\"|\\)     # capture common sequences like \n and \t

  |\\                     # OR (match the leading backslash and)...
  # capture variable-width octal escape sequences like \02, \13, or \377
  (?:(?:[0-2][0-9]{1,2}|3[0-6][0-9]|37[0-7]|[0-9]{1,2}))

  |\\                     # OR (match the leading backslash and)...
  (?:u(?:[0-9a-fA-F]{4})) # capture fixed-width Unicode sequences like \u0242 or \uFFAD
)

See this Gist for more information on the second point.
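
For a quick check of the escape-sequence pattern, here is a minimal Java sketch that lists every escape it finds in a piece of text. EscapeSequenceFinder and findAll are illustrative names only, and the pattern is the single-line version above split across three string literals for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeSequenceFinder {
    // One alternative per line: simple escapes, octal escapes, Unicode escapes.
    static final Pattern ESCAPE = Pattern.compile(
            "(\\\\(?:b|t|n|f|r|\\\"|\\\\)"
            + "|\\\\(?:(?:[0-2][0-9]{1,2}|3[0-6][0-9]|37[0-7]|[0-9]{1,2}))"
            + "|\\\\(?:u(?:[0-9a-fA-F]{4})))");

    // Returns every escape sequence in the input, in order of appearance.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The text holds a tab escape, a Unicode escape, and an escaped quote,
        // each spelled out as a literal backslash sequence.
        String text = "a\\tb\\u0242c\\\"d";
        // Prints the three sequences, one per line.
        for (String escape : findAll(text)) {
            System.out.println(escape);
        }
    }
}
```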

awwsmm
2

This works for me and is simpler than the current answers:

(?<!\\+)"(\\"|[^"])*(?<!\\+)"

(?<!\\+) - the quote must not be preceded by a backslash; this check is applied to both the opening and the closing quote.

(\\"|[^"])* - the content between the quotes: either an escaped quote \\" or any character other than a quote [^"]

This regex works correctly for the following strings:

234 - false or null

"234" - true or ["234"]

"" - true or [""]

"234 + 321 \\"24\\"" - true or ["234 + 321 \\"24\\""]

"234 + 321 \\"24\\"" + 123 + "\\"test(\\"235\\")\\"" - true

or ["234 + 321 \\"24\\"", "\\"test(\\"235\\")\\""]

"234 + 321 \\"24\\"" + 123 + "\\"test(\\"235\\")\\"\\" - true

or ["234 + 321 \\"24\\""]

Igor Zvyagin
  • Not only is `"234 + 321 \\"24\\""` not supposed to be true (the quotes are not escaped, only the backslashes are escaped), but also if you add any amount of pairs of backslashes before the first quote, it will not recognize the first quote. Backslashes in pairs represent an actual backslash which is escaped – maraaaaaaaa Sep 27 '21 at 16:04