158

How do I get the substring " It's big \"problem " using a regular expression?

s = ' function(){  return " It\'s big \"problem  ";  }';     
David
  • 1
    How do you find "It's" in a string that only contains "Is"? I'd fix it for you, but I don't know which single-quote/escape conventions apply in the language you're using. – Jonathan Leffler Nov 01 '08 at 15:36
  • 3
    Duplicate of: [PHP: Regex to ignore escaped quotes within quotes](http://stackoverflow.com/q/5695240) – ridgerunner Oct 08 '11 at 14:03
  • 3
    Actually, looking at the dates, I see that the other question is a duplicate of this one. Either way, be sure to check out [my answer](http://stackoverflow.com/questions/5695240/php-regex-to-ignore-escaped-quotes-within-quotes/5696141#5696141). – ridgerunner Oct 08 '11 at 14:20
  • 1
    @ridgerunner: I'm voting to close this as you suggested. It's true the other question is more recent, but it's also much better (thanks mostly to your answer). – Alan Moore Jul 16 '14 at 22:55

17 Answers

206
/"(?:[^"\\]|\\.)*"/

Works in The Regex Coach and PCRE Workbench.

Example of test in JavaScript:

    var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
    var m = s.match(/"(?:[^"\\]|\\.)*"/);
    if (m != null)
        alert(m);
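
As a rough sketch of how the same pattern can be reused (the sample text and variable names here are made up for illustration and are not part of the original answer), adding the `g` flag and calling `matchAll` lists every quoted string instead of only the first one:

```javascript
// Sample input; String.raw keeps the backslashes exactly as typed.
const sample = String.raw`f("Test1", "Te\"st2", 'x')`;

// Same pattern as above, plus the global flag so every match is visited.
const quoted = /"(?:[^"\\]|\\.)*"/g;

// matchAll yields one match object per double-quoted string, in order.
const found = Array.from(sample.matchAll(quoted), (m) => m[0]);

console.log(found);    // every double-quoted string in the sample
console.log(found[1]); // the second occurrence, if there is one
```

Indexing into the resulting array is one simple way to pick the second (or nth) quoted string.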
Philip Kirkbride
PhiLho
  • 35
    Makes sense. Plain English: two quotes surrounding zero or more of "any character that's not a quote or a backslash" or "a backslash followed by any character". I can't believe I didn't think to do that... – Ajedi32 Jan 03 '14 at 22:17
  • 7
    I'll answer myself. =) `(?:...)` is a passive or non-capturing group. It means that it cannot be backreferenced later. – magras Oct 02 '14 at 16:27
  • After a lot of searching and testing, this is the only real solution I found to this common problem. Thanks! – cancerbero Mar 16 '15 at 20:31
  • 13
    thanks for this. i wanted to match single quotes as well so i ended up adapting it to this: `/(["'])(?:[^\1\\]|\\.)*?\1/` – leo May 03 '15 at 02:47
  • With [`var s = ' my \\"new\\" string and \"this should be matched\"';`](https://jsfiddle.net/kmyxv9hj/), this approach will lead to unexpected results. – Wiktor Stribiżew Jul 25 '16 at 12:38
  • 1
    @WiktorStribiżew Your string doesn't conform to the description: a string including a part between double quotes, that can contain escaped double quotes. Not sure what you expect... – PhiLho Jul 26 '16 at 11:12
  • For those that are interested, placing `"\\."` first yields better performance. I assume it's because doing this first makes the extra lookup for backslash in `"[^"\\]"` redundant. Looking at the other answers such as Darrell's below gives more performant regex (and that's the one included in many Linux distros according to the answer.) So for performance go with `\"(\\.|[^\"])*\"`. Timing it in Python 3.7 gave 1.375 millis vs 1.55 millis. – Jawad Feb 08 '19 at 11:33
  • @nr5 I don't know Swift. Perhaps you need to double *all* backslashes, if it doesn't have special syntax for regexes. We usually have to do this in C, Java, and so on, because REs are just strings there. (Assuming you're talking about a syntax error, not a runtime error; it isn't clear, since you don't even give the error message...) – PhiLho Sep 14 '19 at 09:02
  • Translation: Match quote, match single character except quote or backslash OR match 2 characters if the first is a backslash, match previous group zero or more times, match quote. – Ray Foss May 12 '20 at 18:39
  • Hey! What if i want to match the second occurence of a string with quotes? Like, "Test1" "Test2", match "Test2" only. – Raul Chiarella Sep 15 '22 at 12:39
  • It doesn't seem like it handles the following string: '"\\"'.match(/"(?:[^"\\]|\\.)*"/); null – Sam Goto Mar 29 '23 at 03:54
40

This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings:

\"(\\.|[^\"])*\"
  • With [`var s = ' my \\"new\\" string and \"this should be matched\"';`](https://jsfiddle.net/kmyxv9hj/1/), this approach will lead to unexpected results. – Wiktor Stribiżew Jul 25 '16 at 12:38
  • 1
    c.nanorc was the first place I went. Couldn't get it to work as part of a C string literal until double-escaping everything like this `" \"(\\\\.|[^\\\"])*\" "` – hellork Nov 28 '18 at 09:57
  • This works with egrep and re_comp/re_exec functions from libc. – Kirill Frolov Jan 14 '19 at 10:43
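
A small sketch, in JavaScript, of how the nanorc-style pattern behaves next to a stricter variant whose character class also excludes the backslash. The sample inputs and variable names are assumptions made for illustration. Because `[^"]` can itself match a backslash, the engine may backtrack and treat a backslash as an ordinary character, so this pattern accepts input that the stricter form rejects:

```javascript
const nanorcStyle = /"(\\.|[^"])*"/;     // the nanorc.sample pattern
const strict = /"(?:[^"\\]|\\.)*"/;      // backslash excluded from the class

const wellFormed = String.raw`x = "a \"b\" c";`;
const dangling = String.raw`"\"`;        // quote, backslash, quote: never closed

// Both patterns agree on a well-formed literal.
const nanorcWellFormed = wellFormed.match(nanorcStyle)[0];
const strictWellFormed = wellFormed.match(strict)[0];

// They differ when the only "closing" quote is escaped.
const nanorcDangling = dangling.match(nanorcStyle); // matches after backtracking
const strictDangling = dangling.match(strict);      // null

console.log(nanorcWellFormed, strictWellFormed);
console.log(nanorcDangling, strictDangling);
```

Which behaviour is preferable depends on the use case: a syntax highlighter may want the lenient match, while a parser probably wants the rejection.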
23

As provided by ePharaoh, the answer is

/"([^"\\]*(\\.[^"\\]*)*)"/

To have the above apply to either single quoted or double quoted strings, use

/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
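
A minimal sketch of the "unrolled" structure in JavaScript: runs of ordinary characters are consumed in one step by `[^"\\]*`, and an escape pair `\\.` only appears between runs. The generated test string (99 escaped quotes) and the variable names are assumptions chosen for illustration:

```javascript
// Unrolled form: normal* (escape normal*)*
const unrolled = /"([^"\\]*(\\.[^"\\]*)*)"|'([^'\\]*(\\.[^'\\]*)*)'/;

// Build a long double-quoted string containing 99 escaped quotes.
const body = Array.from({ length: 99 }, (_, i) => `chunk ${i} \\" `).join("");
const longInput = `prefix "${body}" suffix`;

const hit = longInput.match(unrolled);

// Group 1 holds the content between the double quotes.
console.log(hit[1] === body);
console.log(hit[0].length);
```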
Guy Bedford
  • 4
  • 2
    This is the only set that worked for me with a single, large 1.5 KB quoted string containing 99 escapes. Every other expression on this page broke in my text editor with an overflow error. Though most here work in the browser, just something to keep in mind. Fiddle: https://jsfiddle.net/aow20y0L/ – Beejor Jun 04 '15 at 03:00
  • 3
    See @MarcAndrePoulin's answer below for explanation. – shaunc Aug 07 '15 at 21:00
11
/(["\']).*?(?<!\\)(\\\\)*\1/is

should work with any quoted string
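
A hedged sketch of this pattern in JavaScript, assuming an engine that supports lookbehind and the `s` flag (ES2018 or later). The sample strings and variable names are made up for illustration:

```javascript
const anyQuote = /(["']).*?(?<!\\)(\\\\)*\1/is;

// An escaped quote inside a double-quoted string.
const escapedQuote = String.raw`say "a \"b\" c" end`;

// A single-quoted string that ends with an escaped backslash.
const trailingBackslashes = String.raw`path = 'C:\dir\\' + name`;

const firstHit = escapedQuote.match(anyQuote)[0];
const secondHit = trailingBackslashes.match(anyQuote)[0];

console.log(firstHit);
console.log(secondHit);
```

The `(\\\\)*` part is what lets the closing quote follow an even run of backslashes, which is the case a bare `(?<!\\)` would miss.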

  • 1
    Nice, but too flexible for the request (will match single quotes...). And can be simplified to /".*?(?<!\\)"/ unless I miss something. Oh, and some languages (e.g. JavaScript) alas don't understand negative lookbehind expressions. – PhiLho Oct 30 '08 at 12:47
  • 2
    @PhiLho, just using a single (?<!\\\) would fail on escaped backslashes at the end of the string. True about look-behinds in JavaScript though. – Markus Jarderot Nov 01 '08 at 08:57
  • @PhiLho Your simplification with this input: `"Martha's"` would result in this match: `"Martha'`, which is incorrect. The matching group, to determine which type of quote is being used to open it, is important. – Swivel May 07 '21 at 19:19
  • @Swivel Note 1: there is a double backslash in my answer, somehow SO lost the second one (because of Markdown?). Should have protected in backticks. Note 2: Markus is right… So it is flawed. Unlike my (popular) answer… :-) Note 3: there are no single quotes in my expression, I don't see the problem you mention, and I can't reproduce it. (I say I don't handle single quotes as delimiters, as it wasn't the topic.) – PhiLho Jun 04 '21 at 14:13
  • 1
    @PhiLho Huh... weird. Not sure how I misinterpreted it the first time around. You're absolutely correct. I'm not sure how I mistook your original comment. – Swivel Aug 04 '21 at 16:12
  • This was great, thank you! I needed to match multiline strings so I added this: ```(["\'])(.|\r?\n)*?(?<!\\)(\\\\)*\1 ``` – Scott Jodoin Oct 23 '22 at 18:49
11

Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.

You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.

Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993

Something like this: "(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford, will reduce the number of parsing steps, avoiding most stack overflows.

9
"(?:\\"|.)*?"

Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
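
The same pattern also runs in JavaScript. Here is a small sketch with the `g` flag added, where the sample input and names are assumptions for illustration:

```javascript
const lazyQuoted = /"(?:\\"|.)*?"/g;

const src = String.raw`a("one \"1\"", "two")`;

// Each match stops at the first quote that is not consumed as part of \".
const parts = src.match(lazyQuoted);

console.log(parts);
```

Note that this sketch only treats `\"` specially, so a string whose content ends in an escaped backslash (for example `"a\\"` followed later by another quote) can still be matched too far.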

8
/"(?:[^"\\]++|\\.)*+"/

Taken straight from man perlre on a Linux system with Perl 5.22.0 installed. As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.

ack
5

This one works perfectly on PCRE and does not fail with a stack overflow.

"(.*?[^\\])??((\\\\)+)?+"

Explanation:

  1. Every quoted string starts with the character ".
  2. It may contain any number of any characters: .*? (lazy match), ending with a non-escape character [^\\].
  3. Statement (2) is lazily(!) optional, because the string can be empty (""). So: (.*?[^\\])??
  4. Finally, every quoted string ends with the character ", but it can be preceded by any number of escape-character pairs (\\\\)+; this part is possessively(!) optional: ((\\\\)+)?+ (greedy, with no backtracking), because the string can be empty or have no trailing pairs.
Vadim Sayfi
3

An option that has not been touched on before is:

  1. Reverse the string.
  2. Perform the matching on the reversed string.
  3. Re-reverse the matched strings.

This has the added bonus of being able to correctly match escaped open tags.

Let's say you had the following string: `String \"this "should" NOT match\" and "this \"should\" match"`. Here, `\"this "should" NOT match\"` should not be matched and `"should"` should be. On top of that, `this \"should\" match` should be matched and `\"should\"` should not.

First an example.

// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';

// The RegExp.
const regExp = new RegExp(
    // Match close
    '([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
    '((?:' +
        // Match escaped close quote
        '(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
        // Match everything that's not the close quote
        '(?:(?!\\1).)' +
    '){0,})' +
    // Match open
    '(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
    'g'
);

// Find the matches, then restore their original orientation and order.
const matches = myString
    // Reverse the string.
    .split('').reverse().join('')
    // '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'

    // Match the quoted
    .match(regExp)
    // ['"hctam "\dluohs"\ siht"', '"dluohs"']

    // Reverse the matches
    .map(x => x.split('').reverse().join(''))
    // ['"this \"should\" match"', '"should"']

    // Re order the matches
    .reverse();
    // ['"should"', '"this \"should\" match"']

Okay, now to explain the RegExp. It can easily be broken into three pieces, as follows:

# Part 1
(['"])         # Match a closing quotation mark " or '
(?!            # As long as it's not followed by
  (?:[\\]{2})* # A pair of escape characters
  [\\]         # and a single escape
  (?![\\])     # As long as that's not followed by an escape
)
# Part 2
((?:          # Match inside the quotes
(?:           # Match option 1:
  \1          # Match the closing quote
  (?=         # As long as it's followed by
    (?:\\\\)* # Any number of pairs of escape characters
    \\        # and a single escape
    (?![\\])  # As long as that's not followed by an escape
  )
)|            # OR
(?:           # Match option 2:
  (?!\1).     # Any character that isn't the closing quote
)
)*)           # Match the group 0 or more times
# Part 3
(\1)           # Match an open quotation mark that is the same as the closing one
(?!            # As long as it's not followed by
  (?:[\\]{2})* # A pair of escape characters
  [\\]         # and a single escape
  (?![\\])     # As long as that's not followed by an escape
)

This is probably a lot clearer in image form: generated using Jex's Regulex

Image on github (JavaScript Regular Expression Visualizer.) Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.

Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js

scagood
2

Here is one that works with both " and ', and you can easily add others at the start.

("|')(?:\\\1|[^\1])*?\1

It uses the backreference (\1) to match exactly what is in the first group (" or ').

http://www.regular-expressions.info/backref.html

  • 2
    this is a very good solution, but `[^\1]` should be replaced with `.` because there is no such thing as an anti-back-reference, and it doesn't matter anyways. the first condition will always match before anything bad could happen. – Seph Reed Nov 02 '17 at 06:15
  • **@SephReed** – replacing `[^\1]` with `.` would effectively change this regex to `("|').*?\1` and then it would match `"foo\"` in `"foo \" bar"`. That said, getting `[^\1]` to actually work is hard. **@mathiashansen** – You're better off with the unwieldy and expensive `(?!\1).`, so the whole regex, with some efficiency cleanup, would be `(["'])(?:\\.|(?!\1).)*+\1`. The `+` is optional if your engine doesn't support it. – Adam Katz Jan 08 '19 at 21:31
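
A minimal JavaScript sketch of the `(?!\1).` variant discussed in the comments above. JavaScript has no possessive quantifiers, so the `*+` is written as a plain `*`. The sample text and names are assumptions for illustration:

```javascript
// (?!\1). means "any character that is not the opening quote".
const eitherQuote = /(["'])(?:\\.|(?!\1).)*\1/g;

const text = String.raw`say "x \" y" and 'it\'s "ok"'`;

// One entry per quoted literal, whichever quote character opened it.
const literals = text.match(eitherQuote);

console.log(literals);
```

The double quotes inside the single-quoted literal are left alone because the lookahead only blocks the quote character captured in group 1.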
0

If it is searched from the beginning, maybe this can work?

\"((\\\")|[^\\])*\"
cxw
0

A more extensive version of https://stackoverflow.com/a/10786066/1794894

/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^”\\]*)*”/

This version also adds:

  1. A minimum quote length of 50
  2. An extra type of quotes (opening “ and closing ”)
Rvanlaak
0

One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
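
As an illustration of the cursor approach, here is a minimal sketch in JavaScript. The function name `firstQuoted` is hypothetical, and the sketch assumes double quotes as delimiters and backslash as the only escape character:

```javascript
// Return the first double-quoted substring (quotes included), or null.
function firstQuoted(text) {
  const start = text.indexOf('"');
  if (start === -1) return null;

  for (let i = start + 1; i < text.length; i++) {
    if (text[i] === "\\") {
      i++;            // skip the escaped character, whatever it is
      continue;
    }
    if (text[i] === '"') {
      return text.slice(start, i + 1); // unescaped closing quote found
    }
  }
  return null;        // opening quote was never closed
}

const s = String.raw` function(){  return " It's big \"problem  ";  }`;
console.log(firstQuoted(s));
```

Unlike a regex search, this sketch does not retry from a later quote if the first one is never closed; that would need an outer loop.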

Henrik Paul
  • 4
    True enough, but this problem is well within the capabilities of regexes, and there are a great many implementations of those. – Alan Moore Oct 30 '08 at 16:45
0

I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.

I ended up with a two-step solution that beats any convoluted regex you can come up with:

 line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
 line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful

Easier to read and probably more efficient.
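
For comparison, a rough JavaScript transliteration of the same two steps. The function name `maskQuoted` is hypothetical. JavaScript's `replace` with a string pattern only replaces the first occurrence, so global regexes are used for both steps:

```javascript
// Replace every double-quoted string in a line with the placeholder "x".
function maskQuoted(line) {
  // Step 1: turn escaped quotes into something easier to handle.
  const withoutEscapes = line.replace(/\\"/g, "'");
  // Step 2: with no escaped quotes left, a simple pattern is enough.
  return withoutEscapes.replace(/"[^"]*"/g, '"x"');
}

const javaLine = String.raw`a = "he said \"hi\""; b = "c";`;
console.log(maskQuoted(javaLine));
```

This sketch, like the two-step approach it mirrors, assumes a backslash before a quote always escapes that quote; a string ending in an escaped backslash (`"a\\"`) would be handled incorrectly.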

Bigger
0

If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you copy-paste it inside the double quotes it will automatically be escaped into a regex-acceptable format.

Example in Java:

String s = "\"en_usa\":[^\\,\\}]+";

Now you can use this variable in your regexp or anywhere else.

Aramis NSR
0
(?<="|')(?:[^"\\]|\\.)*(?="|')

" It\'s big \"problem " match result: It\'s big \"problem

("|')(?:[^"\\]|\\.)*("|')

" It\'s big \"problem " match result: " It\'s big \"problem "

ShenRuijie
-1

Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)

"(([^"\\]?(\\\\)?)|(\\")+)+"
Petter Thowsen