I'm writing a regular expression in JavaScript that is intended to capture string literals in JavaScript code, in all the variations the language allows. This is what I've come up with:

([\"\'])(.*?(?:(\\"|\\').*?\3.*?)*?)\1

Description: The regular expression captures the starting quotation mark (" or ') in capture group 1 and the quotation mark is repeated at the end (\1) of the expression to enclose the full string literal. Since the "body" of the string literal can contain substrings enclosed in escaped quotation marks (example: "ab\"cd\"ef") I allow for matched pairs of escaped single and double quotations to occur within the string literal text. Capture group 3 is used to match starting and ending escaped quotation marks. The content of the string literal will be in capture group 2 with the outer quotation marks removed (the mark used to enclose the string will be in capture group 1). Note that I use (?:..) to make one of the groups non-capturing.
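
As a minimal sketch of how the groups described above come out in practice, the snippet below runs the pattern against one of the test strings. The names `literalPattern` and `sample` are made up for illustration, and `String.raw` is used only so the backslashes reach the regex unchanged.

```javascript
// The pattern from the question, written as a regex literal.
// Inside a character class the quotes need no escaping.
const literalPattern = /(["'])(.*?(?:(\\"|\\').*?\3.*?)*?)\1/;

// String.raw keeps the backslashes, so the text is exactly: "ab\"cd\"ef"
const sample = String.raw`"ab\"cd\"ef"`;

const match = literalPattern.exec(sample);

console.log(match[1]); // the enclosing quotation mark: "
console.log(match[2]); // the body without the outer marks: ab\"cd\"ef
```

Note that group 3 only holds the last escaped quotation mark that was matched, so it is of little use after the match has finished.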

I've tested the expression on the strings below and it seems to be working:

"abcdefg"                  // Simple string literal using ".."
'abcdefg'                  // Simple string literal using '..'    
"a\"b\"c\"d\"e\'f\'g"      // Escaped matched singles and doubles
"a\"b\"\"c\"\'d\'\'e\'fg"  // Another variant
"\"ab\"\'cd\'ef\"\"\'\'g"  // Zero length escaped sequences
"a'b'cd'ef'g"              // Enclosed in doubles, singles in middle
'"ab"cd"e""f"g'            // Enclose in singles, doubles in middle

My question is whether there are any other variations allowed in JavaScript that I need to consider. Note that single quotation sequences enclosed within a double-quoted string literal ("ab'cde'fg") and double quotation sequences enclosed within a single-quoted string literal ('ab"cde"fg') do not need to be handled separately (I think), since the pattern matches the enclosing outer quotation marks. I would also appreciate feedback on any potential cross-browser issues: whether there are browsers that don't support regular expressions at all, or that don't support the features I use here (such as capturing groups or the non-capturing syntax).

Edit: I am attempting to capture escaped string literals embedded in a string literal. That makes this problem statement different from the one in regex-for-quoted-string-with-escaping-quotes.

instantMartin
  • Did you want something like [codereview.stackexchange.com](http://codereview.stackexchange.com)? – revo Mar 10 '15 at 08:43
  • possible duplicate of [Regex for quoted string with escaping quotes](http://stackoverflow.com/questions/249791/regex-for-quoted-string-with-escaping-quotes). The regex "([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\' looks a good-enough answer. – Wiktor Stribiżew Mar 10 '15 at 09:25
  • Thanks @revo for the tip on [link](http://codereview.stackexchange.com/). That is a better place for this type of question. I'll keep it in mind next time. – instantMartin Mar 10 '15 at 12:35
  • @stribizhev - I was originally looking for a solution with matched pairs of escaped sequences ("ab\"de\"fg", not "ab\"defg"), but thinking about it some more I've realized that just skipping any escaped character fits my current need. I'll need to think some more on it, but that solution will probably suffice. – instantMartin Mar 10 '15 at 12:40
  • I decided to keep the question. Although escaping will solve my immediate problem, I do want to be able to separate escaped string literals (enclosed in the same starting and ending escaped literal). This might be easier to achieve, though, using a two-step process where the outer (non-escaped) string literal is identified and then the literal is analyzed separately. – instantMartin Mar 11 '15 at 06:36

1 Answer

You accept the three-character sequence "\" as a string. The .*? parts are too inclusive; you also need to stop them from matching backslashes.

Maybe (['"])(?:(?!(?:\\|\1)).|\\.)*\1:
Match ' or " as the delimiter.
Then match any sequence of
- a character that is not a backslash, the delimiter, or a line terminator
or
- a backslash followed by any non-line-terminator character
Then match the delimiter again.
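
A rough sketch of how this behaves, compared with the pattern from the question. The helper `findStringLiterals` and the sample inputs are invented for illustration, and the `g` flag is added here only so that every literal in the input is reported.

```javascript
// The suggested pattern, with the g flag so all literals are found.
const stringLiteral = /(['"])(?:(?!(?:\\|\1)).|\\.)*\1/g;

// The pattern from the question, for comparison.
const questionPattern = /(["'])(.*?(?:(\\"|\\').*?\3.*?)*?)\1/;

// Hypothetical helper: returns every match in a piece of source text.
function findStringLiterals(source) {
  return source.match(stringLiteral) || [];
}

// String.raw keeps backslashes as written.
const lone = String.raw`"\"`;                    // quote, backslash, quote
const mixed = String.raw`x = "a\"b" + 'c\'d';`;  // two literals with escapes

console.log(questionPattern.test(lone));  // true: the unterminated text is accepted
console.log(findStringLiterals(lone));    // empty array: the escaped quote cannot close the string
console.log(findStringLiterals(mixed));   // two matches: "a\"b" and 'c\'d'
```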

You can still be thrown off by a delimiter occurring in a comment or RegExp literal, for example:

var m = /"/g.exec("a string"); // Matches a '"' char
//       ^^^^^^^^^^        ^^^^^^^^^^^^^^^^^^^ not strings!

so it's not perfect for finding all strings in a JavaScript source. For that you actually need to parse it.

lrn
  • Thanks @lrn. That was the type of solution I was looking for, using negative lookahead. I was originally looking to match escaped sequences like "ab\"cd\"ef", but not "ab\"dcef", but realized I don't really need this. For curiosity's sake, what could a RegExp literal look like that would throw it off? – instantMartin Mar 10 '15 at 12:50
  • this fails if \n is inside string literals – Dee Feb 11 '20 at 03:29
  • fails with .exec('foo "fffxxx\n" bar') or .exec('foo "fff\nxxx" bar') – Dee Feb 11 '20 at 03:31
  • how to allow escape sequence in string literals? – Dee Feb 11 '20 at 03:34
  • sorry, it works! it must be .exec('foo "fffxxx\\n" bar') instead. – Dee Feb 11 '20 at 04:59