1

I want to match everything except quoted strings.

I can match all quoted strings with this: /(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/ So I tried to match everything except quoted strings with this: /[^(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))]/ but it doesn't work.

I would like to use only a regex, because I want to replace the matches and still have the quoted text intact afterwards.

string.replace(regex, function(a, b, c) {
   // return after a lot of operations
});

For me, a quoted string is something like "bad string" or 'cool string'.

So if I input:

he\'re is "watever o\"k" efre 'dder\'4rdr'?

It should produce these matches:

["he\'re is ", " efre ", "?"]

And then I want to replace them.

I know my question is very difficult but it is not impossible! Nothing is impossible.

Thanks

noob
  • 2
    Can you offer examples? What's your definition of a "quoted string"? – Rob W Dec 04 '11 at 13:32
  • 3
    Obviously, a quoted string is marked by quotation marks. But you have still not included the expected behaviour. For any given input, what's the expected output? – Rob W Dec 04 '11 at 13:50

3 Answers

9

EDIT: Rewritten to cover more edge cases.

This can be done, but it's a bit complicated.

result = subject.match(/(?:(?=(?:(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*'(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*')*(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*$)(?=(?:(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*"(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*")*(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*$)(?:\\.|[^\\'"]))+/g);

will return

, he said. 
, she replied. 
, he reminded her. 
, 

from this string (line breaks added and enclosing quotes removed for clarity):

"Hello", he said. "What's up, \"doc\"?", she replied. 
'I need a 12" crash cymbal', he reminded her. 
"2\" by 4 inches", 'Back\"\'slashes \\ are OK!'

Explanation: (sort of, it's a bit mindboggling)

Breaking up the regex:

(?:
 (?=      # Assert even number of (relevant) single quotes, looking ahead:
  (?:
   (?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*
   '
   (?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*
   '
  )*
  (?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*
  $
 )
 (?=      # Assert even number of (relevant) double quotes, looking ahead:
  (?:
   (?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*
   "
   (?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*
   "
  )*
  (?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*
  $
 )
 (?:\\.|[^\\'"]) # Match text between quoted sections
)+

First, you can see that there are two similar parts. Both these lookahead assertions ensure that there is an even number of single/double quotes in the string ahead, disregarding escaped quotes and quotes of the opposite kind. I'll show it with the single quotes part:

(?=                   # Assert that the following can be matched:
 (?:                  # Match this group:
  (?:                 #  Match either:
   \\.                #  an escaped character
  |                   #  or
   "(?:\\.|[^"\\])*"  #  a double-quoted string
  |                   #  or
   [^\\'"]            #  any character except backslashes or quotes
  )*                  # any number of times.
  '                   # Then match a single quote
  (?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*'   # Repeat once to ensure even number,
                      # (but don't allow single quotes within nested double-quoted strings)
 )*                   # Repeat any number of times including zero
 (?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*      # Then match the same until...
 $                    # ... end of string.
)                     # End of lookahead assertion.

The double quotes part works the same.

Then, at each position in the string where these two assertions succeed, the next part of the regex actually tries to match something:

(?:      # Match either
 \\.     # an escaped character
|        # or
 [^\\'"] # any character except backslash, single or double quote
)        # End of non-capturing group

The whole thing is repeated once or more, as many times as possible. The /g modifier makes sure we get all matches in the string.
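The core idea of the lookahead can be seen more easily in a stripped-down form. The following is only a sketch under simplifying assumptions (double quotes only, no escape handling, balanced quotes); the names `outsideDoubleQuotes`, `sample` and `parts` are made up for illustration and are not part of the regex above:

```javascript
// Simplified sketch: double quotes only, no backslash escapes.
// A position counts as "outside" a quoted section when an even number
// of double quotes remains between it and the end of the string.
var outsideDoubleQuotes = /(?:(?=(?:[^"]*"[^"]*")*[^"]*$)[^"])+/g;

var sample = 'a "b" c';
var parts = sample.match(outsideDoubleQuotes);
console.log(parts); // the two unquoted chunks: "a " and " c"
```

The full regex applies the same parity check twice (once per quote type) and additionally skips escaped characters and quotes of the opposite kind.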

See it in action here on RegExr.

Tim Pietzcker
  • @TimPietzcker Your pattern doesn't match anything at this: http://regexr.com?2vcmc (no match, even if it should!). The pattern produces the wrong results for http://regexr.com?2vcmc9 (Wrong match). – Rob W Dec 04 '11 at 17:41
  • 1
    @RobW: Your example at http://regexr.com/?2vcmc is faulty since the quotes aren't balanced (there is only one). And for http://regexr.com/?2vcmc9, I get an error message from RegExr ("Pattern not found"). – Tim Pietzcker Dec 04 '11 at 18:44
  • The test strings at these sites follow this format: `blabla"` (A single quote is still a character). The broken URL contained `blabla"still not a quote\"`. – Rob W Dec 04 '11 at 18:48
  • 2
    All these strings are "illegal" - unbalanced quotes. What would be the correct result for `"lkjdf"lkjsdf"`? – Tim Pietzcker Dec 04 '11 at 18:59
  • 1
    What if there is a double quoted substring containing one single quote? e.g. `Jeff said: "It doesn't work!" to Tim`? – ridgerunner Dec 05 '11 at 06:35
  • 1
    @ridgerunner: Good point about enclosed opposite quotes. I have "improved" my regex further, so it covers all my test cases. Not really very maintainable code, though... – Tim Pietzcker Dec 05 '11 at 10:03
1

Here is a tested function that does the trick:

function getArrayOfNonQuotedSubstrings(text) {
    /*  Regex with three global alternatives to section the string:
          ('[^'\\]*(?:\\[\S\s][^'\\]*)*')  # $1: Single quoted string.
        | ("[^"\\]*(?:\\[\S\s][^"\\]*)*")  # $2: Double quoted string.
        | ([^'"\\]*(?:\\[\S\s][^'"\\]*)*)  # $3: Un-quoted string.
    */
    var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
    var a = [];                 // Empty array to receive the goods;
    text = text.replace(re,     // "Walk" the text chunk-by-chunk.
        function(m0, m1, m2, m3) {
            if (m3) a.push(m3); // Push non-quoted stuff into array.
            return m0;          // Return this chunk unchanged.
        });
    return a;
}

This solution uses the String.replace() method with a replacement callback function to "walk" the string section by section. The regex has three global alternatives, one for each kind of section: $1 for single-quoted, $2 for double-quoted, and $3 for non-quoted substrings. Each non-quoted chunk is pushed onto the return array. It correctly handles all escaped characters, including escaped quotes, both inside and outside quoted strings. Single-quoted substrings may contain any number of double quotes and vice versa. Illegal orphan quotes are dropped and serve to divide a non-quoted section into two chunks. Note that this solution needs no lookaround and only one pass over the string. It also implements Friedl's "Unrolling-the-Loop" technique, which keeps it efficient.

Additional: Here is some code to test the function with the original test string:

// The original test string (with necessary escapes):
var s = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
alert(s); // Show the test string without the extra backslashes.
console.log(getArrayOfNonQuotedSubstrings(s).toString());
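
Since the question ultimately asks to replace the unquoted parts, the same "walk" can return a transformed chunk instead of collecting it. This is a minimal sketch, not part of the tested function above: `replaceOutsideQuotes` and its `transform` parameter are hypothetical names, and upper-casing is just a stand-in for the real replacement logic.

```javascript
// Hypothetical helper: apply `transform` to unquoted chunks only and
// leave single- and double-quoted strings untouched.
function replaceOutsideQuotes(text, transform) {
    // Same three-alternative regex as in getArrayOfNonQuotedSubstrings.
    var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
    return text.replace(re, function(m0, m1, m2, m3) {
        // m3 is non-empty only when this chunk is unquoted text.
        return m3 ? transform(m3) : m0;
    });
}

// The original test string (with necessary escapes):
var input = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
var output = replaceOutsideQuotes(input, function(chunk) {
    return chunk.toUpperCase();
});
console.log(output); // HE\'RE IS "watever o\"k" EFRE 'dder\'4rdr'?
```

Unlike the array version, an orphan quote is not dropped here: String.replace() copies any text the regex skips over into the result unchanged.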
ridgerunner
  • It doesn't work for my example string: `he\'re is "watever o\"k" efre 'dder\'4rdr'?` Look at http://jsfiddle.net/Esn3m/ – noob Dec 05 '11 at 07:46
  • 1
    @micha - Try it again. My first post had the wrong regex (I've since fixed it). – ridgerunner Dec 05 '11 at 07:57
  • Sorry, but I can't see any difference. Could you update my fiddle to a working one? – noob Dec 05 '11 at 08:11
  • 1
    @micha - The jsfiddle page you linked to has an error (you forgot a few backslashes in your console.log test string). Here is the corrected statement: `console.log(getArrayOfNonQuotedSubstrings("he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?").toString());`. The function works correctly as advertised! – ridgerunner Dec 05 '11 at 08:41
  • 1
    @micha - Added test code to the answer to show proper escaping of the subject string. – ridgerunner Dec 05 '11 at 08:54
-4

You can't invert a regex like that. What you tried turns the pattern into a character class and negates that class, and even for that to work you would have to escape all the closing brackets ("\]").
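
To illustrate the character-class point with a small sketch (the patterns and variable names here are made up for the example, not taken from the question): inside [...] the parentheses and the | are ordinary characters, so a "negated group" is really just a negated set of single characters.

```javascript
// Inside a character class, "(", ")" and "|" lose their special meaning.
// Both patterns therefore describe the same thing:
// "one character that is none of ( ) a b | c d".
var looksLikeInversion = /[^(ab)|(cd)]/g;
var equivalentClass = /[^()ab|cd]/g;

var text = "ab-cd|(x)";
var first = text.match(looksLikeInversion);
var second = text.match(equivalentClass);
console.log(first, second); // both contain only "-" and "x"
```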

EDIT: I would have started with

/(^|" |' ).+?($| "| ')/

This matches anything between the beginning or the end of a quoted string (very simple: a quotation mark plus a blank) and the end of the string or the start of a quoted string (a blank plus a quotation mark). Of course this doesn't handle any escape sequences or quotations which don't follow the scheme / ['"].*['"] /. See above answers for more detailed expressions :-)

Bergi