43

I would like to find a regex that will pick out all commas that fall outside quote sets.

For example:

'foo' => 'bar',
'foofoo' => 'bar,bar'

This would pick out the single comma on line 1, after 'bar',

I don't really care about single vs double quotes.

Has anyone got any thoughts? I feel like this should be possible with lookaheads, but my regex fu is too weak.

Braiam
SocialCensus

6 Answers

97

This will match any string up to and including the first non-quoted ",". Is that what you are wanting?

/^([^"]|"[^"]*")*?(,)/

If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:

/(,)(?=(?:[^"]|"[^"]*")*$)/

which will match all of them. Thus

'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')

replaces all the commas not inside quotes with semicolons, and produces:

'test; a "comma,"; bob; ",sam,";here'

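If you need it to work across line breaks, anchor the lookahead to the end of the whole string rather than the end of a line. In Ruby, that means replacing $ with \z, since $ matches at the end of every line and the m flag only changes what . matches.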
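
For readers who want to try the lookahead outside Ruby, here is a minimal Python sketch of the same replacement. The function name `replace_unquoted_commas` is made up for illustration, and the sketch assumes only double quotes matter and that they are balanced. In Python's `re` module, `$` without `re.MULTILINE` matches only at the end of the string (or just before a trailing newline), so this version already spans line breaks.

```python
import re

# A comma that is followed, up to the end of the string, only by
# non-quote characters and complete "..." pairs.
UNQUOTED_COMMA = re.compile(r',(?=(?:[^"]|"[^"]*")*$)')


def replace_unquoted_commas(text, replacement=";"):
    """Replace every comma outside double quotes (hypothetical helper)."""
    return UNQUOTED_COMMA.sub(replacement, text)


print(replace_unquoted_commas('test, a "comma,", bob, ",sam,",here'))
# prints: test; a "comma,"; bob; ",sam,";here
```

Because the lookahead rescans the rest of the string for every comma, the cost grows roughly quadratically with line length, which is worth keeping in mind for very long rows.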

MarkusQ
  • This looks like it works properly - with double quotes. (,)(?=(?:[^"']|["|'][^"']*")*$) I believe works with single quote OR double quotes. Thanks! – SocialCensus Mar 11 '09 at 14:39
  • 1
    I wanted to point out that this does not work across line breaks. – SocialCensus Mar 11 '09 at 14:43
  • @SocialCensus Then use the m flag. Also, your example in the comment above has several bugs. For example, it takes double quotes, single quotes, and vertical bars as opening quotes but only takes double quotes as closing quotes. – MarkusQ Mar 11 '09 at 16:31
  • MarkusQ - You are quite correct, and I surrender my regex license. Yours works perfectly. Mine, not so much. – SocialCensus Mar 11 '09 at 19:26
  • @SocialCensus Don't surrender, fight harder! – MarkusQ Mar 11 '09 at 20:58
  • MarkusQ's doesn't work if there are an odd number of quotes in the string, according to RegexBuddy. – topwik Jul 11 '11 at 13:08
  • This is an awesome answer! Was able to use this to match xml node names and attribute names by just changing the comma to `\w+` : `(\w+)(?=(?:[^"]|"[^"]*")*$)` – Jason Nov 16 '11 at 19:50
  • Anybody want to be awesome and provide the equivalent regex that would be acceptable in vim? – Cory Klein Feb 14 '13 at 22:38
  • 1
    This works great UNTIL you have a single quote in between commas :-( – Chris Hayes Jan 14 '14 at 02:39
  • I found a solution that worked better for me by using a negative look-around. For long, messy rows it was better to check against the beginning of the row instead of the end: https://stackoverflow.com/a/21106122/4469870 – puslet88 Jul 27 '18 at 09:43
18

The regexes below match all commas that are outside double quotes:

,(?=(?:[^"]*"[^"]*")*[^"]*$)

Or (PCRE only):

"[^"]*"(*SKIP)(*F)|,

"[^"]*" matches a complete double-quoted block. For the input buz,"bar,foo", it matches "bar,foo" only. The (*SKIP)(*F) that follows forces that match to fail and tells the engine not to retry from any position inside the text it just consumed. Everywhere else, the engine falls through to the alternative after the | symbol, which is the bare comma. In this input, that alternative matches only the comma right after buz. The comma inside the double quotes is never matched, because the whole quoted block has already been skipped.
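
(*SKIP)(*F) is a PCRE feature, and Python's built-in `re` module does not support it. A commonly used substitute, sketched below as an assumption rather than as part of this answer, is to let the first alternative consume quoted blocks and to capture the comma in a group, then act only when that group took part in the match. The name `replace_commas_skipping_quotes` is hypothetical.

```python
import re

# Alternative 1 consumes a whole "..." block; alternative 2 captures a bare comma.
QUOTED_OR_COMMA = re.compile(r'"[^"]*"|(,)')


def replace_commas_skipping_quotes(text, replacement=";"):
    """Replace commas outside double quotes without any lookahead."""

    def swap(match):
        # Group 1 is set only when the bare-comma alternative matched.
        if match.group(1) is not None:
            return replacement
        return match.group(0)  # a quoted block: put it back unchanged

    return QUOTED_OR_COMMA.sub(swap, text)


print(replace_commas_skipping_quotes('buz,"bar,foo"'))
# prints: buz;"bar,foo"
```

Unlike the lookahead version, this walks the string once from left to right, so it does not rescan the tail of the line for every comma.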


The regex below matches all commas that are inside double quotes:

,(?!(?:[^"]*"[^"]*")*[^"]*$)
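
To see the two lookahead patterns from this answer side by side, the following Python sketch lists the positions each one matches. The names `PAIRS_TO_END`, `OUTSIDE`, `INSIDE`, and `comma_positions` are illustrative only, and the sample string assumes balanced double quotes.

```python
import re

# Zero or more complete quote pairs, then no further quotes up to the end.
PAIRS_TO_END = r'(?:[^"]*"[^"]*")*[^"]*$'

OUTSIDE = re.compile(r',(?=' + PAIRS_TO_END + r')')  # even number of quotes ahead
INSIDE = re.compile(r',(?!' + PAIRS_TO_END + r')')   # odd number of quotes ahead


def comma_positions(pattern, text):
    """Return the index of every comma matched by the given pattern."""
    return [m.start() for m in pattern.finditer(text)]


sample = 'buz,"bar,foo",qux'
print(comma_positions(OUTSIDE, sample))  # prints: [3, 13]
print(comma_positions(INSIDE, sample))   # prints: [8]
```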

Avinash Raj
3

While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.

This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:

function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0

  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}

splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
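
For comparison with the regex-based Python sketches elsewhere in this thread, here is a rough Python port of the same state-machine idea. It is an approximation, not a line-by-line translation: the name `split_not_strings` and the `quotes` parameter are hypothetical, and unlike the JavaScript version it always keeps the final field, even when that field is empty.

```python
def split_not_strings(text, quotes="'\""):
    """Split on commas that are not inside a single- or double-quoted string."""
    parts = []
    in_string = None   # the quote character that opened the current string
    escaped = False    # True after an odd number of consecutive backslashes
    start = 0
    for i, ch in enumerate(text):
        if ch == "\\":
            escaped = not escaped
            continue
        if ch == "," and not in_string:
            parts.append(text[start:i])
            start = i + 1
        elif ch in quotes and not escaped:
            if ch == in_string:
                in_string = None      # closing quote of the current string
            elif in_string is None:
                in_string = ch        # opening quote
        escaped = False
    if in_string:
        raise ValueError("expected matching " + in_string)
    parts.append(text[start:])        # assumption: keep the last field as-is
    return parts


print(split_not_strings("'foo' => 'bar','foofoo' => 'bar,bar'"))
# prints: ["'foo' => 'bar'", "'foofoo' => 'bar,bar'"]
```

An apostrophe inside a double-quoted string is ignored because only the quote character that opened the string can close it.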
Tommy
Touffy
1

MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:

        String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);

Here's my extremely inelegant replacement that doesn't stack overflow:

private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);                
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
sullivan-
1

@SocialCensus, the example you gave in your comment to MarkusQ, where you add ' alongside ", breaks on the test string MarkusQ gave right above it once sam is changed to sam's: (test, a "comma,", bob, ",sam's,",here) is not handled correctly by (,)(?=(?:[^"']|["|'][^"']*")*$).

In fact, the requirement itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear about what quoting with " or with ' means. For example, is nesting allowed? If so, to how many levels? If only one nested level is allowed, what happens to a comma outside the inner quotation but inside the outer one? You should also consider that single quotes occur on their own as apostrophes (as in the sam's counter-example above). Finally, your regex doesn't really treat single quotes on a par with double quotes, since it assumes the closing quotation mark is always a double quote. Replacing that last double quote with ['|"] also causes problems if the text isn't quoted correctly (or if apostrophes are used), though I suppose we could assume all quotes are correctly delimited.

MarkusQ's regexp answers the question: it finds every comma that is followed by an even number of double quotes (i.e., is outside double quotes) and disregards every comma that is followed by an odd number of double quotes (i.e., is inside double quotes). This is generally the solution you want, but let's look at two anomalies. First, if someone leaves off a quotation mark at the end, the regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off, since it might not be clear whether the missing one belongs at the end or at the beginning. Second, there is a legitimate case where the regex could conceivably fail. If you adjust the regexp to work across lines of text, be aware that quoting multiple consecutive paragraphs calls for an opening double quote at the start of each paragraph and a closing quote only at the end of the very last paragraph. Over the span of those paragraphs, the regex will therefore fail in some places and succeed in others.
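
The even/odd rule described above can be written out directly, which makes the missing-quote anomaly easy to reproduce. This is a small Python sketch under the assumption that only double quotes count; the name `unquoted_comma_positions` is made up for illustration.

```python
def unquoted_comma_positions(text):
    """Indexes of commas followed by an even number of double quotes."""
    # Note: counting the quotes after each comma makes this quadratic
    # in the length of the text; it is meant to illustrate the rule only.
    return [
        i
        for i, ch in enumerate(text)
        if ch == "," and text.count('"', i + 1) % 2 == 0
    ]


print(unquoted_comma_positions('a,"b,c",d'))  # prints: [1, 7]
print(unquoted_comma_positions('a,"b,c'))     # prints: [4]
```

In the second call the closing quote is missing, so the parity flips: the comma at index 1, which is really outside the quotes, is rejected, and the comma at index 4, which follows the opening quote, is reported instead.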

Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .

Jose_X
  • 5
    This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – mattt Dec 23 '14 at 21:27
  • I have to take another look at this problem, but I noticed that my "answer" was rather long. Would that fit as a comment? Also, my old answer appears to answer that there is no necessarily single correct answer because of ambiguities in the question (I gave examples). I probably thought that this response/critique went beyond a remark to the author and adds context to those looking for an answer. Was I even able to edit the question or would I have to rely on someone else?[need to look further into this issue you raise when I find time] – Jose_X Mar 05 '15 at 18:09
  • @mattt didn't mean to appear to disregard your request. I'm short on time right now. – Jose_X Mar 05 '15 at 18:35
  • That comment was automatically generated from the flagged comment moderation tools. Really, my only advice would be to be _way_ less verbose. Stack Overflow rewards clear, concise answers that get to the point. – mattt Mar 05 '15 at 22:22
1

Try this regular expression:

(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,

This also allows strings that contain escaped quotes and backslashes, such as “'foo\'bar' => 'bar\\',”.

Gumbo