1

Disclaimer: I'm not actually parsing a CSV, but a CSV-like format; I'm not interested in using a pre-built library.

What is the proper way to parse the following 2 lines?:

a,b"c"d,e
a,"bc"d,e

i.e., (a) with quotes in the middle of a value, and (b) with a quote at the start, but no end quote immediately before the next value.

I can't figure out how I should handle these cases (that would be the most intuitive).


My thoughts are that (a) should be parsed as a,b"c"d,e (quotes left in), and (b) should be parsed the same way, a,"bc"d,e. But then let me introduce a 3rd case, a,"b,c"d,e -- do we split on that 2nd comma between "b" and "c" or not?

mpen
  • 272,448
  • 266
  • 850
  • 1,236
  • Ultimately it depends on what exactly you allow between the separators - and how you want it to be interpreted. There's not a proper way, you have to define the rules. – moodywoody Mar 15 '12 at 05:16
  • @moodywoody: Well, how do other parsers handle these cases? As a user, if you were entering a comma-separated list of tags (similar to SO's tags but with a comma delimiter), what behaviour would you expect? – mpen Mar 15 '12 at 05:30
  • Me personally, when it comes to quotations and such, I don't expect anything - I ask for a definition (I know it's a bit of a cop out, but I like my problems to be "well defined"). Anyhow, you may find these two links dealing with perl and csv files interesting http://stackoverflow.com/questions/633607/how-can-i-parse-quoted-csv-in-perl-with-a-regex. http://www.en8848.com.cn/Reilly%20Books/perl3/cookbook/ch01_16.htm . – moodywoody Mar 15 '12 at 10:32
  • `"([^\"\\]*(?:\\.[^\"\\]*)*)"` -- that's a pretty sweet regex. I didn't think you could parse strings with a regex and do the escapes properly (have to make sure there's an odd number); I asked about it earlier to no avail. I'll have to dig into that a bit deeper. I've never used perl, but I think I get the jist of it. – mpen Mar 15 '12 at 15:29
  • If I'm reading that Perl solution correctly, `"bc"d` would be omitted completely, because there's no comma immediately after the end quote. I think I kind of like that... it's malformed, so it doesn't get included, but the whole CSV doesn't blow up. http://i.imgur.com/RFWoq.png – mpen Mar 15 '12 at 15:54
  • As said before, there are several "solutions" which depend on how you expect the output. If you have access to a copy of "Mastering regular expressions" by Friedl - there's a whole chapter about this class of problem (6.7. Unrolling the loop) – moodywoody Mar 15 '12 at 23:35
  • Now that I've thought about this a bit more... I think we can simplify the string matching a fair bit yet: `"(\\.|[^"\\])*"` – mpen Mar 16 '12 at 00:45
  • I understand you are not interested in a pre-built library-- hear me out-- check out https://github.com/liquidaty/zsv, which handles all these edge cases the same way Excel does and has built-in options to modify separators etc. The source code is there if you want to further modify to suit your purposes – mwag Jan 01 '22 at 21:31

2 Answers2

1

Here is how you would parse it if you want to be consistent with Excel:

input:

a,b"c"d,e
a,"bc"d,e
a,"b,c"d,e

parsed (in JSON):

[
  ["a", "b\"c\"d", "e"],
  ["a", "bcd", "e"],
  ["a","b,cd", "e"]
]

The parsing logic is:

  • if the first char of the row, or immediately after a comma, is a dbl-quote, then:
    • treat everything following it as "quoted" until you reach the closing quote (of course, treat two-quotes-in-a-row as an escaped dbl-quote). Remove the opening and the closing quote from the cell value
    • after the "end quote" has been reached, include anything after the end quote in the cell value as a literal value, until a comma or newline has been reached
  • otherwise, treat everything after the comma as a literal value, until a comma or newline has been reached

Note that this means that if you have a space after a cell-delimiting comma, followed by a dbl-quote, you get a different result than if you have no space after the comma (followed by a dbl-quote)

mwag
  • 3,557
  • 31
  • 38
0

here is the part of my method (C#) to spit csv to html table - it has such comma parsing:

string[] cells = line.Split(','); // empty cells needed as well
bool noComma = true;

for (int i = 0; i < cells.Length; i++)
{
    string cellI = cells[i];
    int numberOfDoubleQuote = cellI.Count(f => f == '"');
    // == 0 means comma in between quotes
    if (numberOfDoubleQuote == 1)
    {
        noComma = !noComma;
        if (!noComma)
            sLine += "<td>" + cellI;
        else
            sLine += "," + cellI + "</td>";
    }
    else
    {
        if (noComma)
            sLine += "<td>" + cellI + "</td>";
        else
            sLine += "," + cellI;
    }
}
mpen
  • 272,448
  • 266
  • 850
  • 1,236
  • So you essentially split on all commas and then "join" cells if they contain a single quote... but I don't think you've covered the cases very well. For example, how would it handle `a,b",c` ? If I'm reading this right, it would output `a`,`b",c` which seems wrong. – mpen Mar 15 '12 at 05:27