4

What is the best way to parse a CSV file with multiline fields and escaped quotes?

For example, this CSV

First field of first row,"This field is multiline

but that's OK because it's enclosed in double qoutes, and this
is an escaped "" double qoute" but this one "" is not
   "This is second field of second row, but it is not multiline
because it doesn't start 
with an immediate double quote"

looks in Excel like this:

enter image description here

Do I just preserve the state somehow, have a flag saying if the field I'm reading now has started with a qoute, etc.? Also, what happens to cases where there are double quotes but not at the start of the field, i.e. , "ABC", or ,"item "" item" "" are those considered invalid? Also, are "" quotes escaped if not inside a quoted field? Excel doesn't seem to.

Are there any other corner cases I might have missed?

sashoalm
  • 75,001
  • 122
  • 434
  • 781

2 Answers2

4

Your sample data has a single double quote just after 'double qoute' (sic) which terminates the double quoted start of field, but you have to continue reading until the next comma or the end of line. This is a malformed multi-line field, but you're seeing what Excel does with it. You can find a description of this in the (excellent) book The Practice of Programming (dead linkInternet Archive copy; live link at Princeton University) which includes a CSV parsing library in C and a reimplementation in C++, and which discusses this level of detail. There's also a standard for CSV in RFC 4180 "Common Format and MIME Type for Comma-Separated Values" and you can also study Wikipedia on the subject.


In the other answer is some sample code, still under test. It is pretty resilient within its limitations. Here it is modified into an SSCCE (Short, Self-Contained, Correct Example).

#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

extern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);

// Returns a pointer to the start of the next field,
// or zero if this is the last field in the CSV
// p is the start position of the field
// sep is the separator used, i.e. comma or semicolon
// newline says whether the field ends with a newline or with a comma
const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)
{
    // Parse quoted sequences
    if ('"' == p[0]) {
        p++;
        while (1) {
            // Find next double-quote
            p = wcschr(p, L'"');
            // If we don't find it or it's the last symbol
            // then this is the last field
            if (!p || !p[1])
                return 0;
            // Check for "", it is an escaped double-quote
            if (p[1] != '"')
                break;
            // Skip the escaped double-quote
            p += 2;
        }
    }

    // Find next newline or comma.
    wchar_t newline_or_sep[4] = L"\n\r ";
    newline_or_sep[2] = sep;
    p = wcspbrk(p, newline_or_sep);

    // If no newline or separator, this is the last field.
    if (!p)
        return 0;

    // Check if we had newline.
    *newline = (p[0] == '\r' || p[0] == '\n');

    // Handle "\r\n", otherwise just increment
    if (p[0] == '\r' && p[1] == '\n')
        p += 2;
    else
        p++;

    return p;
}

static void dissect(const wchar_t *line)
{
    const wchar_t *start = line;
    const wchar_t *next;
    bool     eol;
    wprintf(L"Input: %d [%.*ls]\n", wcslen(line), wcslen(line)-1, line);
    while ((next = nextCsvField(start, L',', &eol)) != 0)
    {
        wprintf(L"Field: [%.*ls] (eol = %d)\n", (next - start - eol), start, eol);
        start = next;
    }
}

static const wchar_t multiline[] =
   L"First field of first row,\"This field is multiline\n"
    "\n"
    "but that's OK because it's enclosed in double quotes, and this\n"
    "is an escaped \"\" double quote\" but this one \"\" is not\n"
    "   \"This is second field of second row, but it is not multiline\n"
    "   because it doesn't start \n"
    "   with an immediate double quote\"\n"
    ;

int main(void)
{
    wchar_t line[1024];

    while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))
        dissect(line);
    dissect(multiline);

    return 0;
}

Example output

$ cat csv.data
a,bb, c ,d""e,f
1,"2","",,"""",4
$ ./wcsv < csv.data
Input: 16 [a,bb, c ,d""e,f]
Field: [a,] (eol = 0)
Field: [bb,] (eol = 0)
Field: [ c ,] (eol = 0)
Field: [d""e,] (eol = 0)
Field: [f] (eol = 1)
Input: 17 [1,"2","",,"""",4]
Field: [1,] (eol = 0)
Field: ["2",] (eol = 0)
Field: ["",] (eol = 0)
Field: [,] (eol = 0)
Field: ["""",] (eol = 0)
Field: [4] (eol = 1)
Input: 296 [First field of first row,"This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not
   "This is second field of second row, but it is not multiline
   because it doesn't start 
   with an immediate double quote"]
Field: [First field of first row,] (eol = 0)
Field: ["This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not] (eol = 1)
Field: [   "This is second field of second row,] (eol = 0)
Field: [ but it is not multiline] (eol = 1)
Field: [   because it doesn't start ] (eol = 1)
Field: [   with an immediate double quote"] (eol = 1)
$

I said 'within its limitations'; what are its limitations?

Primarily, it isolates the raw field, rather than the converted field. Thus, the field it isolates has to be modified to produce the 'real' value, with the enclosing double quotes stripped off, and internal doubled double quotes replaced by single quotes. Converting a raw field to the real value mimics quite a lot of the code in the nextCsvField() function. The inputs are the start of the field and the end of the field (the separator character). Here's a second SSCCE with an extra function csvFieldData(), and the dissect() function shown above revised to call it. The format of the dissected output is slightly different so it looks better:

#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

extern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);

// Returns a pointer to the start of the next field,
// or zero if this is the last field in the CSV
// p is the start position of the field
// sep is the separator used, i.e. comma or semicolon
// newline says whether the field ends with a newline or with a comma
const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)
{
    // Parse quoted sequences
    if ('"' == p[0]) {
        p++;
        while (1) {
            // Find next double-quote
            p = wcschr(p, L'"');
            // If we don't find it or it's the last symbol
            // then this is the last field
            if (!p || !p[1])
                return 0;
            // Check for "", it is an escaped double-quote
            if (p[1] != '"')
                break;
            // Skip the escaped double-quote
            p += 2;
        }
    }

    // Find next newline or comma.
    wchar_t newline_or_sep[4] = L"\n\r ";
    newline_or_sep[2] = sep;
    p = wcspbrk(p, newline_or_sep);

    // If no newline or separator, this is the last field.
    if (!p)
        return 0;

    // Check if we had newline.
    *newline = (p[0] == '\r' || p[0] == '\n');

    // Handle "\r\n", otherwise just increment
    if (p[0] == '\r' && p[1] == '\n')
        p += 2;
    else
        p++;

    return p;
}

static wchar_t *csvFieldData(const wchar_t *fld_s, const wchar_t *fld_e, wchar_t *buffer, size_t buflen)
{
    wchar_t *dst = buffer;
    wchar_t *end = buffer + buflen - 1;
    const wchar_t *src = fld_s;

    if (*src == L'"')
    {
        const wchar_t *p = src + 1;
        while (p < fld_e && dst < end)
        {
            if (p[0] == L'"' && p+1 < fld_s && p[1] == L'"')
            {
                *dst++ = p[0];
                p += 2;
            }
            else if (p[0] == L'"')
            {
                p++;
                break;
            }
            else
                *dst++ = *p++;
        }
        src = p;
    }
    while (src < fld_e && dst < end)
        *dst++ = *src++;
    if (dst >= end)
        return 0;
    *dst = L'\0';
    return(buffer);
}

static void dissect(const wchar_t *line)
{
    const wchar_t *start = line;
    const wchar_t *next;
    bool     eol;
    wprintf(L"Input %3zd: [%.*ls]\n", wcslen(line), wcslen(line)-1, line);
    while ((next = nextCsvField(start, L',', &eol)) != 0)
    {
        wchar_t buffer[1024];
        wprintf(L"Raw Field: [%.*ls] (eol = %d)\n", (next - start - eol), start, eol);
        if (csvFieldData(start, next-1, buffer, sizeof(buffer)/sizeof(buffer[0])) != 0)
            wprintf(L"Field %3zd: [%ls]\n", wcslen(buffer), buffer);
        start = next;
    }
}

static const wchar_t multiline[] =
   L"First field of first row,\"This field is multiline\n"
    "\n"
    "but that's OK because it's enclosed in double quotes, and this\n"
    "is an escaped \"\" double quote\" but this one \"\" is not\n"
    "   \"This is second field of second row, but it is not multiline\n"
    "   because it doesn't start \n"
    "   with an immediate double quote\"\n"
    ;

int main(void)
{
    wchar_t line[1024];

    while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))
        dissect(line);
    dissect(multiline);

    return 0;
}

Example output

$ ./wcsv < csv.data
Input  16: [a,bb, c ,d""e,f]
Raw Field: [a,] (eol = 0)
Field   1: [a]
Raw Field: [bb,] (eol = 0)
Field   2: [bb]
Raw Field: [ c ,] (eol = 0)
Field   3: [ c ]
Raw Field: [d""e,] (eol = 0)
Field   4: [d""e]
Raw Field: [f] (eol = 1)
Field   1: [f]
Input  17: [1,"2","",,"""",4]
Raw Field: [1,] (eol = 0)
Field   1: [1]
Raw Field: ["2",] (eol = 0)
Field   1: [2]
Raw Field: ["",] (eol = 0)
Field   0: []
Raw Field: [,] (eol = 0)
Field   0: []
Raw Field: ["""",] (eol = 0)
Field   2: [""]
Raw Field: [4] (eol = 1)
Field   1: [4]
Input 296: [First field of first row,"This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not
   "This is second field of second row, but it is not multiline
   because it doesn't start 
   with an immediate double quote"]
Raw Field: [First field of first row,] (eol = 0)
Field  24: [First field of first row]
Raw Field: ["This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not] (eol = 1)
Field 140: [This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped " double quote" but this one "" is not]
Raw Field: [   "This is second field of second row,] (eol = 0)
Field  38: [   "This is second field of second row]
Raw Field: [ but it is not multiline] (eol = 1)
Field  24: [ but it is not multiline]
Raw Field: [   because it doesn't start ] (eol = 1)
Field  28: [   because it doesn't start ]
Raw Field: [   with an immediate double quote"] (eol = 1)
Field  34: [   with an immediate double quote"]
$ 

I've not tested with \r\n (or plain \r) line endings.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • Thanks, I've been writing a function that tries to parse this, but I'll need to test it some more. May be I'll try posting it at the Code Review site. – sashoalm Mar 20 '13 at 14:40
1

I've written a function (which I'm still testing) that tries to parse this as well as it can.

// Returns a pointer to the start of the next field, or zero if this is the
// last field in the CSV p is the start position of the field sep is the
// separator used, i.e. comma or semicolon newline says whether the field ends
// with a newline or with a comma
static const wchar_t* nextCsvField(const wchar_t *p, wchar_t sep, bool *newline, const wchar_t **escapedEnd)
{
    *escapedEnd = 0;
    *newline = false;

    // Parse quoted sequences
    if ('"' == p[0]) {
        p++;
        while (1) {
            // Find next double-quote.
            p = wcschr(p, L'"');
            // If we didn't find it then this is the last field.
            if (!p)
                return 0;
            // Check for "", it is an escaped double-quote.
            if (p[1] != '"') {
                *escapedEnd = p;
                break;
            }
            // If it's the last symbol then this is the last field
            if (!p[1])
                return 0;
            // Skip the escaped double-quote
            p += 2;
        }
    }

    // Find next newline or comma.
    wchar_t newline_or_sep[4] = L"\n\r ";
    newline_or_sep[2] = sep;
    p = wcspbrk(p, newline_or_sep);

    // If no newline or separator, this is the last field.
    if (!p)
        return 0;

    // Check if we had newline.
    *newline = (p[0] == '\r' || p[0] == '\n');

    // Handle "\r\n", otherwise just increment
    if (p[0] == '\r' && p[1] == '\n')
        p += 2;
    else
        p++;

    return p;
}
sashoalm
  • 75,001
  • 122
  • 434
  • 781
  • See my analysis of this code in my answer. Summary: looks OK to me. Missing tests: `\r` line endings; `\r\n` line endings. The latter might cause my code some trouble. – Jonathan Leffler Mar 22 '13 at 07:44
  • Great work, thanks! I'll be accepting your answer. Btw, I had since updated my function a little bit, to include marking the closing quote so I can more easily escape the field later, without searching for it again. That's the new `const wchar_t **escapedEnd` parameter. See my own answer if you're interested, I've updated it. – sashoalm Mar 22 '13 at 08:00
  • Don't forget all the trailing data after the 'ending double quote' in your example. Also, you have to reduce embedded doubled double quotes to a single double quote in the middle of the section that is enclosed in double quotes. But thanks for the update (and the accept). – Jonathan Leffler Mar 22 '13 at 08:36
  • My idea was to escape the range [csvData+1, escapedEnd), and then just concatenate [escapeEnd+1, next). Of course, keeping in mind escapedEnd and/or next might be NULL. – sashoalm Mar 22 '13 at 08:39