C: parsing comma-separated values with line breaks

Question

I have a CSV data file that has the following data:

H1,H2,H3
a,"b
c
d",e

When I open it in Excel as a CSV file, it shows the sheet with column headings H1, H2, H3 and column values as: a for H1,

a multi-line value of
b
c
d
for H2,

and e for H3. I need to parse this file using a C program and have the values picked up like this. But my code snippet below will not work, as I have multi-line values for a column:

char buff[200];
char tokens[10][30];
fgets(buff, 200, stdin);
char *ptok = buff; // for iterating
char *pch; 
int i = 0;
while ((pch = strchr(ptok, ',')) != NULL) {
  *pch = 0; 
  strcpy(tokens[i++], ptok);
  ptok = pch+1;
}
strcpy(tokens[i++], ptok);

How can I modify this code snippet to accommodate multi-line column values? Please don't be bothered by the hard-coded sizes of the string buffers; this is test code for a POC. Instead of using a 3rd-party library, I would like to do it the hard way, from first principles. Please help.

Parsing a CSV file is *deceptively* simple because there are many corner and special cases that are hard to remember to handle. Or just plain hard to handle. What if, for example, the multi-line string contained a comma? Try to find a library which can handle it for you instead. — Some programmer dude, Apr 04 '17 at 13:39
well for starters, you should look at making it so that your code can read in extra lines and have `buff` be any size rather than limited to 199 characters. — Chris Turner, Apr 04 '17 at 13:40
Please don't get bothered by the hard-coded values for the string buffers, this is the test code as POC. Instead of any 3rd party library, I would like to do it the hard way from first principle — Dr. Debasish Jana, Apr 04 '17 at 13:45
If you want to do it all yourself, then start by creating lots of unit tests so you're sure that when you're finished it will be correct. Then for the actual parsing, you actually have to do some parsing of the contents; you can't just read line by line and use `strtok` to split the contents. I suggest using a larger buffer and reading into it. Then process character by character, handling the comma (when not in a string) and handling strings and possible escapes when you get to them. — Some programmer dude, Apr 04 '17 at 13:50
In http://stackoverflow.com/questions/32349263/c-regex-how-to-match-any-string-ending-with-or-any-empty-string/32351114#32351114 I provide a basic CSV parser in C. If linebreaks are within a quoted string, they are copied to the field being parsed. — Paul Ogilvie, Apr 04 '17 at 13:50

score 1 · Accepted Answer · answered Apr 05 '17 at 15:44

The main complication in parsing "well-formed" CSV in C is precisely the handling of variable-length strings and arrays which you are avoiding by using fixed-length strings and arrays. (The other complication is handling not well-formed CSV.)

Without those complications, the parsing is really quite simple:

(untested)

/* Appends a non-quoted field to s and returns the delimiter */
int readSimpleField(struct String* s) {
  for (;;) {
    int ch = getc(stdin);
    if (ch == ',' || ch == '\n' || ch == EOF) return ch;
    stringAppend(s, ch);
  }
}

/* Appends a quoted field to s and returns the delimiter.
 * Assumes the open quote has already been read.
 * If the field is not terminated, returns ERROR, which
 * should be a value different from any character or EOF.
 * The delimiter returned is the character after the closing quote
 * (or EOF), which may not be a valid delimiter. Caller should check.
 */
int readQuotedField(struct String* s) {
  for (;;) {
    int ch = getc(stdin);
    if (ch == EOF) return ERROR;
    if (ch == '"') {
      /* A doubled quote is a literal quote; anything else ends the field */
      ch = getc(stdin);
      if (ch != '"') return ch;
    }
    stringAppend(s, ch);
  }
}

/* Reads a single field into s and returns the following delimiter,
 * which might be invalid.
 */
int readField(struct String* s) {
  stringClear(s);
  int ch = getc(stdin);
  if (ch == '"') return readQuotedField(s);
  /* An empty field is followed immediately by its delimiter */
  if (ch == ',' || ch == '\n' || ch == EOF) return ch;
  stringAppend(s, ch);
  return readSimpleField(s);
}

/* Reads a single row into row and returns the following delimiter,
 * which might be invalid.
 */
int readRow(struct Row* row) {
  struct String field = {0};
  rowClear(row);
  /* Make sure there is at least one field */
  int ch = getc(stdin);
  if (ch != '\n' && ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readField(&field);
      rowAppend(row, &field);
    } while (ch == ',');
  }
  return ch;
}

/* Reads an entire CSV file into table.
 * Returns true if the parse was successful.
 * If an error is encountered, returns false. If the end-of-file
 * indicator is set, the error was an unterminated quoted field; 
 * otherwise, the next character read will be the one which
 * triggered the error.
 */
bool readCSV(struct Table* table) {
  tableClear(table);
  struct Row row = {0};
  /* Make sure there is at least one row */
  int ch = getc(stdin);
  if (ch != EOF) {
    ungetc(ch, stdin);
    do {
      ch = readRow(&row);
      tableAppend(table, &row);
    } while (ch == '\n');
  }
  return ch == EOF;
}

The above is "from first principles" -- it does not even use standard C library string functions. But it takes some effort to understand and verify. Personally, I would use (f)lex and maybe even yacc/bison (although it's a bit of overkill) to simplify the code and make the expected syntax more obvious. But handling variable-length structures in C will still need to be the first step.
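The parsing functions rely on `struct String`, `stringAppend` and `stringClear`, none of which are defined in this answer. As a rough illustration of that first step, here is a minimal sketch of a growable string. The struct layout, the doubling growth policy, the exit-on-allocation-failure behaviour and the `stringFree` helper are all assumptions of mine, not anything the parsing code requires.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable string; this layout is an assumption. */
struct String {
  char *data;   /* NUL-terminated once anything has been appended */
  size_t len;   /* characters stored, excluding the NUL */
  size_t cap;   /* bytes currently allocated */
};

/* Empties the string but keeps its allocation for reuse. */
void stringClear(struct String *s) {
  s->len = 0;
  if (s->data) s->data[0] = '\0';
}

/* Appends one character, doubling the buffer when it is full.
 * Assumption: running out of memory simply aborts the program. */
void stringAppend(struct String *s, int ch) {
  if (s->len + 2 > s->cap) {          /* need room for ch and the NUL */
    size_t cap = s->cap ? 2 * s->cap : 16;
    char *p = realloc(s->data, cap);
    if (p == NULL) { perror("realloc"); exit(EXIT_FAILURE); }
    s->data = p;
    s->cap = cap;
  }
  s->data[s->len++] = (char)ch;
  s->data[s->len] = '\0';
}

/* Hypothetical helper: releases the buffer and zeroes the struct. */
void stringFree(struct String *s) {
  free(s->data);
  s->data = NULL;
  s->len = s->cap = 0;
}
```

With this layout a zero-initialised `struct String field = {0};` is already a valid empty string, which is how `readRow` declares it. Since `readField` clears and reuses the same string for every field, `rowAppend` would presumably have to copy the characters rather than store the pointer. A `struct Row` (array of strings) and `struct Table` (array of rows) could grow using the same doubling pattern.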
