I'm trying to write code that reads a comma-separated text file line by line. I'm only interested in 3 of the fields, so I'm skipping over the rest. The problem is that two of the fields are strings enclosed in quotation marks, and one of them is optional.

For example, two successive lines might look like:

0,,10004,10004,"Albany Hwy After Galliers Av","",-32.13649428,116.0176090070,3
0,,10005,10005,"Albany Hwy Armadale Kelmscott Hospital","Armadale Kelmscott Hospital",-32.13481555555560,116.017707222222,3

Since I'm not interested in the strings (only in a few of the numbers), I'm skipping over them using the * modifier in scanf. The first string is easy: it's a mandatory field, so I can match the opening double quote and then skip everything up to the closing double quote with a regex, like so:

\"%*[^\"]

What I'm having trouble with is the second string field, right after the first. This field is optional: it may contain text, or it may be empty. Whenever it is empty, the regex listed above fails to match, and the whole scanf operation fails for that line. Despite my best efforts, I cannot produce a regex that matches everything up to the closing double quotation mark and also matches an empty string. Does anyone know how I could modify my regex to do this?

P.S. Here is an example of what my sscanf call looks like:

    res = sscanf(buf, "%*d,,%ld,%*ld,\"%*[^\"]\",\"%*[]\",%lf,%lf,%*d", &cursid, &curslat, &curslong);
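
A minimal reproduction of the failure, for reference: a `%[` scanset conversion has to match at least one character, so `%*[^\"]` stops the whole `sscanf` call when there is nothing between the quotes. The helper name `count_after_quoted` is made up for this illustration and is not part of the original code.

```c
#include <stdio.h>

/* Hypothetical helper: skip one quoted field, then read the integer
 * that follows it. Returns sscanf's result, i.e. the number of
 * successful assignments. */
static int count_after_quoted(const char *s, int *out)
{
    return sscanf(s, "\"%*[^\"]\",%d", out);
}
```

Called with the text `"abc",5` the helper returns 1 and stores 5; called with `"",5` it returns 0, because the scanset matches no characters and scanning stops before the `%d` is reached.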
Jabberwocky
Ammar
  • The [`scanf` family of functions](http://en.cppreference.com/w/c/io/fscanf) does *not* use regular expressions. – Some programmer dude Sep 02 '15 at 09:22
  • Sorry, my bad, this isn't regex. It's just utilizing the %[chars] function of scanf, where everything matching [chars] is read. The question still stands as to how I can achieve what I'm trying to do. – Ammar Sep 02 '15 at 09:25
  • As for how to read your CSV file, it's not as simple as one would think, because CSV file formats usually contain many corner cases that make them hard to parse. Try to find an existing library to read and parse your file. – Some programmer dude Sep 02 '15 at 09:26
  • `fscanf` is totally unsuited for reading CSV. You should read line by line using `fgets` and then parse the string yourself. But as pointed out in another comment, try to find some code out there. – Jabberwocky Sep 02 '15 at 09:28
  • Well, I am reading the line using fgets, and then parsing it using sscanf. I'm not using fscanf – Ammar Sep 02 '15 at 09:29
  • @Ammar Sorry, I meant `sscanf` which is very similar to `fscanf`. – Jabberwocky Sep 02 '15 at 09:30
  • @Ammar [this SO article](http://stackoverflow.com/questions/12911299/read-csv-file-in-c) may help. – Jabberwocky Sep 02 '15 at 09:31
  • That will definitely help. Thanks – Ammar Sep 02 '15 at 09:45
  • @Michael-Walz, the solution you reference does not handle quoted fields, which can contain the separator character. – Paul Ogilvie Sep 02 '15 at 11:22
  • @PaulOgilvie yes, but it may help to get started. – Jabberwocky Sep 02 '15 at 11:25

1 Answer

The following is a basic CSV parser:

#include <stdio.h>

#define TRUE  1
#define FALSE 0

void readCSVline(char *line);
char *readCSVfield(char *line, char *buf);

void readCSVdemo(void)
{
    char line[]= "0,,10004,10004,\"Albany Hwy After Galliers Av\",\"\",-32.13649428,116.0176090070,3";
    readCSVline(line);

}
/* readCSVline is where you put your "intelligence" about fields to read
 * and what to do with them. Note that field1 holds at most 79 characters
 * and readCSVfield does not check the length.
 */
void readCSVline(char *line)
{
    char field1[80], *lineptr=line;
    int nfields=0;

    while (*lineptr) {
        lineptr= readCSVfield(lineptr, field1);
        printf("%s\n", field1);
        nfields++;
    }
    printf("%d fields read.\n", nfields);
}
/* readCSVfield reads a field from a CSV line until the next comma or end-of-line.
 * It returns where the reading stopped.
 */
char *readCSVfield(char *line, char *buf)
{
    int instr= FALSE;   // track whether we are in a string
    char *cptr= line;

    while (*cptr)
    {
        if (instr) {
            if (*cptr=='"') {
                char cc= *++cptr;
                if (cc=='"')        // escaped double quote
                    *buf++ = '"';
                else {
                    *buf='\0';
                    cptr--;
                    instr= FALSE;
                }
            }
            else *buf++ = *cptr;
        }
        else switch (*cptr) {
        case '"': instr= TRUE; break;
        case ',': cptr++; *buf= '\0'; return(cptr);
        case ' ': case '\t': case '\n': case '\r': break;   // drop whitespace outside quotes
        default: *buf++ = *cptr;
        }
        cptr++;
    }
    *buf= '\0';
    return(cptr);
}
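
To tie this back to the question, here is a sketch of how the same field-by-field idea could pull out only the three numbers. It assumes the field order of the sample lines (ID in field 2, latitude in field 6, longitude in field 7, counting from 0). The names `struct stop`, `next_field` and `parse_stop` are invented for this example, and `next_field` is a bounded variant of `readCSVfield` that keeps whitespace instead of dropping it.

```c
#include <stdlib.h>

/* Hypothetical record type; the members mirror the question's variables. */
struct stop {
    long   id;
    double lat;
    double lon;
};

/* Copy the next CSV field from *pp into buf (at most size-1 characters),
 * stripping the surrounding quotes and collapsing "" to a single ".
 * Advances *pp past the field and its trailing comma. */
static void next_field(const char **pp, char *buf, size_t size)
{
    const char *p = *pp;
    size_t n = 0;
    int in_quotes = 0;

    for (; *p != '\0'; p++) {
        if (*p == '"') {
            if (in_quotes && p[1] == '"') {
                p++;                    /* escaped quote: keep one '"' */
            } else {
                in_quotes = !in_quotes; /* opening or closing quote */
                continue;
            }
        } else if (*p == ',' && !in_quotes) {
            p++;                        /* consume the separator */
            break;
        }
        if (n + 1 < size)
            buf[n++] = *p;              /* excess characters are dropped */
    }
    buf[n] = '\0';
    *pp = p;
}

/* Fill *out from one line. Returns 1 on success, 0 if one of the
 * wanted fields is missing or does not start with a number. */
static int parse_stop(const char *line, struct stop *out)
{
    char field[128];
    char *end;
    int i;

    for (i = 0; i <= 7; i++) {
        next_field(&line, field, sizeof field);
        if (i == 2) {
            out->id = strtol(field, &end, 10);
            if (end == field) return 0;
        } else if (i == 6) {
            out->lat = strtod(field, &end);
            if (end == field) return 0;
        } else if (i == 7) {
            out->lon = strtod(field, &end);
            if (end == field) return 0;
        }
    }
    return 1;
}
```

Because every field goes through the same reader, the empty `""` field needs no special case: it simply comes back as an empty string and is ignored.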

Note: processing linefeeds in a quoted string

Often the parser is called with a line that the caller has already read. To handle carriage returns and linefeeds inside a quoted string, the parser must respond to a \n within quotes by fetching the next line. The signature of readCSVfield would then need to include the line buffer and its size.
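
One way to approach this, sketched here as an alternative to changing readCSVfield itself, is to let the caller glue physical lines together until the quotes are balanced and only then hand the record to the parser. The helper names `ends_inside_quotes` and `read_record` are hypothetical. The sketch assumes `size` is at least 2, and a record that does not fit in the buffer is returned truncated.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the text ends inside an open quoted string. An escaped
 * quote ("") contributes two quote characters, so parity is enough. */
static int ends_inside_quotes(const char *s)
{
    int open = 0;

    for (; *s != '\0'; s++)
        if (*s == '"')
            open = !open;
    return open;
}

/* Read one CSV record from fp into buf, appending further physical
 * lines while a quoted string is still open.
 * Returns 1 if a record was read, 0 at end of file. */
static int read_record(FILE *fp, char *buf, size_t size)
{
    size_t len = 0;

    buf[0] = '\0';
    while (len + 1 < size && fgets(buf + len, (int)(size - len), fp) != NULL) {
        len += strlen(buf + len);
        if (!ends_inside_quotes(buf))
            return 1;           /* quotes balanced: record complete */
    }
    return len > 0;             /* partial record at EOF or buffer full */
}
```

The record returned by `read_record` still contains the embedded newline, so a field reader that copies characters inside quotes verbatim, as readCSVfield does, will preserve it.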

Paul Ogilvie