I am parsing a simple text file with two columns in C.

The two columns are separated by a tab. While I need the whole line in a later stage I also have to extract the value in the second column.

My implementation of this part is so far (reading a gzipped file):

while (! gzeof(fp)) {

   // here I keep the whole line since I need it later (can I do this also faster?)
   strcpy(line_save, line);

   // get the value in the second column (first removing the newline char.):
   line[strcspn(line, "\n")] = 0;
   linkage = strtok(line,"\t");
   linkage = strtok(NULL,"\t"); // here I have the value in the second col. as the result

   // do stuff

   gzgets(fp, line, LL);
}

What is a more time-efficient way to do this?

I am reading a gzipped file. gzeof() checks if EOF is reached and gzgets() reads one line.

I am not looking for an overly advanced solution here; I am mainly interested in the "low-hanging fruit". However, if you can present more advanced solutions, I do not mind.

Michael
  • The time to find a column will be dominated by the time it takes to read the file. – interjay Jul 19 '22 at 08:48
  • @interjay OK, thanks for pointing this out. What might be a solution here? Ramdisk? – Michael Jul 19 '22 at 08:57
  • What are `gzeof()` and `gzgets()`? – Zakk Jul 19 '22 at 09:03
  • @Zakk Check EOF for (gzeof) and read (gzgets) a gzipped file – Michael Jul 19 '22 at 09:06
  • As an obvious improvement you could remove the `\n` after splitting the line. `strcspn` has to walk along the whole string from the beginning. Chopping off the first part makes that walk shorter. – Gerhardh Jul 19 '22 at 09:06
  • I would assume using `while (!gzeof())` is wrong in the same way as [`while (!feof())`](https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong) – Gerhardh Jul 19 '22 at 09:08
  • @Michael Why don't you unzip the file first? – Zakk Jul 19 '22 at 09:10
  • @Zakk I wanted to simplify my workflow and keep the gzipped files. You mean unzip outside of the c-code before calling or within? – Michael Jul 19 '22 at 09:11
  • @Michael I mean either outside your code or within. – Zakk Jul 19 '22 at 09:13
  • Do you have more than 2 columns? If yes, you don't need to remove the `\n` at all as it will be chopped by the second `strtok` anyway. If no, you might not need the second `strtok` – Gerhardh Jul 19 '22 at 09:13
  • @Gerhardh I have exactly two columns. – Michael Jul 19 '22 at 09:14
  • Why not write your own loop over the string? In a single loop you can copy the string, find both tab characters ('\t') and extract your field. Or at least use the more basic character-based functions such as strchr instead of the more general string-based strcspn and strtok, since you only call them with one-character strings. – Marian Jul 19 '22 at 09:15
  • How often does the text file change? More specifically: can you do some kind of "if (original text file's modification time is newer than a pre-converted and cached binary file) { regenerate and cache the binary file }" followed by just using `mmap()`, without any of the overhead of decompression or parsing? – Brendan Jul 19 '22 at 11:20

2 Answers

I'm assuming that gzgets() behaves in a similar way to fgets():

ZEXTERN char * ZEXPORT gzgets OF((gzFile file, char *buf, int len));

Reads bytes from the compressed file until len-1 characters are read, or a newline character is read and transferred to buf, or an end-of-file condition is encountered. If any characters are read or if len == 1, the string is terminated with a null character. If no characters are read due to an end-of-file or len < 1, then the buffer is left untouched.

gzgets returns buf which is a null-terminated string, or it returns NULL for end-of-file or in case of error. If there was an error, the contents at buf are indeterminate.

char line[128]; // Extend as you see fit
while (gzgets(gzfile, line, sizeof(line))) {
    line[strcspn(line, "\n")] = '\0';
    
    char col1[64], col2[64];
    if (sscanf(line, " %63s\t%63[^\n]", col1, col2) != 2) {
        // Error while parsing the line: col1/col2 are not valid, so skip it
        puts("Error");
        continue;
    }
    
    // Testing
    printf("col1: '%s'\ncol2: '%s'\n", col1, col2);
    
    // Apart from the stripped newline, line is untouched.
}

Edit: The below version should run slightly faster than the one above:

  • Removed the call for strcspn()
  • The for-loop stops when a \t is met, so this avoids scanning the entire string.
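char line[128]; // Extend as you see fit
while (gzgets(gzfile, line, sizeof(line))) {
    char col1[64] = "", col2[64] = ""; // stay empty if the line has no tab
    for (char *p = line; *p != '\0' && *p != '\n'; ++p) {
        if (*p == '\t') {
            size_t n1 = (size_t)(p - line);
            size_t n2 = strlen(p + 1);
            if (n2 > 0 && p[n2] == '\n') --n2; // drop the trailing newline
            // Clamp both lengths so a long field cannot overflow its buffer
            if (n1 >= sizeof(col1)) n1 = sizeof(col1) - 1;
            if (n2 >= sizeof(col2)) n2 = sizeof(col2) - 1;
            memcpy(col1, line, n1);
            col1[n1] = '\0'; // strncpy would not terminate col1 here
            memcpy(col2, p + 1, n2);
            col2[n2] = '\0';
            break;
        }
    }
    
    // Testing
    printf("col1: '%s'\ncol2: '%s'\n", col1, col2);
    
    // And line is untouched.
}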
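
If the columns do not need to be copied at all, the same idea can be written with strchr(), as suggested in the comments on the question. The following is only a minimal sketch: second_column is a hypothetical helper name, not part of zlib or of the code above, and it assumes exactly two tab-separated columns per line. It returns a pointer into the original buffer plus a length, so the line is never modified and no separate saved copy is required.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name chosen for illustration only).
 * Returns a pointer to the start of the second column inside `line` and
 * stores its length, excluding any trailing newline, in *len.
 * Returns NULL if the line contains no tab. `line` is not modified. */
static const char *second_column(const char *line, size_t *len)
{
    assert(line != NULL && len != NULL);

    const char *tab = strchr(line, '\t');   /* stops at the first tab */
    if (tab == NULL)
        return NULL;

    /* Only the part after the tab is scanned for the newline. */
    *len = strcspn(tab + 1, "\n");
    return tab + 1;
}
```

Inside the read loop this could be called as second_column(line, &len), and the value printed with printf("%.*s", (int)len, col2), since the returned pointer is not terminated at the end of the column.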
Zakk

Try the following code. By the way, you probably do not need to copy line into line_save at all, because this code does not destroy the original line. If that is the case, you can drop the copy and break out of the inner loop as soon as t2 has been set.

while (! gzeof(fp)) {
    int i, t1, t2;
    
    t1 = t2 = -1;
    for(i=0; line[i]!=0; i++) {
        line_save[i] = line[i];
        if (line[i] == '\t') {
            if (t1 < 0) t1 = i;
            else if (t2 < 0) t2 = i;
        }
    }
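    line_save[i] = 0;

    // With exactly two columns there is no second tab: the value then ends
    // at the newline, or at the end of the string if there is none.
    if (t1 >= 0 && t2 < 0)
        t2 = (line[i-1] == '\n') ? i - 1 : i;

    if (t2 >= 0) {
        char saved = line[t2];
        line[t2] = 0;
        linkage = &line[t1+1];
        // do what you need with 'linkage'

        // reconstruct the original line
        line[t2] = saved;
    }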

    // do other stuff with 'line'

    gzgets(fp, line, LL);
}
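
A minimal sketch of the no-copy variant described at the top of this answer. It also drives the loop by the return value of the read call instead of an end-of-file test, as one of the comments on the question suggests. To keep the sketch compilable without zlib, fgets() on a FILE* stands in for gzgets(), which reports end-of-file the same way (by returning NULL). The name process_lines and the value of LL are assumptions made for illustration; neither comes from the original code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LL 256  /* assumed buffer size; the question does not state its value */

/* Hypothetical driver: reads lines until the read call itself fails and
 * returns how many lines had a non-empty second column.
 * fgets() is a stand-in for gzgets() in this sketch. */
static int process_lines(FILE *fp)
{
    char line[LL];
    int parsed = 0;

    assert(fp != NULL);

    while (fgets(line, sizeof line, fp) != NULL) {
        char *tab = strchr(line, '\t');
        if (tab == NULL)
            continue;                  /* no second column on this line */

        char *linkage = tab + 1;
        char *end = linkage + strcspn(linkage, "\t\n");
        char saved = *end;             /* '\t', '\n' or '\0' */

        *end = '\0';                   /* terminate the value in place */
        /* do what you need with 'linkage' */
        parsed += (*linkage != '\0');
        *end = saved;                  /* line is whole again, no copy made */

        /* do other stuff with 'line' */
    }
    return parsed;
}
```

Because the overwritten character is saved and written back, the full line is available again after the value has been used, which is the reason the line_save copy can be dropped.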
Marian