Fast Processing of Text Files in C

Question

I've written some C code (not a C pro 'though), which is supposed to be as fast as possible. The algorithm is finished and I'm pleased with it's speed. But before it starts, I have to get some information from a text file, which is way to slow.

Right now the processing of the text file needs about 3 seconds for bigger files, while the same file is processed by Java code in less than 1 second, because Java has premade methods like readline() in it's framwork which alone contains more than 100 lines of pure code.

Is there any comparable Framework for C? I couldn't find anything on Google, because no matter how I rephrased my search requests I would get nothing, but tutorials on how to user fopen()...

If you wonder why I don't use Java then: The algorithm itself is way faster in C.

Here is the code I use in C. What needs to be done is to process a .cnf file in DINMACS format.

    while ((temp = fgetc(fp)) != EOF)
    {   
        if (temp == 'c')
        {
            //evtl. im Labor auf 13 ändern
            while ((temp =fgetc(fp)) != 10 && temp != EOF);
        }

        if (temp == 'p')
        {
            while ((temp =fgetc(fp)) < '0' ||  temp > '9');

            while (temp != 32)
            {
                variablen= (variablen * 10) + (temp - '0');
                temp=fgetc(fp);

            }

            while ((temp =fgetc(fp)) < '0' ||  temp > '9');

            while ((temp!= 32) && (temp != 10 ) )
            {
                klauseln= (klauseln * 10) + (temp - '0');
                temp=fgetc(fp);
            }

            while ((temp != 10) && (temp != EOF))
            {
                temp=fgetc(fp);
            }

            break;
        }
    }

    phi = (int *) malloc(klauseln * variablen * sizeof(int));

    int zaehler2 = 0;
    for (int j = 0; j < klauseln; ++j)
    {
        for (int i = 0; i < variablen; ++i)
        {
            phi[zaehler2++] = 0;
        }
    }

    int zeile = 0;

    while ((temp = fgetc(fp)) != EOF)
    {   
        if (temp == 'c')
        {
            while ((temp =fgetc(fp)) != 10 && temp != EOF);
        }
        else
        {
            while (temp != '0')
            {                        
                    int neg = 1;
                    int wert = 0;

                    while (temp != 32)
                    {
                        if (temp == '-') 
                        {
                            neg = -1;
                        }
                        else
                        {
                            wert = (wert * 10) + (temp - '0');
                        }

                        temp = fgetc(fp);
                    }
                    phi[wert - 1 + zeile] = neg;
                    temp = fgetc(fp);    
            }

            zeile = zeile + variablen;
            temp = fgetc(fp);    
        }
    }

You have written very slow code to process the input file, where is it? _If you wonder why I don't use Java_: the file I/O should also be faster, but you must have it wrong, post the code. — Iharob Al Asimi, Jan 15 '15 at 18:14
When you tag a question as `[java]` it usually means you want the answer in Java. I wouldn't assume C is way faster than Java if it has been optimised. BTW readLine() is not meant to be fast, but easy to use. There is much faster ways to read data in Java. — Peter Lawrey, Jan 15 '15 at 18:36
Hey! I edited the code up below. I should add, that I'm not advanced in C, Java or programming in general. This is the first time I process any text file in C. — joniboni, Jan 15 '15 at 18:54
If you process character by character it would be *much* faster if you use `fread()` to read in a big buffer and then process *that* char by char. — Weather Vane, Jan 15 '15 at 18:58
@WeatherVane: You don't know that it would be faster to read as a single block. It's perfectly possible that `fread` is buffering already. The only way to know is to profile. — Adrian McCarthy, Jan 15 '15 at 19:13
@WeatherVane: if you read a regular file (and that's what OP states to do) then under 'normal' circumstances `fgetc()` is buffering internally so you can't expect to improve performance by using `fread()` and parsing the buffer — Ingo Leonhardt, Jan 15 '15 at 19:31
Look at functions defined in [stdio.h](http://www.cplusplus.com/reference/cstdio/), in particular [`fscanf()`](http://www.cplusplus.com/reference/cstdio/fscanf/) — francis, Jan 15 '15 at 19:31
Insure `temp` is an `int` for speed and correct functionality. — chux - Reinstate Monica, Jan 15 '15 at 19:47
@IngoLeonhardt: There are faster ways of extracting characters out of a buffer though. Hence the existence of getc and getc_unlocked. — doynax, Jan 15 '15 at 19:57
You might try using `getc()` instead of `fgetc()` as it may be more optimized. See, e.g., http://stackoverflow.com/questions/18480982/getc-vs-fgetc-what-are-the-major-differences — cfh, Jan 15 '15 at 20:28
I also wonder whether implementing your algorithm in Java is indeed slower than in C (at least if you avoid a few pitfalls like too many implicit/inadvertent object creations). When I was doing my own benchmarks in order to get my own opinion about the flame war of the Java vs. C folks about 10 years ago(!), I had a hard time finding benchmarks which were slower for Java; of course only for long-running (i.e. > 10s) programs because of the startup cost of the JVM. And I was biased towards C. Just-in-time compilation can do amazing things due to run-time information and global program access. — Peter - Reinstate Monica, Jan 16 '15 at 12:22
@doynax: Correct, cf. http://pubs.opengroup.org/onlinepubs/009695399/functions/getc_unlocked.html. — Peter - Reinstate Monica, Jan 16 '15 at 12:28
constants like `32` and `10` have portable counterparts like `' '` and `'\n'`. Please use them, you can get your code in an IBM machine and get surprised that 32 is not a space and 10 is not a newline character. Also, using constant names like `base` for 10 is good practice, case you have to switch to octal or hex. Using a `switch` clause is better style and frequently better performance as switch can be optimized better than a couple of nested `if`... `else` constructs. — Luis Colorado, Jan 16 '15 at 13:22
The doubly nested `for` loop can be improved with a **memset(3)** call. Stdio **fgetc(3)** routine is optimized for speed (it uses a buffer to speed things) so it's normally ok to call it. You will suffer no penalty for using it for character at a time processing. — Luis Colorado, Jan 16 '15 at 13:34

score 3 · Answer 1 · answered Jan 15 '15 at 19:31

To speed up code, you first check to see if there's a better algorithm.

There is nothing algorithmically wrong. You're processing each character, in sequence, without backtracking, so it's O(n), which is as good as you could expect.

So all you can do is try to find faster ways to do what you're already doing. To do that, you need to profile the code. You can't know where the time is being spent otherwise. If you don't know the most biggest bottleneck, you'll waste a lot of time trying to optimize the wrong spot.

It's possible that reading the file character by character is slow, and you might be better off reading the file in large chunks and then process the characters from memory. But it's also possible that fread is doing that for you behind the scenes, so it might not buy you anything.

Reducing the number of tests (comparisons) might help. For example, when you check for 10 (linefeed) or EOF, you have to do two tests for every character. If you read the file into memory first, you could append a sentinel 10 to the end of the buffer, and that loop would then have to check only for linefeeds.

something "algorithmically wrong" `while ((temp =fgetc(fp)) < '0' || temp > '9');` is a potential infinite loop. Once `EOF` occurs, loop never ends. — chux - Reinstate Monica, Jan 15 '15 at 21:24
@chux: point taken, but I regard that as a wrong implementation rather than an application of the wrong algorithm. My point was that you can't really do better than linear, and this algorithm (barring implementation errors) is linear. If it had been worse than linear, then, I'd claim, that it was a poor choice of algorithm. — Adrian McCarthy, Jan 16 '15 at 18:30

Weather Vane · Answer 2 · 2015-01-15T20:33:50.240

3

I ran a test that reads chars from a file using fgetc(), another using getc() ("e8" method) and a buffered version that collects the chars from a local buffer.

#include<stdio.h>
#include<stdlib.h>
#include<time.h>

#define BUFLEN  1024

FILE *fp;
char fname[] = "test.txt";
int bufsize, bufind;

int getachar() {
    static unsigned char buf[BUFLEN];
    if (bufind >= bufsize) {
        bufsize = fread(buf, sizeof(char), BUFLEN, fp);
        if (bufsize == 0)
            return -1;
        bufind = 0;
    }
    return buf[bufind++];
}

void WVmethod (void) {
    int temp, count=0;
    bufsize = bufind = 0;
    if ((fp = fopen(fname, "rt")) == NULL)
        return;
    while ((temp = getachar()) != -1) count++;
    fclose(fp);
    printf ("WV method read %d chars. ", count);
}

void OPmethod (void) {
    int temp, count=0;
    if ((fp = fopen(fname, "rt")) == NULL)
        return;
    while ((temp = fgetc(fp)) != EOF) count++;
    fclose(fp);
    printf ("OP method read %d chars. ", count);
}

void e8method (void) {
    int temp, count=0;
    if ((fp = fopen(fname, "rt")) == NULL)
        return;
    while ((temp = getc(fp)) != EOF) count++;
    fclose(fp);
    printf ("e8 method read %d chars. ", count);
}

int main()
{
    clock_t start, elapsed;
    int loop;

    for (loop=0; loop<3; loop++) {
        start = clock();
        WVmethod();
        elapsed = clock() - start;
        printf ("Clock ticks = %d\n", (int)elapsed);

        start = clock();
        OPmethod();
        elapsed = clock() - start;
        printf ("Clock ticks = %d\n", (int)elapsed);

        start = clock();
        e8method();
        elapsed = clock() - start;
        printf ("Clock ticks = %d\n", (int)elapsed);

        printf ("\n");
    }
    return 0;
}

Program output:

WV method read 24494400 chars. Clock ticks = 265
OP method read 24494400 chars. Clock ticks = 1575
e8 method read 24494400 chars. Clock ticks = 1544

WV method read 24494400 chars. Clock ticks = 266
OP method read 24494400 chars. Clock ticks = 1591
e8 method read 24494400 chars. Clock ticks = 1544

WV method read 24494400 chars. Clock ticks = 265
OP method read 24494400 chars. Clock ticks = 1607
e8 method read 24494400 chars. Clock ticks = 1545

edited Jan 15 '15 at 20:33

answered Jan 15 '15 at 19:39

Weather Vane

33,872
7
36
56

@chux improved as your suggestion. – Weather Vane Jan 15 '15 at 20:04
Interesting! Wonder why 2nd WV pass 374/249 slower than 1st WV pass? (Note: suggest `bufsize = 0` right after `fopen()`). – chux - Reinstate Monica Jan 15 '15 at 20:15
@chux oops yes, shouldn't rely on initialised static variables, but it was benign as `bufsize` starts and ends as 0. The timings always vary a bit, but are generally similar to those shown. There must have been an important Windows job on the second WV test. – Weather Vane Jan 15 '15 at 20:18
Could you also add a test using `getc()`? I wonder whether it differs in its caching behavior. – cfh Jan 15 '15 at 20:24
What compiler, library, and platform is this test on? – Adrian McCarthy Jan 15 '15 at 20:50
@AdrianMcCarthy MSVC 9.0 on Windows 7, 64-bit dual core 4Gb memory. But OP can use this code to check out if there is a benefit on his own system. – Weather Vane Jan 15 '15 at 20:54
I like the benchmark, it's also a bit unexpected. The fastest way is, btw, to read with a single low-level read(). That's true even if you afterwards loop over the data and do something with it for fairness, like compare it against EOF ;-). – Peter - Reinstate Monica Jan 15 '15 at 23:34
I found the difference between getchar and your "getachar" striking, because what else should getchar do, after all?? The answer is that apparently every getchar() in glibc must obtain a lock! (Cf. e.g.http://fossies.org/dox/glibc-2.20/getchar_8c_source.html.) It will likely not be different in other standard libs. That is the likely source of the performance degradation compared to block reads. Being multi threading safe comes at a cost. – Peter - Reinstate Monica Jan 16 '15 at 12:03
I still was curious and have made some additional testing myself to find out why the performance is at it is and finally found that it was the locking of the stdio - functions: Using `fgetc_unlocked()` I've got the best perfornance, also beating `getachar_unlocked()` which called `fread_unlocked()` internally – Ingo Leonhardt Jan 20 '15 at 16:46
@IngoLeonhardt that's interesting thank you. I suppose `getachar()` is now slower due to double buffering. I have always assumed that `fgetc()` will make a system call every time, and perhaps in earlier compilers that was true. – Weather Vane Jan 20 '15 at 17:00
Alt least on UNIX/Linux the f...() functions have always been buffered, that's the reason for my original comment. Your tests have been very interesting for me too because I've always imagined `fgetc()` being a function very similar to your `getachar()`. But now again I've learned sth. that could be very helpful. Thank you – Ingo Leonhardt Jan 20 '15 at 17:11

score 0 · Accepted Answer · answered Jan 15 '15 at 21:38

My guess is that you are looking for basic functions to read file and that reading characters one by one is not the direction you are looking for.

There are many functions to read and handle string in c. In stdio.h are some functions that could help you :

char * fgets ( char * str, int num, FILE * stream ) : read characters until end of line or end of file, if num is large enough.
int sscanf ( const char * s, const char * format, ...); : reads formatted input. For instance fscanf(line,"%d",&nb); will read an integer and place it into nb. It is not possible to call sscanf many times on the same string. But a bypass is to use strtok() of string.h to split a string using space " " as a separator.

Here a sample code doing the job :

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


#define MAX_LINE_SIZE 1000
#define MAX_SEQ_SIZE 100

int main()
{
    FILE * pFile;
    char line [MAX_LINE_SIZE];
    int nbvar,nbclauses;
    int* array=NULL;
    int i=0;int j;
    pFile = fopen ("example.txt" , "r");
    if (pFile == NULL){ perror ("Error opening file");}
    else {
        while (fgets(line, MAX_LINE_SIZE, pFile) != NULL){
            printf("%s",line);fflush(stdout);
            // parsing the line
            if(line[0]!='c' && line[0]!='\0'){
                if(line[0]=='p'){
                    sscanf(line,"%*s%*s%d%d",&nbvar,&nbclauses);
                    array=malloc(MAX_SEQ_SIZE*nbclauses*sizeof(int));
                }else{
                    char * temp;
                    char stop=0;
                    j=0;
                    //strtok split the line into token
                    temp=strtok(line," ");
                    while(stop==0){
                        sscanf(temp,"%d",&array[i*(MAX_SEQ_SIZE)+j]);
                        temp=strtok(NULL," ");
                        if(array[i*MAX_SEQ_SIZE+j]==0){stop=1;}
                        j++;
                        printf("j %d\n",j );fflush(stdout);
                    }
                    i++;
                }

            }
        }
        fclose (pFile);
    }
    if(array!=NULL){
        for(i=0;i<nbclauses;i++){
            j=0;
            while(array[i*MAX_SEQ_SIZE+j]!=0){
                printf("line %d seq item %d worths %d\n",i,j,array[i*MAX_SEQ_SIZE+j]);
                j++;
            }
        }
        free(array);
    }
    return 0;
}

"It is not possible to call sscanf many times on the same string" Hmmm Many ways to do the `while(stop==0){ sscanf ...` loop with repetitive calls `sscanf()` calls on `temp/line`. — chux - Reinstate Monica, Jan 16 '15 at 20:19

Fast Processing of Text Files in C

3 Answers3