
I have read other posts about copying data from files; let me show you why my case is different. In C, I must read 43 million input lines from a CSV file. The entries have no errors and are in this form:

int, int , int , variable length string with only alphanumerics and spaces , int \n

The thing is, I'm copying all the data into memory (arrays and lists) to compute some very simple averages on it and then write all processed data to a file, nothing fancy. There are three main areas where I need help:

  1. Regarding the string (my BIGGEST problem is here): it is first read from the file, then copied into an array, and then passed to other functions which will copy it into dynamic memory only if a condition is met. Ex:

    fileAnalizer(){
      while ( ! EOF ){
        char * s = function_to_read_string_from_file();
        data_to_array(s);
      }
      ....
      ....
      processData(arrays);
      dataToFiles(arrays);
    
    }
    
    void data_to_structures(char * s){
      if ( condition is met ) {
        char * aux = malloc((strlen(s)+1) * sizeof(char));
        strcpy(aux,s);
      }
      ....
      ...
    }
    

    As you can see, the string is traversed three times. I need a way to do this process more efficiently, traversing the string fewer times. I have tried reading char by char and counting the string length myself, but the entire process becomes slower.

  2. Efficient reading of input: would you recommend first copying all the data into a buffer? If so, what type of buffer, and in many blocks or just one? This is my current reading code:

    void
    csvReader(FILE* f){
        T_structCDT c;
        c.string = malloc(MAX_STRING_LENGTH*sizeof(char));
        while (fscanf(f,"%d,%d,%d,%[^,],%d\n",&c.a, &c.b, &c.vivienda, c.c, &c.d)==5 ){
            data_to_structures(c);
        }
    }
    
  3. I then have nearly 500 CSV lines of processed data to dump into other files. How would you recommend doing the dump? Line by line, or again sending the data to a buffer first and then dumping it? My code now looks something like this:

    void dataToFiles(arrayOfStructs immenseAr1, arrayOfStructs immenseAr2){
      for (iteration over immenseAr1) {
          fprintf(f1, "%d,%d,%s\n", immenseAr1[i].a, immenseAr1[i].b, immenseAr1[i].string);
      }
      for (iteration over immenseAr2) {
          fprintf(f2, "%d,%d,%s\n", immenseAr2[i].a, immenseAr2[i].b, immenseAr2[i].string);
      }
    }
    

I must read all the data before dumping it. Would you recommend a different method than storing all the data in memory, then analyzing it, and then dumping the analyzed data? With 20 million lines the program is currently taking more than 40 seconds. I really need to lower that time.

  • *First*. If you have an optimizing compiler like gcc, then specify the -O2 or -O3 command line option. If you've posted an MCVE, then I might not be seeing just how many instructions might be optimized out of the execution. *Second*. Use a profiler. If there is a lot of skew in the sources of the contributors to response time, then you'll probably find a nugget. I do, however, see a call to calloc followed by a call to strcpy. Perhaps strdup would be a good substitute? But I have no clue if this is going to be a big contributor to response time. Use a profiler and you'll know for sure. – Jeff Holt Jul 07 '17 at 05:03
  • Well, without any numbers on the processing time vs. disk time and the like, it's difficult to suggest much. If the processing takes a lot of effort, you could possibly read it in chunks, processing one chunk while the next is being read in. Perhaps you could pool the chunk memory to avoid some mallocs. Without much more detail, it's difficult to see where economies might be made. There does seem to be a lot of copying? I assume your CPU and SSD hardware is already reasonably fast? – ThingyWotsit Jul 07 '17 at 05:11
  • How many of the 43 million lines do you have to store in memory? You say there are 500 lines to output — are they lines from the 43 million, or are they manufactured somehow? Which parts of the code are _actually_ using up the time? How did you measure that? I'd be surprised indeed if the `dataToFiles` processing is the bottleneck. How complex is the condition to be met? How are `data_to_array()` and `data_to_structures()` related? How is `function_to_read_string_from_file()` related to `csvReader()`? Your pseudo-code is not clearly coherent. You need to be crystal clear. – Jonathan Leffler Jul 07 '17 at 05:20

3 Answers

  1. Use aux=strdup(s); instead of calloc(), strlen() & strcpy(). (A short sketch follows at the end of this answer.)

  2. Your OS (filesystem) will usually be very efficient at buffering the data stream. One might find a more efficient way to buffer it, but such attempts often just end up redundantly buffering what the OS is already buffering. Your OS most likely provides specific functions that allow you to bypass the buffering usually done by the OS/filesystem; generally, this would mean not using "stdio.h" functions such as fscanf(), etc.

  3. Again, be careful not to unnecessarily double buffer your data. Keep in mind that the OS will buffer your data and actually write it out when it is generally most efficient to do so. (That is why there is an fflush() function: to push a stream's buffered data out when you need it written before proceeding.) And, just as there are usually OS-specific functions to bypass the read buffers, there are generally OS-specific functions to bypass the write buffers. However, these functions are probably beyond the scope of your needs (and perhaps this audience).

My summary answer (as I stated above) is that attempting to out-think the OS, and the way it is buffering your data streams, will usually result in less efficient code.
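As a rough illustration of point 1, here is a minimal sketch of what the OP's data_to_structures() could look like with strdup() (a POSIX, and now C23, function). condition_is_met() and store_record() are hypothetical placeholders for the real test and bookkeeping, which were not shown:

#include <stdlib.h>
#include <string.h>

/* Hypothetical placeholders for the OP's real condition and bookkeeping. */
int condition_is_met(const char *s);
void store_record(char *copy);

void data_to_structures(const char *s) {
    if (condition_is_met(s)) {
        char *aux = strdup(s);   /* one call replaces malloc(strlen(s)+1) + strcpy(aux, s) */
        if (aux != NULL)
            store_record(aux);   /* keep the copy in the dynamic structures */
    }
}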

Mahonri Moriancumer

Try scanning through your big file without storing it all in memory, keeping just one record at a time in local variables:

void csvReader(FILE *f) {
    T_structCDT c;
    int count = 0;
    c.string = malloc(1000);
    while (fscanf(f, "%d,%d,%d,%999[^,],%d\n", &c.a, &c.b, &c.vivienda, c.c, &c.d) == 5) {
        // nothing for now
        count++;
    }
    printf("%d records parsed\n");
}

Measure the time taken for this simplistic parser:

  • If it is quick enough, perform the tests for selection and output the few matching records one at a time as they are found during the parse phase. The extra time for these steps should be fairly small, since only a few records match.

  • If the time is too long, you need a more fancy CSV parser, which is a lot of work but can be done and made fast, especially if you can assume your input file uses this simple format for all records. This is too broad a subject to detail here, but the speed achievable should be close to that of cat csvfile > /dev/null or grep a_short_string_not_present csvfile.

On my system (an average Linux server with a regular hard disk), it takes less than 20 seconds to parse 40 million lines totalling 2 GB from a cold start, and less than 4 seconds the second time: disk I/O seems to be the bottleneck.

If you need to perform this selection very often, you should probably use a different data format, possibly a database system. If the scan is performed occasionally on data whose format is fixed, using faster storage such as SSD will help but don't expect miracles.

EDIT: To put these words into action, I wrote a simple generator and an extractor:

Here is a simple program to generate CSV data:

#include <stdio.h>
#include <stdlib.h>

const char *dict[] = {
    "Lorem", "ipsum", "dolor", "sit", "amet;", "consectetur", "adipiscing", "elit;",
    "sed", "do", "eiusmod", "tempor", "incididunt", "ut", "labore", "et",
    "dolore", "magna", "aliqua.", "Ut", "enim", "ad", "minim", "veniam;",
    "quis", "nostrud", "exercitation", "ullamco", "laboris", "nisi", "ut", "aliquip",
    "ex", "ea", "commodo", "consequat.", "Duis", "aute", "irure", "dolor",
    "in", "reprehenderit", "in", "voluptate", "velit", "esse", "cillum", "dolore",
    "eu", "fugiat", "nulla", "pariatur.", "Excepteur", "sint", "occaecat", "cupidatat",
    "non", "proident;", "sunt", "in", "culpa", "qui", "officia", "deserunt",
    "mollit", "anim", "id", "est", "laborum.",
};

int csvgen(const char *fmt, long lines) {
    char buf[1024];

    if (*fmt == '\0')
        return 1;

    while (lines > 0) {
        size_t pos = 0;
        int count = 0;
        for (const char *p = fmt; *p && pos < sizeof(buf); p++) {
            switch (*p) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                count = count * 10 + *p - '0';
                continue;
            case 'd':
                if (!count) count = 101;
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%d",
                                rand() % (2 + count - 1) - count + 1);
                count = 0;
                continue;
            case 'u':
                if (!count) count = 101;
                pos += snprintf(buf + pos, sizeof(buf) - pos, "%u",
                                rand() % count);
                count = 0;
                continue;
            case 's':
                if (!count) count = 4;
                count = rand() % count + 1;
                while (count-- > 0 && pos < sizeof(buf)) {
                    pos += snprintf(buf + pos, sizeof(buf) - pos, "%s ",
                                    dict[rand() % (sizeof(dict) / sizeof(*dict))]);
                }
                if (pos < sizeof(buf)) {
                    pos--;
                }
                count = 0;
                continue;
            default:
                buf[pos++] = *p;
                count = 0;
                continue;
            }
        }
        if (pos < sizeof(buf)) {
            buf[pos++] = '\n';
            fwrite(buf, 1, pos, stdout);
            lines--;
        }
    }
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc < 3) {
        fprintf(stderr, "usage: csvgen format number\n");
        return 2;
    }
    return csvgen(argv[1], strtol(argv[2], NULL, 0));
}

Here is an extractor with 3 different parsing methods:

#include <stdio.h>
#include <stdlib.h>

static inline unsigned int getuint(const char *p, const char **pp) {
    unsigned int d, n = 0;
    while ((d = *p - '0') <= 9) {
        n = n * 10 + d;
        p++;
    }
    *pp = p;
    return n;
}

int csvgrep(FILE *f, int method) {
    struct {
        int a, b, c, d;
        int spos, slen;
        char s[1000];
    } c;
    int count = 0, line = 0;

    // select 500 out of 43M
#define select(c)  ((c).a == 100 && (c).b == 100 && (c).c > 74 && (c).d > 50)

    if (method == 0) {
        // default method: fscanf
        while (fscanf(f, "%d,%d,%d,%999[^,],%d\n", &c.a, &c.b, &c.c, c.s, &c.d) == 5) {
            line++;
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%s,%d\n", c.a, c.b, c.c, c.s, c.d);
            }
        }
    } else
    if (method == 1) {
        // use fgets and simple parser
        char buf[1024];
        while (fgets(buf, sizeof(buf), f)) {
            char *p = buf;
            int i;
            line++;
            c.a = strtol(p, &p, 10);
            p += (*p == ',');
            c.b = strtol(p, &p, 10);
            p += (*p == ',');
            c.c = strtol(p, &p, 10);
            p += (*p == ',');
            for (i = 0; *p && *p != ','; p++) {
                c.s[i++] = *p;
            }
            c.s[i] = '\0';
            p += (*p == ',');
            c.d = strtol(p, &p, 10);
            if (*p != '\n') {
                fprintf(stderr, "csvgrep: invalid format at line %d\n", line);
                continue;
            }
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%s,%d\n", c.a, c.b, c.c, c.s, c.d);
            }
        }
    } else
    if (method == 2) {
        // use fgets and hand coded parser, positive numbers only, no string copy
        char buf[1024];
        while (fgets(buf, sizeof(buf), f)) {
            const char *p = buf;
            line++;
            c.a = getuint(p, &p);
            p += (*p == ',');
            c.b = getuint(p, &p);
            p += (*p == ',');
            c.c = getuint(p, &p);
            p += (*p == ',');
            c.spos = p - buf;
            while (*p && *p != ',') p++;
            c.slen = p - buf - c.spos;
            p += (*p == ',');
            c.d = getuint(p, &p);
            if (*p != '\n') {
                fprintf(stderr, "csvgrep: invalid format at line %d\n", line);
                continue;
            }
            if (select(c)) {
                count++;
                printf("%d,%d,%d,%.*s,%d\n", c.a, c.b, c.c, c.slen, buf + c.spos, c.d);
            }
        }
    } else {
        fprintf(stderr, "csvgrep: unknown method: %d\n", method);
        return 1;
    }
    fprintf(stderr, "csvgrep: %d records selected from %d lines\n", count, line);
    return 0;
}

int main(int argc, char *argv[]) {
    if (argc > 2 && strtol(argv[2], NULL, 0)) {
        // non zero second argument -> set a 1M I/O buffer
        setvbuf(stdin, NULL, _IOFBF, 1024 * 1024);
    }
    return csvgrep(stdin, argc > 1 ? strtol(argv[1], NULL, 0) : 0);
}

And here are some comparative benchmark figures:

$ time ./csvgen "u,u,u,s,u" 43000000 > 43m
real    0m34.428s    user    0m32.911s    sys     0m1.358s
$ time grep zz 43m
real    0m10.338s    user    0m10.069s    sys     0m0.211s
$ time wc -lc 43m
 43000000 1195458701 43m
real    0m1.043s     user    0m0.839s     sys     0m0.196s
$ time cat 43m > /dev/null
real    0m0.201s     user    0m0.004s     sys     0m0.195s
$ time ./csvgrep 0 < 43m > x0
csvgrep: 508 records selected from 43000000 lines
real    0m14.271s    user    0m13.856s    sys     0m0.341s
$ time ./csvgrep 1 < 43m > x1
csvgrep: 508 records selected from 43000000 lines
real    0m8.235s     user    0m7.856s     sys     0m0.331s
$ time ./csvgrep 2 < 43m > x2
csvgrep: 508 records selected from 43000000 lines
real    0m3.892s     user    0m3.555s     sys     0m0.312s
$ time ./csvgrep 2 1 < 43m > x3
csvgrep: 508 records selected from 43000000 lines
real    0m3.706s     user    0m3.488s     sys     0m0.203s
$ cmp x0 x1
$ cmp x0 x2
$ cmp x0 x3

As you can see, specializing the parse method provides a gain of almost 50% and hand coding the integer conversion and string scanning gains another 50%. Using a 1 megabyte buffer instead of the default size offers only a marginal gain of 0.2sec.

To further improve the speed, you can use mmap() to bypass the I/O streaming interface and make stronger assumptions about the file contents. In the above code, invalid formats are still handled gracefully, but you could remove some tests and shave an extra 5% from the execution time at the cost of reliability.
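For reference, here is a minimal mmap() sketch (POSIX only; it assumes the file fits comfortably in the address space and, to keep it short, only walks the lines rather than extracting the fields):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: csvmmap file\n");
        return 2;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    long lines = 0;
    const char *p = base, *end = base + st.st_size;
    while (p < end) {
        const char *nl = memchr(p, '\n', end - p);
        if (!nl) break;
        /* the field parsing of method 2 (getuint and the string scan)
           would go here, working directly on the mapped bytes */
        lines++;
        p = nl + 1;
    }
    fprintf(stderr, "%ld lines scanned\n", lines);
    munmap(base, st.st_size);
    close(fd);
    return 0;
}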

The above benchmark is performed on a system with an SSD drive and the file 43m fits in RAM, so the timings do not include much disk I/O latency. grep is surprisingly slow, and increasing the search string length makes it even worse... wc -lc sets a target for scanning performance (a factor of 4), but cat seems out of reach.

chqrlie
  • I liked this answer. Yes, I have tried multiple solutions. I thought that looping with `getc()` would have been faster, but somehow `fscanf()` is the fastest. It's currently taking me 30 seconds to read all 40 million. I can't make any hardware changes, but you say `you need a more fancy CSV parser, which is a lot of work but can be done and made fast`; would you give me an idea? – Emilio Basualdo Cibils Jul 09 '17 at 01:21
  • @PiloBasualdo: OK, I coded a benchmark for breakfast so you get an idea. A factor of 4x seems rather easy to obtain but if I/O latency is the bottleneck on your system, you may observe a much smaller improvement if at all. – chqrlie Jul 09 '17 at 09:00

Choose the right tool

With so much data (you speak of 43 million input lines from a CSV file), the hard disk I/O will be the bottleneck, and since you are dealing with flat text files, each time you need to do a different calculation (if you change your mind and want to do slightly different very very simple averages and then output all processed data to a file), you will have to go through this whole process again.

A better strategy would be to use a database management system, which is the proper tool for storing and processing large amounts of data, and which will give you the flexibility to do whatever processing you need, with indexed data, efficient handling of memory and caching, etc., using simple SQL commands.

If you do not want to set up an SQL server (such as MySQL or PostgreSQL), you can use a database management system that does not require a server, such as SQLite: http://www.sqlite.org/ which you can, in addition, drive from the command line with the sqlite3 shell program, from a C program if you wish (SQLite is actually a C library), or through a GUI interface such as http://sqlitestudio.pl/

SQLite will allow you to create your database, create your table, import the CSV files into it, do your calculations, and dump the results in a variety of formats.

Walk-through example with SQLite

Here is an example to get you started, illustrating use of the sqlite3 shell program and C code.

Let us say you have your data in data1.csv, in the format you described, containing:

1,2,3,variable length string with only alphanumerics and spaces,5
11,22,33,other variable length string with only alphanumerics and spaces,55
111,222,333,yet another variable length string with only alphanumerics and spaces,555

and in data2.csv containing:

2,3,4,from second batch variable length string with only alphanumerics and spaces,6
12,23,34,from second batch other variable length string with only alphanumerics and spaces,56
112,223,334,from second batch yet another variable length string with only alphanumerics and spaces,556

You create the database with the sqlite3 command-line utility, create the table with the proper format, import the CSV files, and issue SQL commands like this:

$ sqlite3 bigdatabase.sqlite3
SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
sqlite> create table alldata(col1 int, col2 int, col3 int, col4 varchar(255), col5 int);
sqlite> .mode csv
sqlite> .import data1.csv alldata
sqlite> .import data2.csv alldata
sqlite> select * from alldata;
1,2,3,"variable length string with only alphanumerics and spaces",5
11,22,33,"other variable length string with only alphanumerics and spaces",55
111,222,333,"yet another variable length string with only alphanumerics and spaces",555
2,3,4,"from second batch variable length string with only alphanumerics and spaces",6
12,23,34,"from second batch other variable length string with only alphanumerics and spaces",56
112,223,334,"from second batch yet another variable length string with only alphanumerics and spaces",556
sqlite> select avg(col2) from alldata;
82.5
sqlite> 

(press Ctrl-D to exit the SQLite shell)

Above, we created a bigdatabase.sqlite3 file containing the database handled by SQLite and a table alldata; we imported the CSV data into that table, displayed the data it contains (don't do that on 43 million lines), and calculated and displayed the average of the integers in the column we named col2, which happens to be the second column.

You can use the created SQLite database with C and the SQLite library, to achieve the same.

Create a sqlite-average.c file (adapted from the example found in SQLite's Quickstart Page) like this:

#include <stdio.h>
#include <sqlite3.h>

static int callback(void *NotUsed, int argc, char **argv, char **azColName) {
    int i;
    for(i=0; i<argc; i++){
        printf("%s = %s\n", azColName[i], argv[i] ? argv[i] : "NULL");
    }
    printf("\n");
    return 0;
}

int main(void) {
    sqlite3 *db;
    char *zErrMsg = 0;
    int rc;

    rc = sqlite3_open("bigdatabase.sqlite3", &db);
    if (rc) {
        fprintf(stderr, "Can't open database: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }
    rc = sqlite3_exec(db, "select avg(col2) from alldata;", callback, 0, &zErrMsg);
    if (rc!=SQLITE_OK){
        fprintf(stderr, "SQL error: %s\n", zErrMsg);
        sqlite3_free(zErrMsg);
    }
    sqlite3_close(db);

    return 0;
}

Compile it, linking against the installed SQLite library; with gcc, that looks like this:

$ gcc -Wall sqlite-average.c -lsqlite3

Run the compiled executable:

$ ./a.out
avg(col2) = 82.5

$

You will likely want to create indexes on the columns where you look for data, for instance columns 2 and 5 in this table, so that information is fetched from them more quickly:

sqlite> create index alldata_idx ON alldata(col2,col5);

Decide, if applicable, which column will contain the primary key for the table, etc.
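If you prefer to set this up from C as well, here is a hedged sketch using sqlite3_exec(); treating col1 as the primary key is only an assumption for illustration, pick whichever column actually identifies your rows:

#include <stdio.h>
#include <sqlite3.h>

/* Sketch: create the table with an explicit primary key plus the index
   suggested above. The choice of col1 as primary key is an assumption
   made purely for illustration. */
int create_schema(sqlite3 *db) {
    char *zErrMsg = 0;
    const char *sql =
        "CREATE TABLE IF NOT EXISTS alldata("
        " col1 INTEGER PRIMARY KEY,"
        " col2 INT, col3 INT, col4 VARCHAR(255), col5 INT);"
        "CREATE INDEX IF NOT EXISTS alldata_idx ON alldata(col2, col5);";
    int rc = sqlite3_exec(db, sql, 0, 0, &zErrMsg);
    if (rc != SQLITE_OK) {
        fprintf(stderr, "SQL error: %s\n", zErrMsg);
        sqlite3_free(zErrMsg);
        return 1;
    }
    return 0;
}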

For more information, check the SQLite documentation.

Tardis
  • This is a really great sqlite tutorial, but I'm not at all convinced that sqlite or sql are good fits for this. I don't think you'll get 43 million rows of anything into sqlite in 40 seconds, much less process them. Big data reasoning is starting to apply around that size. – Sam Hartman Jul 07 '17 at 13:07
  • If you really only have one calculation, and you know you need a full scan, then custom code is likely the right answer. – Sam Hartman Jul 07 '17 at 13:08
  • @SamHartman, my point is that doing all the parsing and disk I/O involved in scanning a flat text file is the main time-consuming task, and that with the pure C program approach, he needs to repeat this process on flat text files whenever the requirements (what he wants to calculate) change. It is obvious that the OP does not need a one-time operation, otherwise he would not be concerned that 40 seconds for whatever he is doing is too much. A DBMS will handle the import of CSV data, the memory management problems, etc., once and for all, and he can concentrate on his queries. – Tardis Jul 07 '17 at 13:36
  • @SamHartman, you refer to "big data reasoning"; if you mean NoSQL, see: https://sqlite.org/whentouse.html and https://stackoverflow.com/questions/1033309/sqlite-for-large-data-sets - also, the OP's use case is consistently structured data, suitable for SQL, and he seems to require read-only operations, with no joins. Anyway, I gave him an easy how-to to follow so he can try it out. SQLite was an example among other DBMSs, which I still think are the proper tool for the job. I chose to show how to use it because it is easy to try, and not necessarily because he should choose this one over the others. – Tardis Jul 07 '17 at 14:27