4

I need to read many big CSV files (ranging from a few MB to hundreds of MB) and process them in C++. At first I open the file with fstream, use getline to read each line, and use the following function to split each row:

template <class ContainerT>
void split(ContainerT& tokens, const std::string& str, const std::string& delimiters = " ", bool trimEmpty = false)
{
    std::string::size_type pos, lastPos = 0, length = str.length();

    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;

    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos)
        {
            pos = length;
        }

        if (pos != lastPos || !trimEmpty)
            tokens.push_back(value_type(str.data() + lastPos,
                                        (size_type)(pos - lastPos)));

        lastPos = pos + 1;
    }
}

I tried boost::split, boost::tokenizer and boost::sprint and found that the above gives the best performance so far. After that, I considered reading the whole file into memory for processing rather than keeping the file open, and I use the following function to read in the whole file:

void ReadinFile(string const& filename, stringstream& result)
{
    ifstream ifs(filename, ios::binary | ios::ate);
    ifstream::pos_type pos = ifs.tellg();

    char * buf = new char[pos];
    ifs.seekg(0, ios::beg);
    ifs.read(buf, pos);
    result.write(buf, pos);
    delete[] buf;
}

Both functions were copied from somewhere on the net. However, I find that there is not much difference in performance between keeping the file open and reading the whole file in. The performance measurements are as follows:

Process 2100 files with boost::split (without read in whole file) 832 sec
Process 2100 files with custom split (without read in whole file) 311 sec
Process 2100 files with custom split (read in whole file) 342 sec

Below is the sample content of one type of file. I have 6 types to handle, but all are similar.

a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0
a1,10,2,5,5,1,1,2,0,0,12,0,50,18,106,33,100,29,45,9,8,0,1,1,0,0,0
a1,19,3,5,5,1,1,3,0,0,18,0,12,12,52,40,82,49,63,41,23,16,8,2,0,0,0
a1,28,4,5.5,5,1,1,4,0,0,24,0,2,3,17,16,53,53,63,62,43,44,18,22,4,0,4
a1,37,5,3,5,1,1,5,0,0,6,0,157,22,129,18,57,11,6,0,0,0,0,0,0,0,0
a1,46,6,4.5,5,1,1,6,0,0,12,0,41,19,121,31,90,34,37,15,6,4,0,2,0,0,0
a1,55,7,5.5,5,1,1,7,0,0,18,0,10,9,52,36,86,43,67,38,31,15,5,7,1,0,1
a1,64,8,5.5,5,1,1,8,0,0,24,0,0,3,18,23,44,55,72,57,55,43,8,19,1,2,3
a1,73,9,3.5,5,1,1,9,1,0,6,0,149,17,145,21,51,8,8,1,0,0,0,0,0,0,0
a1,82,10,4.5,5,1,1,10,1,0,12,0,47,17,115,35,96,36,32,10,8,3,1,0,0,0,0

My questions are:

1 Why does reading in the whole file perform worse than not reading in the whole file?

2 Is there any better string split function?

3 The ReadinFile function needs to read into a buffer and then write to a stringstream for processing; is there any way to avoid this, i.e. read directly into the stringstream?

4 I need to use getline to parse each line (ending with \n) and use split to tokenize each row. Is there a function similar to getline that works on a string, e.g. getline_str, so that I can read from a string directly?

5 How about reading the whole file into a string, splitting the whole string into a vector on '\n', and then splitting each string in the vector on ',' to process? Will this perform better? And what is the limit (max size) of a string?

6 Or should I define a struct like this (based on the format):

struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

and read directly into a vector? How can I do this?

Thanks a lot.

Regds

LAM Chi-fung


6 Answers

3

Whenever you have to care about performance, it's good to try alternatives and measure their performance. What follows is some help implementing one of the options you ask about in your question.

Given each structure you want to read, such as your example...

struct MyStruct {
  string Item1;
  int It2_3[2];
  float It4;
  int ItRemain[23];
};

...you can read and parse the fields using fscanf. Unfortunately, it's a C library function that doesn't support std::strings, so you'll need to create a character array buffer for each string field and then copy from there to your structure's field. All up, something like:

char Item1[4096];
MyStruct m;
std::vector<MyStruct> myStructs;
FILE* stream = fopen(filename, "r");
assert(stream);
while (fscanf(stream, "%[^,],%d,%d,%f,%d,%d,%d,%d...",
              Item1, &m.It2_3[0], &m.It2_3[1], &m.It4,
              &m.ItRemain[0], &m.ItRemain[1], &m.ItRemain[2], ...) == 27)
{
    myStructs.push_back(m);
    myStructs.back().Item1 = Item1;  // fix the std::strings
}
fclose(stream);

(just put the right number of %ds in the format string and complete the other ItRemain indices).


Separately, I'm reluctant to recommend it as it's more advanced programming you may struggle with, but memory mapping the file and writing your own parsing has a good chance of being several times faster than the fscanf approach above (but again, you won't know until it's measured on your hardware). If you're a scientist trying to do something serious, maybe pair with a professional programmer to get this done for you.
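
For what it's worth, a minimal sketch of that memory-mapped approach on a POSIX system (mmap; on Windows the equivalent is CreateFileMapping/MapViewOfFile) could look something like the following. It only counts fields and lines; the real parsing/conversion would go where the comment indicates, and none of this is tuned or tested against your data:

// Minimal sketch of memory-mapping plus a single scan (POSIX mmap assumed).
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) == -1) { perror("fstat"); return 1; }
    if (sb.st_size == 0) { close(fd); return 0; }

    // Map the whole file read-only; the OS pages it in on demand.
    char *data = static_cast<char*>(
        mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Example scan: walk the mapping once, splitting on ',' and '\n'.
    size_t lines = 0, fields = 0;
    for (off_t i = 0; i < sb.st_size; ++i) {
        if (data[i] == ',') ++fields;
        else if (data[i] == '\n') { ++fields; ++lines; }
        // parse/convert the current field here instead of just counting
    }
    printf("lines: %zu fields: %zu\n", lines, fields);

    munmap(data, sb.st_size);
    close(fd);
    return 0;
}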

Tony Delroy
  • It works, just that %s has some problems: it reads the whole line into it since the line has no space to terminate it. I am trying %c%d (since this field only contains 1 char and a 1~2 digit number, e.g. a1 a3 a5 a7 a9 g21 g23 etc.); still working on it. – Chi-fung LAM Apr 25 '18 at 11:34
  • 2
    Done. Finally I use: fscanf(stream, "%*[^,],%d,%*d,%f,%*s", something like that; I ignore the fields that are not used and performance greatly improved! Now it is well below 150 sec for 2100 files (6.3 GB data total). Thx. – Chi-fung LAM Apr 25 '18 at 17:57
1

One basic consideration when trying to craft a fast input routine is to avoid reading and handling each character from the file more than once. Granted, this is not possible when converting to a numeric value, as the conversion routine will rescan the characters, but on balance that is the goal. You should also try to limit the number of function calls and avoid as much overhead as possible. When manipulating fields greater than 16-32 chars, the string and conversion function optimizations will almost always outperform what you write on your own, but for smaller fields -- that's not always true.

As far as buffer size goes, the C/C++ library will provide a default read buffer derived from IO_BUFSIZ in the gcc source. The constant is available as BUFSIZ in C/C++ (with gcc it is 8192 bytes, with VS cl.exe it is 512 bytes). So when reading from the file, the I/O functions will have BUFSIZ chars available for use without going back to the disk. You can use this to your advantage. So whether you are processing a character at a time, or reading from the file into a 100k sized buffer, the number of disk I/O calls will be the same. (This is a bit counter-intuitive.)
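
As a small illustration of that point (not a guaranteed win, just something to measure), you can inspect BUFSIZ and, if you want to experiment, hand the stream a larger buffer with setvbuf; the file name below is a placeholder:

// Show the default stdio buffer size and optionally enlarge it with setvbuf.
#include <cstdio>

int main()
{
    std::printf("default stdio buffer (BUFSIZ): %d bytes\n", BUFSIZ);

    FILE *fp = std::fopen("data.csv", "r");   // "data.csv" is a placeholder name
    if (!fp) return 1;

    // Optionally give this stream a 64 KiB fully-buffered read buffer;
    // must be done before the first read.
    static char bigbuf[64 * 1024];
    std::setvbuf(fp, bigbuf, _IOFBF, sizeof bigbuf);

    int c, count = 0;
    while ((c = std::fgetc(fp)) != EOF)       // fgetc now pulls from the larger buffer
        if (c == '\n') ++count;
    std::printf("lines: %d\n", count);
    std::fclose(fp);
    return 0;
}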

Reading into a buffer and then calling strtok or sscanf is efficient, but when trying to eke every bit of speed out of your read, both involve traversing the characters you have already read at least a second time, and with the conditionals and checks both provide, you may be able to do a bit better.

I agree with Tony's answer whole-heartedly, you will just have to try different approaches, timing each, to determine what combinations will work best for your data.

Looking at your data, which is a short char label and then mixed float and int values (of 1000 or less) to the end of each record, one optimization that comes to mind is to simply handle the label and then treat the remaining values as float. The float representation of integers will be exact over the range of your values, so you can essentially handle the read and conversion (and storage) in a simplified form.

Assuming you do not know the number of records you have, nor the number of fields you have following each label, you need to start with a fairly generic read that will dynamically allocate storage for records, as required, and also, for the first record, allocate storage for as many fields as may be required until you have determined the number of fields in each record -- from that point on, you can allocate for an exact number of fields for each record -- and validate that each record has the same number of fields.

Since you are looking for speed, a simple C routine to read and allocate storage may provide advantages over a C++ implementation (it will certainly minimize the allocation for storage).

As a first attempt, I would approach reading the file with a character-oriented function like fgetc, relying on the underlying BUFSIZ read buffer to handle the disk I/O efficiently, and then simply write a state loop to parse the values from each record into a struct for storage.

A short example for you to test and compare with your other routines would be similar to the following. If you are on a Unix/Linux box, you can use clock_gettime for nanosecond timing, on windows, you will need QueryPerformanceCounter for microsecond timing. The read routine itself could be:

#include <stdio.h>
#include <stdlib.h>     /* for calloc, strtof */
#include <string.h>     /* for memset */
#include <errno.h>      /* strtof validation */

#define LABEL      3    /* label length (+1 for nul-character) */
#define NRECS      8    /* initial number of records to allocate */
#define NFLDS  NRECS    /* initial number of fields to allocate */
#define FLDSZ     32    /* max chars per-field (to size buf) */

typedef struct {
    char label[LABEL];  /* label storage */
    float *values;      /* storage for remaining values */
} record_t;

/* realloc function doubling size allocated */
void *xrealloc (void *ptr, size_t psz, size_t *nelem);

int main (int argc, char **argv) {

    int lblflag = 1, n = 0; /* label flag, index for buf */
    size_t col = 0,         /* column index */
           idx = 0,         /* record index */
           ncol = 0,        /* fixed number of cols - 1st rec determines */
           nflds = NFLDS,   /* tracks no. of fields allocated per-rec */
           nrec = NRECS;    /* tracks no. of structs (recs) allocated */
    char buf[FLDSZ] = "";   /* fixed buffer for field parsing */
    record_t *rec = NULL;   /* pointer to record_t structs */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin; /* file or stdin */

    if (!fp) {  /* validate file open for reading */
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    /* allocate/validate initial storage for nrec record_t */
    if (!(rec = calloc (nrec, sizeof *rec))) {
        perror ("calloc-rec");
        return 1;
    }

    /* allocate/validate initial storage for nflds values */
    if (!(rec[idx].values = calloc (nflds, sizeof *rec[idx].values))) {
        perror ("calloc-rec[idx].values");
        return 1;
    }

    for (;;) {                          /* loop continually until EOF */
        int c = fgetc (fp);             /* read char */
        if (c == EOF)                   /* check EOF */
            break;
        if (c == ',' || c == '\n') {    /* field separator or \n reached */
            char *p = buf;              /* ptr for strtof validation */
            buf[n] = 0;                 /* nul-terminate buf */
            n = 0;                      /* reset buf index zero */
            if (!lblflag) {             /* not lblflag (for branch prediction) */
                errno = 0;              /* zero errno */
                rec[idx].values[col++] = strtof (buf, &p);  /* convert buf */
                if (p == buf) {     /* p == buf - no chars converted */
                    fputs ("error: no characters converted.\n", stderr);
                    return 1;
                }
                if (errno) {        /* if errno - error during conversion */
                    perror ("strof-failed");
                    return 1;
                }
                if (col == nflds && !ncol)  /* realloc cols for 1st row as reqd */
                    rec[idx].values = xrealloc (rec[idx].values, 
                                            sizeof *rec[idx].values, &nflds);
            }
            else {                      /* lblflag set */
                int i = 0;
                do {    /* copy buf - less than 16 char, loop faster */
                    rec[idx].label[i] = buf[i];
                } while (buf[i++]);
                lblflag = 0;            /* zero lblflag */
            }
            if (c == '\n') {        /* if separator was \n */
                if (!ncol)          /* 1st record, set ncol from col */
                    ncol = col;
                if (col != ncol) {  /* validate remaining records against ncol */
                    fputs ("error: unequal columns in file.\n", stderr);
                    return 1;
                }
                col = 0;            /* reset col = 0 */
                lblflag = 1;        /* set lblflag 1 */
                idx++;              /* increment record index */
                if (idx == nrec)    /* check if realloc required */
                    rec = xrealloc (rec, sizeof *rec, &nrec);
                /* allocate values for next record based on now set ncol */
                if (!(rec[idx].values = calloc (ncol, sizeof *rec[idx].values))) {
                    perror ("calloc-rec[idx].values");
                    return 1;
                }
            }
        }
        else if (n < FLDSZ) /* normal char - check index will fit */
            buf[n++] = c;   /* add char to buf */
        else {  /* otherwise chars exceed FLDSZ, exit, fix */
            fputs ("error: chars exceed FLDSZ.\n", stdout);
        }
    }
    if (fp != stdin) fclose (fp);   /* close file if not stdin */
    /* add code to handle last field on non-POSIX EOF here */
    if (!*rec[idx].label) free (rec[idx].values);  /* free unused last alloc */

    printf ("records: %zu   cols: %zu\n\n", idx, ncol); /* print stats */

    for (size_t i = 0; i < idx; i++) {      /* output values (remove) */
        fputs (rec[i].label, stdout);
        for (size_t j = 0; j < ncol; j++)
            printf (" %3g", rec[i].values[j]);
        free (rec[i].values);               /* free values */
        putchar ('\n');
    }
    free (rec);     /* free structs */

    return 0;
}

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}

Example Use/Output

$ ./bin/readlargecsvbuf dat/large.csv
records: 10   cols: 26

a1   1   1 3.5   5   1   1   1   0   0   6   0 155  21 142  22  49   1   9   1   0   0   0   0   0   0   0
a1  10   2   5   5   1   1   2   0   0  12   0  50  18 106  33 100  29  45   9   8   0   1   1   0   0   0
a1  19   3   5   5   1   1   3   0   0  18   0  12  12  52  40  82  49  63  41  23  16   8   2   0   0   0
a1  28   4 5.5   5   1   1   4   0   0  24   0   2   3  17  16  53  53  63  62  43  44  18  22   4   0   4
a1  37   5   3   5   1   1   5   0   0   6   0 157  22 129  18  57  11   6   0   0   0   0   0   0   0   0
a1  46   6 4.5   5   1   1   6   0   0  12   0  41  19 121  31  90  34  37  15   6   4   0   2   0   0   0
a1  55   7 5.5   5   1   1   7   0   0  18   0  10   9  52  36  86  43  67  38  31  15   5   7   1   0   1
a1  64   8 5.5   5   1   1   8   0   0  24   0   0   3  18  23  44  55  72  57  55  43   8  19   1   2   3
a1  73   9 3.5   5   1   1   9   1   0   6   0 149  17 145  21  51   8   8   1   0   0   0   0   0   0   0
a1  82  10 4.5   5   1   1  10   1   0  12   0  47  17 115  35  96  36  32  10   8   3   1   0   0   0   0

This may or may not be significantly faster than what you are using, but it would be worth a comparison -- as I suspect it may provide a bit of improvement.
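
For the timing comparisons mentioned above, a minimal clock_gettime helper (POSIX assumed; on Windows substitute QueryPerformanceCounter, as noted) might look like:

// Minimal timing helper for comparing parsers (POSIX clock_gettime assumed).
#include <cstdio>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main()
{
    double t0 = now_sec();
    /* ... run the parsing routine under test here ... */
    double t1 = now_sec();
    std::printf("elapsed: %.6f sec\n", t1 - t0);
    return 0;
}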

David C. Rankin
1

Eventually I used a memory mapped file to solve my problem; performance is much better than when I used fscanf. Since I work on MS Windows, I use Stephan Brumme's "Portable Memory Mapping C++ Class" http://create.stephan-brumme.com/portable-memory-mapping/ Since I don't need to deal with file(s) > 2 GB, my implementation is simpler. For files over 2 GB, visit the web page to see how to handle them.

Below is my piece of code:

// may tried RandomAccess/SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

// point to start of memory file
char* start = (char*)MemFile.getData();
// dummy in my case
char* tmpBuffer = start;

// looping counter
uint64_t i = 0;

// pre-allocate result vector
MyVector.resize(300000);

// Line counter
int LnCnt = 0;

//no. of fields
const int NumOfField=43;
//delimiter count, num of fields + 1 since the leading and trailing delimiters are virtual
const int DelimCnt=NumOfField+1;
//Delimiter position. May use new to allocate at run time
// or even use vector of integer
// This is to store the delimiter position in each line
// since the position is relative to start of file. if file is extremely
// large, may need to change from int to unsigned, long or even unsigned long long
static  int DelimPos[DelimCnt];

// Max number of fields to read; usually equal to NumOfField but can be smaller,
// e.g. in my case I only need 4 fields from the first 15, so I could assign 15 to MaxFieldNeed
int MaxFieldNeed=NumOfField;
// keep track how many comma read each line
int DelimCounter=0;
// define field and line separator
char FieldDelim=',';
char LineSep='\n';

// 1st field, "virtual Delimiter" position
DelimPos[DelimCounter]=-1;
DelimCounter++;

// loop through the whole memory field, 1 and only once
for (i = 0; i < MemFile.size();i++)
{
  // grab all position of delimiter in each line
  if ((MemFile[i] == FieldDelim) && (DelimCounter<=MaxFieldNeed)){
    DelimPos[DelimCounter] = i;
    DelimCounter++;
  };

  // grab all values when end of line hit
  if (MemFile[i] == LineSep) {
    // no need to use if (DelimCounter==NumOfField) just assign anyway, waste a little bit
    // memory in integer array but gain performance 
    DelimPos[DelimCounter] = i;
    // I know exactly what the format is and what field(s) I want
    // a more general approach (as a CSV reader) may put all fields
    // into vector of vector of string
    // With *EFFORT* one may modify this piece of code so that it can parse
    // different format at run time eg similar to:
    // fscanf(fstream,"%d,%f....
    // also, this piece of code cannot handle complex CSV e.g.
    // Peter,28,157CM
    // John,26,167CM
    // "Mary,Brown",25,150CM
    MyVector[LnCnt].StrField = string(start+DelimPos[0] + 1, start+DelimPos[1] - 1);
    MyVector[LnCnt].IntField = strtol(start+DelimPos[3] + 1,&tmpBuffer,10);
    MyVector[LnCnt].IntField2 = strtol(start+DelimPos[8] + 1,&tmpBuffer,10);
    MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1,&tmpBuffer);
    // reset Delim counter each line
    DelimCounter=0;
    // previous line separator treated as first delimiter of next line
    DelimPos[DelimCounter] = i;
    DelimCounter++;
    LnCnt++;
  }
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();
};

With this piece of code, I handle 2100 files (6.3 GB) in 57 seconds!!! (I hard-code the CSV format in it and only grab 4 values from each line.) Thanks for everyone's help; you all inspired me in solving this problem.

Chi-fung LAM
0

Mainly you want to avoid copying.

If you can afford the memory to load the whole file into an array, then use that array directly; don't convert it back to a stringstream, as that makes another copy. Just process the big buffer!

On the other hand, that requires your machine to free up adequate RAM for your allocation, and possibly page some RAM to disk, which will be slow to process. The alternative is to load your file in large chunks, identify the lines in each chunk, and only copy down the partial line at the end of the chunk before loading the next portion of the file to concatenate to that partial line (a wrap and read).
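
A rough sketch of that chunk-and-carry idea, with an arbitrary 1 MiB window and a placeholder file name (the actual per-line processing is left as a comment):

// Read the file in large chunks, carrying the unfinished last line to the next chunk.
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    std::FILE *fp = std::fopen("big.csv", "rb");   // "big.csv" is a placeholder name
    if (!fp) return 1;

    std::vector<char> chunk(1 << 20);   // 1 MiB read window
    std::string carry;                  // unfinished line carried from the previous chunk

    size_t got;
    while ((got = std::fread(chunk.data(), 1, chunk.size(), fp)) > 0) {
        size_t begin = 0;
        for (size_t i = 0; i < got; ++i) {
            if (chunk[i] == '\n') {
                // a complete line = carried prefix + this chunk's slice
                std::string line = carry + std::string(chunk.data() + begin, i - begin);
                carry.clear();
                // split `line` on ',' and process it here
                begin = i + 1;
            }
        }
        carry.append(chunk.data() + begin, got - begin);  // keep the partial tail
    }
    // `carry` may still hold a final line without a trailing '\n'
    std::fclose(fp);
    return 0;
}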

Another option is that most operating systems provide a memory-mapped file view which means the OS does the file copying for you. These are more constrained (you have to use fixed block sizes and offsets) but will be faster.

You can use methods like strtok_r to split your file into lines, and lines into fields, though you need to deal with escaped field markers - you need to do that anyway. It is possible to write a tokeniser that works like strtok but returns string_view-like ranges instead of actually inserting null bytes.
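
One possible shape for such a tokeniser, assuming C++17 std::string_view is available (quoted/escaped fields deliberately not handled, and each returned view is only valid while the underlying buffer is):

// strtok-like splitting that returns views into the original buffer
// instead of writing '\0' into it.
#include <string_view>
#include <vector>

std::vector<std::string_view> split_view(std::string_view s, char delim)
{
    std::vector<std::string_view> out;
    size_t start = 0;
    while (start <= s.size()) {
        size_t pos = s.find(delim, start);
        if (pos == std::string_view::npos) pos = s.size();
        out.push_back(s.substr(start, pos - start));  // view into the original buffer
        start = pos + 1;
    }
    return out;
}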

Finally you may need to convert some of the field strings to numeric forms or otherwise interpret them. Ideally don't use istringstream, as that makes another copy of the string. If you must, perhaps craft your own streambuf that uses the string_view directly, and attach it to an istream?
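
This is not the custom-streambuf route suggested above, but if C++17 <charconv> is available, std::from_chars converts a numeric field straight from its character range with no intermediate std::string or stream at all; a tiny illustration:

// Copy-free numeric conversion from a string_view range (C++17 <charconv>).
#include <charconv>
#include <string_view>
#include <system_error>
#include <cstdio>

int main()
{
    std::string_view field = "155";          // e.g. one token from split_view above
    int value = 0;
    auto res = std::from_chars(field.data(), field.data() + field.size(), value);
    if (res.ec == std::errc())               // success
        std::printf("parsed %d\n", value);
    return 0;
}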

So this should significantly reduce the amount of data copying going on, and you should see a speed-up.

Be aware that you can only directly access the fields and lines that you have in your file read window. When you wrap and read, any references you have into that data become useless rubbish.

Gem Taylor
  • Eventually I count the lines with count(vect_char.begin(), vect_char.end(), '\n'). I read the whole file into a vector, count the lines, then break it into a vector of lines and resize the result vector to the line count. Then I break each line into another vector on ',' and process it. Now, processing 6.3 GB (2100 files total) takes around 200 sec. I will see if there are any methods that can further reduce the time. – Chi-fung LAM Apr 25 '18 at 04:05
0

1 Why does reading in the whole file perform worse than not reading in the whole file?

Three words: locality of reference.

On-chip operations of modern CPUs are ridiculously fast, to the point where in many situations the number of CPU-cycles a program requires to execute has only a very small effect on the overall performance of a program. Instead, often the time it takes to complete a task is mostly or totally determined by the speed at which the RAM subsystem can supply data to the CPU, or (even worse) the speed at which the hard disk can supply data to the RAM subsystem.

Computer designers try to hide the giant discrepancy between CPU-speed and RAM-speed (and the further giant discrepancy between RAM-speed and disk-speed) through caching; for example, when a CPU first wants to access data on a particular 4kB page of RAM, it's going to have to sit and twiddle its thumbs for (what seems to the CPU) a very long time before that data is delivered from RAM to the CPU. But after that first painful wait, the second CPU-access to nearby data within that same page of RAM will be quite fast, because at that point the page is cached within the CPU's on-chip cache and the CPU no longer has to wait for it to be delivered.

But the CPU's on-chip caches are (relatively) small -- nowhere near large enough to fit an entire 100+MB file. So when you load a giant file into RAM, you're forcing the CPU to do two passes across a large area of memory -- the first pass to read all the data in, and then a second pass when you go back to parse all the data.

Assuming your program is RAM-bandwidth limited (and for this simple parsing task it definitely should be), that means two scans across the data will take roughly twice as long as doing everything within a single scan.

2 Is there any better string split function?

I've always kind of liked strtok(), since you can be pretty confident it's not going to do anything inefficient (like call malloc()/free()) behind your back. Or if you wanted to get really crazy, you could write your own mini-parser using a char * pointer and a for-loop, although I doubt it would end up being noticeably faster than a strtok()-based loop anyway.

3 The ReadinFile function needs to read into a buffer and then write to a stringstream for processing; is there any way to avoid this, i.e. read directly into the stringstream?

I'd say just put a while()-loop around fgets(), and then after each call to fgets() has read in a line of CSV-text, have an inner while()-loop around strtok() to parse out the fields within that line. For maximum efficiency, it's hard to go wrong with good old-fashioned C-style I/O.
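
A bare-bones sketch of that fgets()/strtok() combination (file name is a placeholder, and what you do with each field is up to you):

// Outer loop reads one CSV line with fgets(); inner loop splits it with strtok().
#include <cstdio>
#include <cstring>

int main()
{
    std::FILE *fp = std::fopen("data.csv", "r");   // "data.csv" is a placeholder name
    if (!fp) return 1;

    char line[4096];
    while (std::fgets(line, sizeof line, fp)) {                 // outer loop: one CSV line
        for (char *tok = std::strtok(line, ",\n"); tok != nullptr;
             tok = std::strtok(nullptr, ",\n")) {
            // inner loop: one field; convert numeric fields with strtol()/strtof()
            // note that strtok() collapses adjacent delimiters (empty fields vanish)
            (void)tok;
        }
    }
    std::fclose(fp);
    return 0;
}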

5 How about reading the whole file into a string, splitting the whole string into a vector on '\n', and then splitting each string in the vector on ',' to process? Will this perform better? And what is the limit (max size) of a string?

I seriously doubt you would get better performance doing that. The string class is not really designed to operate efficiently on multi-megabyte strings.

6 Or should I define a struct like this (based on the format) [...] and read directly into a vector? How can I do this?

Yes, that's a good idea -- if you can do everything in a single pass you will come out ahead, efficiency-wise. You should just be able to declare (e.g.) a vector<struct MyStruct> and for each line you parse in the file, write the parsed values into a MyStruct object as you are parsing them (e.g. with atoi()), and then after the MyStruct object is fully populated/written-to, push_back(myStruct) to the end of the vector.

(The only thing faster than that would be to get rid of the vector<struct MyStruct> as well, and just do (whatever it is you need to do) with the data right there inside your parsing-loop, without bothering to store the entire data set in a big vector at all. That could be an option e.g. if you just needed to calculate the sum of all the items in each field, but OTOH it may not be possible for your use-case)
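
For illustration only, a single-pass sketch of that idea using the struct from the question (placeholder file name; strtol/strtof used in place of atoi, and error handling omitted):

// Parse each line straight into a MyStruct and push_back it onto a vector.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

struct MyStruct {
    std::string Item1;
    int It2_3[2];
    float It4;
    int ItRemain[23];
};

int main()
{
    std::FILE *fp = std::fopen("data.csv", "r");          // placeholder file name
    if (!fp) return 1;

    std::vector<MyStruct> rows;
    char line[4096];
    while (std::fgets(line, sizeof line, fp)) {
        MyStruct m{};
        char *p = std::strchr(line, ',');                  // end of the label field
        if (!p) continue;
        m.Item1.assign(line, p - line);
        ++p;                                               // step past the ','
        m.It2_3[0] = std::strtol(p, &p, 10); if (*p) ++p;
        m.It2_3[1] = std::strtol(p, &p, 10); if (*p) ++p;
        m.It4      = std::strtof(p, &p);     if (*p) ++p;
        for (int i = 0; i < 23 && *p; ++i) {               // remaining integer fields
            m.ItRemain[i] = std::strtol(p, &p, 10);
            if (*p) ++p;                                   // skip ',' / stop at end
        }
        rows.push_back(std::move(m));
    }
    std::fclose(fp);
    std::printf("parsed %zu rows\n", rows.size());
    return 0;
}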

Jeremy Friesner
  • Yes, eventually I use traditional C stuff to get the job done: fscanf(stream, "%*[^,],%d,%*d,%f,%*s". I ignore the fields that are not used (sacrificing flexibility for performance) and it is well below 150 sec now (for 2100 files, 6.3 GB data). – Chi-fung LAM Apr 25 '18 at 18:00
0

What you need is memory-mapping.

You can find more here.

DAG