
My current parser is given below. Reading a ~10MB CSV into an STL vector takes ~30 seconds, which is too slow for my liking, given I've got over 100MB that needs to be read in every time the program is run. Can anyone give some advice on how to improve performance? Indeed, would it be faster in plain C?

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::istream& operator >> (std::istream& ins, std::vector<double>& data);

int main() {
    std::vector<double> data;
    std::ifstream infile( "data.csv" );
    infile >> data;
    std::cin.get();
    return 0;
}

std::istream& operator >> (std::istream& ins, std::vector<double>& data)
{
    data.clear();

    // First pass: count rows and columns so the vector can be reserved
    std::string line, field;
    std::getline(ins, line);
    std::stringstream ssl(line), ssf;

    std::size_t rows = 1, cols = 0;
    while (std::getline(ssl, field, ',')) cols++;
    while (std::getline(ins, line)) rows++;

    std::cout << rows << " x " << cols << "\n";

    ins.clear(); // clear eof state so we can seek back to the start
    ins.seekg(0);

    data.reserve(rows * cols);

    // Second pass: parse every comma-separated field into the vector
    double f = 0.0;
    while (std::getline(ins, line)) {
        ssl.str(line);
        ssl.clear();
        while (std::getline(ssl, field, ',')) {
            ssf.str(field);
            ssf.clear();
            ssf >> f;
            data.push_back(f);
        }
    }
    return ins;
}

NB: I also have OpenMP at my disposal, and the contents will eventually be used for GPGPU computation with CUDA.

vaultah
mchen
    Did you try profiling this to see where the bottleneck is? Also, what platform and compiler are you using, and what optimisation settings? – Paul R Apr 29 '13 at 22:14
  • Try not using std::vector; use a memory structure which preallocates the memory (an array, for example). – Kamyar Souri Apr 29 '13 at 22:44
  • @PaulR - many thanks. I'm on MSVC 2010, and switching to release builds is much faster than debug - I take it you can't use compiler optimization in debug builds? – mchen Apr 29 '13 at 22:49
  • You can have a "hybrid" build model, where you enable SOME optimization (in particular, turning off debugging of `operator[]` and iterators) [_ITERATOR_DEBUG_LEVEL](http://msdn.microsoft.com/en-us/library/hh697468.aspx), but still have debug symbols and not so aggressive inlining/code-munging that tends to lead to "undebuggable code" (because the generated code doesn't look like what you wrote, everything is in registers, etc). I doubt that use of `std::vector` is that bad. – Mats Petersson Apr 29 '13 at 23:44
  • I would read the whole thing into memory, then parse it with raw code. I see that stuff inside the inner loop, like `ssf >> f` and `push_back`, and I see lots of room for probable speedup. – Mike Dunlavey Apr 30 '13 at 21:49
  • It's worth mentioning that this will only parse extremely simple CSV... which might be fine for you. – Cory Nelson May 29 '15 at 14:59

4 Answers

5

You could halve the time by reading the file once and not twice.

While presizing the vector is beneficial, it will never dominate the runtime, because the I/O will always be slower by an order of magnitude.

Another possible optimization is reading without a string stream. Something like (untested):

int c = 0;
while (ins >> f) {
    data.push_back(f);
    if (++c < cols) {      // cols = number of columns, counted beforehand
        char comma;
        ins >> comma; // skip comma
    } else {
        c = 0; // end of line, start next line
    }
}

If you can omit the `,` and separate the values by whitespace only, it could be even simpler:

while (ins >> f)
    data.push_back(f);

or

std::copy(std::istream_iterator<double>(ins), std::istream_iterator<double>(),
          std::back_inserter(data));
Olaf Dietsche
    @MiloChen It's more or less the C++ equivalent of `scanf`, but without a format string. It is type safe and reads formatted input from a stream. You can look at [`istream::operator>>`](http://www.cplusplus.com/reference/istream/istream/operator%3E%3E/) for more details. – Olaf Dietsche Apr 29 '13 at 22:51
  • You can ask the file system how big the file is rather than reading through it to get the size. You could then slam the entire file into a buffer in a single read. (Alternatively, you could mmap the file in 16MB chunks sequentially.) In both cases a buffer scan with a pointer would then be extremely fast. – Ira Baxter Apr 29 '13 at 23:09
  • Thanks @IraBaxter, could you give a minimal example? – mchen Apr 29 '13 at 23:14
  • I'm really not a C++ coder; I'd give you a C-style solution. I'm sure there are other folks here who can give a C++-blessed version of my concept better than I can. [You should try Olaf's solution first, to see if it's fast enough to make you happy.] – Ira Baxter Apr 29 '13 at 23:21
3

On my machine, your reserve code takes about 1.1 seconds and your populate code takes 8.5 seconds.

Adding std::ios::sync_with_stdio(false); made no difference with my compiler.

The C code below takes 2.3 seconds.

// Assumes `file` is an open FILE* and `data` is a preallocated array
int i = 0;
while( true ) {
    float x;
    int j = fscanf( file, "%f", &x );
    if( j == EOF ) break;
    data[i++] = x;
    getc(file); // skip ',' or '\n'
}
brian beuning
2

Try calling

std::ios::sync_with_stdio(false);

at the start of your program. This disables the (allegedly quite slow) synchronization between cin/cout and scanf/printf (I have never tried this myself, but have often seen the recommendation, such as here). Note that if you do this, you cannot mix C++-style and C-style IO in your program.

(In addition, Olaf Dietsche is completely right about only reading the file once.)

Aasmund Eldhuset
-1

Apparently, file I/O is a bad idea; just map the whole file into memory and access the CSV file as a contiguous VM block. This incurs only a few syscalls.

xwlan
  • If you are going to read the entire file anyway, then MMIO makes no difference. The file has to be read - no way to avoid this. MMIO will only put a layer between memory and disk for you (so it might even be slower than just plain file I/O). – Mike Lischke Apr 30 '13 at 13:46
  • No, it does make a difference: MMIO is not standard buffered I/O, so it doesn't buffer internally and copy to a user buffer. Furthermore, for big files, say several GB, file I/O can blow up the kernel's file system cache, while MMIO doesn't, at least on Windows. I experienced this issue when I wrote D Probe, and finally managed to fix it with MMIO: only map a small portion of the whole file into memory, and slide the window when the caller requests the next file block. – xwlan Apr 30 '13 at 15:20