
I have large .txt files with more than a million lines and 7 columns of floating-point numbers per line. The columns are separated by spaces.

Currently, I import the files by reading each line (getline), turning the line into a stream, and then storing the seven values in array variables (please see my minimal example). However, this procedure is quite slow and takes around 10 minutes for 3 million lines (500 MB). That corresponds to 0.8 MB/s and is much slower than writing the files takes. My hard drive is an SSD.

Can you give me advice on how to improve the efficiency of the code?

Best, Fabian

C++

#include <iostream>
#include <string>
#include <sstream>
#include <fstream>

struct Container { double a, b, c, d, e, f, g; };

void read_my_file(std::ifstream &file, Container *&data) {
    std::string line;
    std::stringstream line_as_stream;
    unsigned long int row;

    data = new Container[300000]; //dynamically allocated because the 
                                  //length is usually a user input.

    for (row = 0; row < 300000; row++) {
        getline(file, line);
        line_as_stream.str(line);

        // parse the seven space-separated columns of this line
        line_as_stream >> data[row].a;
        line_as_stream >> data[row].b;
        line_as_stream >> data[row].c;
        line_as_stream >> data[row].d;
        line_as_stream >> data[row].e;
        line_as_stream >> data[row].f;
        line_as_stream >> data[row].g;

        line_as_stream.clear(); // reset eof/fail bits before reusing the stream
    }
}

int main(void) {
    Container *data = nullptr;
    std::ifstream file;

    file.open("./myfile.txt", std::ios::in);
    read_my_file(file, data);
    std::cout << data[2].b << "\n";

    file.close();

    return 0;
}
Fabian K
  • This answer [Efficiently reading a very large text file in C++](http://stackoverflow.com/questions/26736742/efficiently-reading-a-very-large-text-file-in-c) looks relevant. – rtur Jul 20 '16 at 08:28
  • Why don't you try using just `file >> some_string;` directly instead of first copying into a `stringstream`? (See the sketch after these comments.) – Arunmu Jul 20 '16 at 08:32
  • Also, are you timing a release, optimized build of your application? Or is it a "debug", unoptimized version? – PaulMcKenzie Jul 20 '16 at 08:34
  • You can read millions of lines per second. The time is going into your processing of the lines, not the I/O. – user207421 Jul 20 '16 at 08:57
  • It would make things simpler (and probably a bit faster) if you just memory mapped the file (see the sketch after these comments). – Jesper Juhl Jul 20 '16 at 09:00
  • In my experience, `std::stringstream` is slow as hell, both when constructing it and when extracting data. Try replacing it with a plain `sscanf(line.c_str(), "%lf %lf %lf %lf %lf %lf %lf", &data[row].a, &data[row].b, &data[row].c, &data[row].d, &data[row].e, &data[row].f, &data[row].g)` and see if the situation improves (see the sketch after these comments). – Matteo Italia Jul 20 '16 at 09:54
  • Writing is very fast: the data is copied to the file system cache and written to disk later. Reading can't be as fast unless you have a time machine. Lots of RAM, and starting your program right after the file was written, helps. But 800 KB/sec is clearly too slow; iostreams are in general unfit for fast I/O. They were designed without consideration for threading, and the standard's promises about std::locale are too weak. Making them thread-safe was costly; lots of fine-grained locks kill perf. Use a profiler so we don't have to guess. – Hans Passant Jul 20 '16 at 10:16
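
A minimal sketch of Arunmu's suggestion: extract the seven doubles straight from the file stream and skip the getline/stringstream round-trip entirely. The fixed row count of 300000 is taken from the question's example; a real version would take it as a parameter.

#include <fstream>
#include <iostream>

struct Container { double a, b, c, d, e, f, g; };

int main() {
    std::ifstream file("./myfile.txt");
    Container *data = new Container[300000];

    // operator>> already skips whitespace (spaces and newlines alike),
    // so no per-line buffer or stringstream is needed
    for (unsigned long row = 0; row < 300000 && file; row++) {
        file >> data[row].a >> data[row].b >> data[row].c
             >> data[row].d >> data[row].e >> data[row].f
             >> data[row].g;
    }

    std::cout << data[2].b << "\n";
    delete[] data;
    return 0;
}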
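A sketch of Jesper Juhl's memory-mapping idea, assuming a POSIX system (Windows would use CreateFileMapping/MapViewOfFile instead). The whole file is mapped once and parsed in place with strtod. Note that strtod expects a terminated buffer, so production code would guard the end of the mapping (e.g. check p against begin + st.st_size); that guard is omitted here for brevity.

#include <cstdlib>
#include <iostream>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Container { double a, b, c, d, e, f, g; };

int main() {
    int fd = open("./myfile.txt", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // map the whole file read-only; the kernel pages it in on demand
    char *begin = static_cast<char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (begin == MAP_FAILED) return 1;

    Container *data = new Container[300000];
    char *p = begin;
    for (unsigned long row = 0; row < 300000; row++) {
        // strtod skips leading whitespace and advances p past each number
        data[row].a = strtod(p, &p);
        data[row].b = strtod(p, &p);
        data[row].c = strtod(p, &p);
        data[row].d = strtod(p, &p);
        data[row].e = strtod(p, &p);
        data[row].f = strtod(p, &p);
        data[row].g = strtod(p, &p);
    }

    std::cout << data[2].b << "\n";
    munmap(begin, st.st_size);
    close(fd);
    delete[] data;
    return 0;
}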
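And a sketch of Matteo Italia's sscanf variant, plugged into the question's read function (note %lf rather than %f, since the struct members are double; the fixed row count again comes from the question):

#include <cstdio>
#include <fstream>
#include <string>

struct Container { double a, b, c, d, e, f, g; };

void read_my_file(std::ifstream &file, Container *&data) {
    std::string line;
    data = new Container[300000];

    for (unsigned long row = 0; row < 300000; row++) {
        if (!std::getline(file, line)) break;
        // one sscanf call parses the whole row; no stringstream involved
        std::sscanf(line.c_str(), "%lf %lf %lf %lf %lf %lf %lf",
                    &data[row].a, &data[row].b, &data[row].c,
                    &data[row].d, &data[row].e, &data[row].f,
                    &data[row].g);
    }
}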

1 Answer


I think that is because C++ streams are not buffered by default, so in the first loop you are getting only one line (since it is not buffered), and through the iterations you are accessing the hard drive again and again, which will be pretty slow. You may want to look at this question; it might help you.

Dante
  • Uhm, C++ streams *are* buffered by default, either directly or through `stdio` (which is buffered); also, even the operating system buffers reads, so it's unlikely that he is actually hitting the disk continuously. – Matteo Italia Jul 20 '16 at 09:22
  • Hmm, I didn't know that. Is it possible that the buffer size is small? If so, changing the buffer size manually may help (see the sketch below). – Dante Jul 20 '16 at 09:35
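
A minimal sketch of that idea: hand the stream a larger buffer via pubsetbuf. Whether a filebuf honours the request is implementation-defined, so this is something to measure rather than assume.

#include <fstream>
#include <vector>

int main() {
    std::vector<char> buf(1 << 20); // 1 MiB instead of the default buffer
    std::ifstream file;

    // pubsetbuf only takes effect if called before the file is opened
    file.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    file.open("./myfile.txt");

    // ... read and parse as before ...
    return 0;
}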