
What's the most efficient (least error-prone / "proper" in general) way, if one exists, of handling data from files in C++ line by line? That is, only one line from a file will be used at a time to perform some lengthy calculations before moving on to the next one. I've thought of the following options but can't decide which is more appropriate.

  1. At the moment I'm doing something like this (open, do all the work, close at the end):

    string line; 
    fstream myfile;
    int numlines = 1000;
    myfile.open("myfile.csv");
    for(int i = 0; i < numlines; i++){
        getline(myfile, line); 
        // do something using read data
    };
    myfile.close();
    
  2. Open and close right after the data is read (this wouldn't hurt speed too much, as the calculations take much longer than the data import):

    string line; 
    fstream myfile;
    int numlines = 1000;
    for(int i = 0; i < numlines; i++){
        myfile.open("myfile.csv");
        for(int j = 0; j < i+1; j++)
            getline(myfile, line); 
        myfile.close();
        // do something using read data
    };
    
  3. Read all the data at once (would need to store it in a ~30x1000 2D array, as each line is split by commas):

    string line; 
    fstream myfile;
    const int numlines = 1000;        // constant, so the fixed-size array below is legal
    double data[numlines][30];
    myfile.open("myfile.csv");
    for(int i = 0; i < numlines; i++){
        getline(myfile, line);
        // split by comma, store in data[][]
    } 
    myfile.close();      
    for(int i = 0; i < numlines; i++){
        // do something using data[i][]
    };
    

Are there any pitfalls here, or is any of the above solutions as good as another as long as it works? I'm thinking that maybe keeping a file open for a few hours is not a good idea (maybe?), but keeping a large 2D array of doubles in memory doesn't sound right either...

sashkello
  • I'm *almost* tempted to vote to close as a duplicate of: http://stackoverflow.com/questions/1567082/how-do-i-iterate-over-cin-line-by-line-in-c/1567703, since I'd say iterating is really the correct answer to the question you asked. – Jerry Coffin May 07 '13 at 03:26

1 Answer


Use 1 if you can. Use 3 if you must. Never use 2.

Why? Option 1 uses only storage for a single line buffer. It traverses the file only once. Since an open file is generally not an expensive resource, it is likely to be the cheapest and simplest.
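
As a point of reference, here's a minimal sketch of option 1 in its idiomatic form (the file name and the processing step are placeholders): letting getline drive the loop avoids hard-coding the line count and stops cleanly at end of file:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream myfile("myfile.csv");   // opened once; closed automatically by the destructor
        if (!myfile) {
            std::cerr << "could not open file\n";
            return 1;
        }
        std::string line;
        while (std::getline(myfile, line)) {  // stops at end of file or on a read error
            // do the lengthy calculation using the current line here
        }
        return 0;
    }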

However, option 1 won't always be adequate. Sometimes you'll need to process lines in random order. Here's where option 3 is best. In this case, if there's enough memory, it's by far simplest to read the whole file and extract contents into memory. An array of strings suffices in many cases. In yours, the lines seem to contain text representations of doubles. So extracting these as you read is appropriate. In general, you want to extract in a storage- and/or access-efficient form.
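
Here's a sketch of option 3, assuming each line holds comma-separated doubles (the file name is a placeholder); using std::vector instead of a fixed-size array means the number of lines doesn't have to be known in advance:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream myfile("myfile.csv");
        std::vector<std::vector<double>> data;        // one inner vector per line
        std::string line;
        while (std::getline(myfile, line)) {
            std::vector<double> row;
            std::stringstream fields(line);
            std::string field;
            while (std::getline(fields, field, ','))  // split the line on commas
                row.push_back(std::stod(field));      // assumes every field parses as a double
            data.push_back(row);
        }
        // later: process data[i] in any order
        return 0;
    }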

If the file is so big that the contents won't fit in memory, then you must use random file access (fseek in C, or seekg on a stream in C++). For text lines, read through the file once to find the offsets of the line starts and store these in an array to serve as a line index. Visit a line by seeking to its start using the appropriate index entry, then reading up to the next newline. The index costs 8 bytes per line plus the buffer for a single line. If the file is really big, you can store the index itself in a file and seek twice per line access; it's best to put the index and the data on different disk drives to reduce seek time. Another option that eliminates the index is to require that all lines have the same length, so arithmetic suffices to find any line.
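
A sketch of that line-index idea (the file is opened in binary mode here so the tellg offsets are exact byte positions to seek back to):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream myfile("myfile.csv", std::ios::binary);
        std::vector<std::streampos> index;   // byte offset of the start of each line
        std::string line;
        for (;;) {
            std::streampos pos = myfile.tellg();
            if (!std::getline(myfile, line)) break;
            index.push_back(pos);
        }

        // visit any line, in any order, by seeking to its recorded offset
        std::size_t i = 42;                  // example line number
        if (i < index.size()) {
            myfile.clear();                  // clear the eof flag set by the indexing pass
            myfile.seekg(index[i]);
            std::getline(myfile, line);
        }
        return 0;
    }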

Option 2 would make sense only if maintaining a single open file while you're processing a line imposed an excessive cost, which is practically never the case. Worse, your code would have to read O(n^2) units of data for a file of n units, which gets very bad for performance as the problem grows. Since file I/O is often a program's bottleneck, this can be very bad indeed.

Moreover, file open and close are fairly expensive operations, not to be done willy-nilly. I once worked on a large simulation system and was asked to see if I could speed it up. It did indeed seem unduly slow considering what it was doing. After a couple of weeks of reverse-engineering the code, I finally found that a trace file was being opened for append and closed once per iteration of the event loop. I moved the open and close outside the loop (adding an occasional flush inside the loop to compensate), and wahoo! The simulation sped up by a factor of 20 or more. The client was happy, to say the least.

Gene
    In my experience option 3 is the fastest in terms of file reading, although more memory consuming as you already stated. Considering that the calculations seem to take much longer than reading the file I think memory might be more of an issue than performance. So +1, I think option 1 might be best here. – Excelcius May 07 '13 at 05:16