I have code written in C++ that reads a very big data file (10-20 GB). I read it line by line and it takes rather long. Is there any way to improve the efficiency?

I know that there are some posts about this, but my problem is not exactly the same...

The file contains the coordinates of N atoms and their velocities at given times.

My code:

void Funct(std::string path, float TimeToRead, int nbLines, float x[], float y[], float z[], float vx[], float vy[], float vz[], std::string names[], int index[])
{
    std::ifstream file(path.c_str());
    if (file)
    {
        /* x,y,z are arrays like float x[nbAtoms] */

        while (time != TimeToRead) {
            /*I Put the cursor at the given time to read before*/
            /*And then read atoms coordinates*/
        }

        for (int i = 0; i < nbAtoms; i++) {
            file >> x[i]; file >> y[i]; /* etc, load all*/
        }
    }
}

int main()
{
    /*Declarations : hidden*/

    for (int TimeToRead = 0; TimeToRead<finalTime; TimeToRead++) {
        Funct(...);
        /*Do some Calculus on the atoms coordinates at one given time */
    }
}

Currently I have around 2 million lines with 8 or 9 columns of numbers each. The file is a succession of blocks, each containing the atom coordinates at one given time.

I have to do calculations at each time step, so I am currently calling this function for every time step (around 4000 time steps, and there is a large number of atoms). In the end this is very expensive in time.

I have read somewhere that I could load everything into memory in one go instead of reading the file every time, but when the file is 20 GB I cannot really keep it all in RAM!

What can I do to improve this reading?

Thank you very much

Edit1: I am on Linux

Edit2: The file to read contains header lines like:

time= 1
coordinates atom 1
coordinates atom 2
...
...
...
time=2
coordinates atom 1
coordinates atom 2
...
...
...
etc

The while loop just reads each line from the beginning of the file until it finds time = TimeToRead.
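
Roughly, that skipping part looks like the sketch below (SkipToTime is just an illustrative name and the exact header parsing is an assumption, but that is the idea):

#include <cstdlib>
#include <fstream>
#include <string>

/* Read lines until the "time=" header matching TimeToRead is found;
   the stream is then positioned on the first atom line of that block. */
void SkipToTime(std::ifstream& file, int TimeToRead)
{
    std::string line;
    while (std::getline(file, line)) {
        if (line.compare(0, 5, "time=") == 0 &&
            std::atoi(line.c_str() + 5) == TimeToRead)
            break;
    }
}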

Minamoto
  • You could use a memory map, using MapViewOfFile; then the Windows memory manager efficiently manages the buffering and caching of that data when you access it. http://stackoverflow.com/questions/10836609/fastest-technique-to-read-a-file-into-memory – Colin Smith Apr 26 '17 at 17:18
  • At least you do not need to open the file for each iteration. Keep it open and continue reading in the next iteration. Won't save you much, though. – 463035818_is_not_an_ai Apr 26 '17 at 17:19
  • It seems like you are opening the file and iterating through it to find the starting point each time. Perhaps building an index of all starting points would save you some execution time (see the sketch after these comments). If the file contains data sorted by time, you could also save your last cursor position as your next starting point. – François Andrieux Apr 26 '17 at 17:22
  • I think you can save the result to a file after each iteration – keronconk Apr 26 '17 at 17:22
  • It's unlikely that your C++ program can make your hard drive spin any faster. Your performance is limited entirely by your system's I/O performance, and there is absolutely nothing, whatsoever, that can be done about it, short of getting a faster hard drive. Or, perhaps reengineering whatever you're trying to do, so that whatever you're trying to do no longer involves reading a 20 GB file. – Sam Varshavchik Apr 26 '17 at 17:29
  • Is the file in time order? If so, just keep it open and track the last timestamp read. The next timestamp will either be the next one in the file, or you'll have to skip some - there's no point re-reading the whole thing to get to the same place. Or, at least, remember where you were up to (your last offset) and seek straight there. – Useless Apr 26 '17 at 17:40

1 Answer

I think there is potential in optimizing (removing) the line-skipping code (while (time != TimeToRead)).

You open your file in every iteration, and then you skip lots of lines every time. If your file contains finalTime records, you skip 0 records in the first iteration, 1 record in the second, etc. In total you skip 0+1+2+...+(finalTime-1) records, that's (finalTime-1)*finalTime/2 :-) With the roughly 4000 time steps you mention, that is already about 4000*3999/2 ≈ 8 million skipped records. Multiply this by the lines per record and you'll see where a big portion of your time might be lost.

A solution could be: extract the file-open operation from your read method into the surrounding code. That way you read a record, do your calculations, and when you read the next record you don't have to open the file again and skip all those lines, since the stream automatically continues at the right position.

That should look like this in "pseudo code":

void Funct(std::ifstream& file, ...)
{
    if (file)
    {
        /* x,y,z are arrays like float x[nbAtoms] */

        for (int i = 0; i < nbAtoms; i++) {
            file >> x[i]; file >> y[i]; /* etc, load all*/
        }
    }
}

int main()
{
    std::ifstream file(path.c_str());

    for (int TimeToRead = 0; TimeToRead<finalTime; TimeToRead++) {
        Funct(file, ...);
        /*Do some Calculus on the atoms coordinates at one given time */
    }
}
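
A slightly more concrete sketch of the same idea, assuming each block starts with the time= header line described in the question, that the six coordinate/velocity columns come first on each line, and that any remaining columns can simply be skipped (nbAtoms and the coordinate arrays are from the question; the rest is an assumption):

#include <fstream>
#include <limits>
#include <string>

/* The stream stays open across calls, so each call only consumes one
   "time=" header line plus the nbAtoms coordinate lines that follow it. */
void Funct(std::ifstream& file, int nbAtoms,
           float x[], float y[], float z[],
           float vx[], float vy[], float vz[])
{
    std::string header;
    std::getline(file, header);   /* discard the "time= t" line */

    for (int i = 0; i < nbAtoms; i++) {
        file >> x[i] >> y[i] >> z[i] >> vx[i] >> vy[i] >> vz[i];
        /* skip whatever remains on this line (extra columns, if any) */
        file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
}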
Stefan Woehrer