0

I have 10 large 2D gridded vectors (34 million cells each) storing doubles. When written, each is over 200 MB in size. I use an ofstream object to write them to a text file (CSV format), one element at a time, using two for loops (one for rows, one for columns). They take way too long to write. Is there a faster way to write out these vectors?

Here is my code:

// Resizing of vectors
flowDirGrid.resize(rows, std::vector<double>(cols, NODATA));

// Do some processing
....

// Outputting processed data
ofstream outfile3("FlowDir.dat");
if(!outfile3.good())
    return;

for (int i = 0; i < rows; i++)
{
    for (int j = 0; j < cols; j++)
    {
        if (elevation[i][j] != NODATA)
            outfile3 << flowDirGrid[i][j] << " ";
        else
            outfile3 << NODATA << " ";
    }
    outfile3 << std::endl;
}

outfile3.close(); 

I am using C++ and Visual Studio 2012.

EDIT: I have removed all the std::endl instances and replaced them with "\n", and it still takes 17 minutes to write each output file. I may move to using the recommended C method.

Would using a ternary expression instead of the if-else speed it up at all?

traggatmot
  • Do you really need CSV? – Karoly Horvath Aug 15 '15 at 07:03
  • Yes, I need to load these files into excel and/or GIS, which requires comma delimited files. – traggatmot Aug 15 '15 at 07:04
  • 2
    If `fstream` is too slow, you can test if using lower level `std::snprintf` to a char buffer, and then using `std::fwrite` to write buffer contents out speeds it up. Or if you are ok with using extra libraries, you could look for example at [The Boost Format library](http://www.boost.org/doc/libs/1_55_0/libs/format/doc/format.html) and see if they perform better. – hyde Aug 15 '15 at 07:09
  • What is "way too long"? Is it any slower than, say, copying a 200MB file? Seems likely that'll you'll be bottlenecked at file IO. – Mud Aug 15 '15 at 07:24
  • Wouldn't that question be better suited to code review site? – Daniel Jour Aug 15 '15 at 07:31
  • 1
    Too long is 10-20 minutes, depending on the machine. Sometimes the code this is part of needs to be run 10-100 times in succession for calibration of the processing functions. So every minute I save in 1 run, could save 10 to 100 minutes in some runs. – traggatmot Aug 15 '15 at 07:31
  • 1
    You could check out http://stackoverflow.com/a/11564931/19254 and http://codereview.stackexchange.com/q/80720/77127 to see if they help. – Reunanen Aug 15 '15 at 07:38
  • 1
    First thing to do when output seems to be slow is to only flush the buffer at the end, rather than in the loop. `std::endl` is basically a return AND flush of the output buffer. – abort Aug 15 '15 at 08:00
  • 1
    If you do not need the ability to use rows of different sizes, consider using a contiguous data structure – M.M Aug 15 '15 at 10:29
  • As others have said, you might want to write "\n" in most places you're writing std::endl. You might also consider saving the data as you're processing it as raw doubles, then when the processing is done use a different program to convert the raw doubles to .CSV format. Maybe Excel and that other app also support reading of kinds of files that are more efficient than CSV? If you don't need the precision of doubles, you might use floats instead? Finally, while std::vector is most efficient, it's also contiguous. If you don't need that maybe std::deque objects might be better? –  Aug 15 '15 at 14:18
  • I have removed all the std::endl and replaced with "\n" and it still takes 17 minutes to write each output. – traggatmot Aug 19 '15 at 00:58
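M.M's suggestion of a contiguous data structure could be sketched as follows — a single flat `std::vector<double>` indexed as `row * cols + col` instead of a vector of vectors. The `Grid` name and `at` accessor are illustrative, not from the original post:

```cpp
#include <vector>
#include <cstddef>

// Sketch: one contiguous allocation instead of rows separate ones.
// Traversing cells in order then matches the order the file is written,
// which is friendlier to both the cache and the allocator.
struct Grid
{
    std::size_t rows, cols;
    std::vector<double> cells;

    Grid(std::size_t r, std::size_t c, double fill)
        : rows(r), cols(c), cells(r * c, fill) {}

    double& at(std::size_t r, std::size_t c) { return cells[r * cols + c]; }
};
```

Replacing `flowDirGrid[i][j]` with `flowDirGrid.at(i, j)` in the write loop would be the only call-site change.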

3 Answers

2

The C++ iostream library is convenient, but very slow. `fprintf` will serve you better. Also, use `'\n'` instead of `std::endl`, as the latter forces the stream to flush.


#include <cstdio>
#include <chrono>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>

using namespace std;
using namespace std::chrono;

void PrintC(const double * data, size_t n, const char * path)
{
    FILE * f = fopen(path, "w");
    for (size_t i(0); i != n; ++i)
        fprintf(f, "%lf ", data[i]);
    fclose(f);
}

void PrintCpp(const double * data, size_t n, const char * path)
{
    ofstream f(path);
    for (size_t i(0); i != n; ++i)
        f << data[i] << ' ';
}

template<typename PrintT>
void Time(const vector<double> & data, PrintT Print, const char * path, const char * text)
{
    auto s = steady_clock::now();
    Print(data.data(), data.size(), path);
    auto f = steady_clock::now();

    cout << text << ": " << duration_cast<duration<double>>(f - s).count() << endl;
}

int main()
{
    vector<double> data(34000000);
    default_random_engine generator;
    uniform_real_distribution<double> distribution(0.0, 1.0);
    for (size_t i(0); i != data.size(); ++i)
        data[i] = distribution(generator);

    Time(data, PrintC,   "test1.dat", "c");
    Time(data, PrintCpp, "test2.dat", "c++");
}

Visual Studio 2013 Professional, release configuration:

c: 17.2682
c++: 32.0839
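If you want to stay with iostreams, one variation worth trying (my own suggestion, not benchmarked above) is to enlarge the stream's buffer before opening the file, so the filebuf hands data to the OS far less often. Note that whether `pubsetbuf` honors the request on a file stream is implementation-defined, so measure before relying on it:

```cpp
#include <fstream>

// Sketch: install a 1 MB buffer on the ofstream's filebuf.
// pubsetbuf must be called before open() to have any chance of working.
void WriteWithBigBuffer(const char* path, int n)
{
    static char buf[1 << 20];               // 1 MB stream buffer
    std::ofstream f;
    f.rdbuf()->pubsetbuf(buf, sizeof buf);  // request honored at the
    f.open(path);                           // implementation's discretion
    for (int i = 0; i != n; ++i)
        f << i * 0.5 << ' ';
}
```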

Don Reba
  • 1
    Your statements are mostly wrong. iostream can be as fast as the C API. – edmz Aug 15 '15 at 08:09
  • I bet you'll see a sensible speed-up by turning off the synchronization with stdio (std::ios_base::sync_with_stdio(false)). – edmz Aug 15 '15 at 13:25
  • 1
    No difference. Looking at the standard library code, it sets a flag called `_Sync`, which isn't actually used anywhere. – Don Reba Aug 15 '15 at 13:52
1

20 minutes for 200 MB seems really long. You have a performance problem, so you should test each part in turn:

  1. Take one 200 MB file and copy it (directly at the OS level). If that alone takes about 10 minutes, the bottleneck is here: buy a faster disk.

  2. Write a test program that generates a random set of values (not all 0.0, because 0.0 is simpler to convert than other double values — the values should differ), record the time (to at least second precision), use the above code to write the file, and record the time again. Alternatively, you could use your current code and just add the two time measurements. Run it several times. If it takes significantly longer than the first test, report back here with the times for both tests.

  3. If neither of the above tests takes about 10 minutes, the problem is in the remaining code...

Serge Ballesta
0

According to this answer, `std::endl` causes the stream to flush. Try using `"\n"` instead.
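A minimal illustration of the difference (my own example, not from the answer): `'\n'` just appends a newline to the buffer, while `std::endl` appends one and forces a flush on every call.

```cpp
#include <fstream>

// Sketch: write n lines with '\n' only. The filebuf flushes when its
// buffer fills and again when the stream is destroyed.
void WriteLines(const char* path, int n)
{
    std::ofstream f(path);
    for (int i = 0; i != n; ++i)
        f << i << '\n';        // buffered newline, no flush
    // f << i << std::endl;    // would flush every iteration - avoid in hot loops
}
```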

Peopleware