6

In my current project I'm dealing with a large amount of data that is generated on the fly by a `while` loop. I want to write the data to a CSV file, and I don't know which is better: should I store all the values in a vector and write them to the file at the end, or write on every iteration?

I guess the first option is better, but I'd like an elaborated answer if that's possible. Thank you.
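For concreteness, the two options look roughly like this (a minimal sketch; `moreData` and `makeRow` are hypothetical stand-ins for my real loop):

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real data source.
bool moreData(int i) { return i < 1000000; }
std::string makeRow(int i) {
    std::ostringstream os;
    os << i << ',' << i * 0.5;
    return os.str();
}

int main() {
    // Option A: accumulate all rows in a vector, write once at the end.
    std::vector<std::string> rows;
    for (int i = 0; moreData(i); ++i)
        rows.push_back(makeRow(i));
    std::ofstream out("data.csv");
    for (const std::string& r : rows)
        out << r << '\n';

    // Option B would instead stream makeRow(i) straight to `out`
    // inside the loop, one row per iteration.
    return 0;
}
```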

Arnaugir
  • If you have a large amount of data, then you will need lots of memory to store it in the vector, so it's a bad idea. Also, don't use `endl` when writing to files; use `\n` instead. – Neil Kirk Mar 27 '14 at 16:51
  • @NeilKirk: If performance is really important, `fopen` and `fprintf` beat `ofstream`, and then `std::endl` becomes a non-issue. – Ben Voigt Mar 27 '14 at 16:53
  • 1
    @BenVoigt Why, what's wrong with ofstream? – Neil Kirk Mar 27 '14 at 16:53
  • @NeilKirk I guess that he is concerned about a little overhead due to more calls and formatting maybe, but I would still prefer I/O abstraction over some meaningless performance difference. – Sebastian Hoffmann Mar 27 '14 at 16:54
  • @NeilKirk: Excessive use of mutexes. Virtual calls caused by unneeded customization hooks. Stupidity in common implementations. See http://stackoverflow.com/questions/4340396/does-the-c-standard-mandate-poor-performance-for-iostreams-or-am-i-just-deali The bottom line is that `ofstream` formatting is slower than most modern disk drives. – Ben Voigt Mar 27 '14 at 16:56
  • @BenVoigt That link doesn't mention ofstream. – Neil Kirk Mar 27 '14 at 16:58
  • @BenVoigt Disagree. Source? – Neil Kirk Mar 27 '14 at 17:02
  • @BenVoigt I didn't say that ostringstream can't. ofstream is not based on ostringstream, as far as I can tell from the spec and Visual Studio's library code. – Neil Kirk Mar 27 '14 at 17:04
  • @Paranaix: Expect to find people who disagree that a factor of 10x performance difference is "meaningless". – Ben Voigt Mar 27 '14 at 17:08
  • @Neil: There's plenty of real-world performance data showing that stdio is *much* faster and uses less CPU than iostreams. For example http://stackoverflow.com/a/11564931/103167 – Ben Voigt Mar 27 '14 at 17:18
  • @BenVoigt I did a quick test and you are right, FILE is faster than ofstream. However, I only noticed a factor of 2x-3x on my machine. – Neil Kirk Mar 27 '14 at 17:25
  • @Neil: Sure, the exact ratio depends on CPU specs, disk specs, what else is using the CPU, what else is using the disk. In an extreme scenario, you could see 40x difference, or no difference at all. But even when you get the same transfer rate from iostreams (for example, an SD card that's limiting the speed), you might be unhappy that it's taking 10 times as many CPU cycles in the process (taking time away from other threads, or just wasting battery power) – Ben Voigt Mar 27 '14 at 17:32
  • @BenVoigt Thank you for this information. I actually do a lot with huge files and I have been using fstream. – Neil Kirk Mar 27 '14 at 17:33

3 Answers

3

Make sure that you're using an I/O library with buffering enabled, and then write every iteration.

This way your computer can start doing disk access in parallel with the remaining computations.

PS. Don't do anything crazy like flushing after each write, or opening and closing the file each iteration. That would kill efficiency.
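A minimal sketch of that pattern, using buffered stdio as discussed in the comments (the file name and row format are placeholders):

```cpp
#include <cstdio>

int main() {
    std::FILE* f = std::fopen("data.csv", "w");  // stdio is buffered by default
    if (!f) return 1;
    for (int i = 0; i < 1000000; ++i) {
        double value = i * 0.5;  // placeholder for the real per-iteration computation
        // Each fprintf appends to an in-memory buffer; the runtime flushes
        // it to disk in large chunks. No fflush() per iteration.
        std::fprintf(f, "%d,%.6f\n", i, value);
    }
    std::fclose(f);  // the final flush happens here
    return 0;
}
```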

Ben Voigt
  • Thank you, I already had this in mind. I'm using `ofstream`, is that a good choice? (I'm kind of a noob with C++) – Arnaugir Mar 27 '14 at 16:53
  • 1
    @Arnaugir: `ofstream` does have buffering, as long as you avoid `flush` and `endl`. But it's also got a lot more overhead than `fopen`+`fprintf`. If you don't need locale-specific formatting (CSV is intended to be computer-readable, so you usually don't want that anyway) then I'd definitely suggest `fprintf`. – Ben Voigt Mar 27 '14 at 16:54
  • I thought CSV is supposed to be human-readable. I edit them in text editors all the time. – Neil Kirk Mar 27 '14 at 16:55
  • @NeilKirk: CSV is a computer-readable text format. Being text gives it some human-readability, but being computer-readable is what restricts the format. You don't want 100456/100 written out as `1004,56` or `1,004.56`. You want to ignore locale settings and write `1004.56` (see the sketch below). – Ben Voigt Mar 27 '14 at 16:59
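A small sketch of the locale point: if you do stick with `ofstream`, imbuing the classic "C" locale guarantees a `.` decimal point and no digit grouping, regardless of the user's settings (the file name is illustrative):

```cpp
#include <fstream>
#include <locale>

int main() {
    std::ofstream csv("out.csv");
    csv.imbue(std::locale::classic());  // force the "C" locale: '.' decimal point, no grouping
    csv << 100456 / 100.0 << '\n';      // writes 1004.56 whatever the global locale is
    return 0;
}
```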
0

The most efficient way to write to a file is to reduce the number of write operations and increase the amount of data written per operation.

Given a buffer of 512 bytes, the most inefficient method is to issue 512 write operations of one byte each. A more efficient method is to write all 512 bytes in a single operation.

There is overhead associated with each call to write to a file. That overhead consists of locating the file in the drive's catalog, seeking to a new location on the drive, and writing. The actual operation of writing is quite fast; it's the seeking and waiting for the hard drive to spin up and get ready that wastes your time. So spin it up once, keep it spinning by writing a lot of stuff, then let it spin down. The more data written while the platters are spinning, the more efficient the write will be.

  • Yes, there are caches everywhere along the data path, but all that will be more efficient with large data sizes.

I would recommend writing the formatted text to a buffer (sized as a multiple of 512 bytes) and, at certain points, flushing the buffer to the hard drive. (512 bytes is a common sector size on hard drives.)
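A minimal sketch of this scheme (the class, buffer size, and file name are all illustrative; note that stdio already buffers internally, so explicit buffering like this mainly buys you control over when the flushes happen):

```cpp
#include <cstdio>
#include <cstring>
#include <string>

class BufferedWriter {
public:
    explicit BufferedWriter(const char* path) : f_(std::fopen(path, "w")), used_(0) {}
    ~BufferedWriter() { flush(); if (f_) std::fclose(f_); }

    void write(const std::string& row) {
        if (row.size() > kBufSize) {        // oversized row: write it through directly
            flush();
            std::fwrite(row.data(), 1, row.size(), f_);
            return;
        }
        if (used_ + row.size() > kBufSize)  // buffer full: one big write
            flush();
        std::memcpy(buf_ + used_, row.data(), row.size());
        used_ += row.size();
    }

    void flush() {
        if (f_ && used_ > 0) {
            std::fwrite(buf_, 1, used_, f_);
            used_ = 0;
        }
    }

private:
    static const std::size_t kBufSize = 8 * 512;  // a multiple of the 512-byte sector size
    std::FILE* f_;
    char buf_[kBufSize];
    std::size_t used_;
};
```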

If you like threads, you can create a thread that monitors the output buffer. When the output buffer reaches a threshold, the thread writes the contents to drive. Multiple buffers can help by having the fast processor fill up buffers while other buffers are written to the slow drive.
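An illustrative double-buffered version with a writer thread (all names and the threshold are hypothetical; error handling omitted):

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::ofstream out("data.csv");
    std::vector<std::string> front, back;  // front: being filled; back: being written
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread writer([&] {
        std::unique_lock<std::mutex> lock(m);
        while (!(front.empty() && done)) {
            cv.wait(lock, [&] { return !front.empty() || done; });
            front.swap(back);              // take over the filled buffer
            lock.unlock();                 // write without holding the lock
            for (const std::string& r : back)
                out << r << '\n';
            back.clear();
            lock.lock();
        }
    });

    for (int i = 0; i < 1000000; ++i) {
        std::string row = std::to_string(i) + ',' + std::to_string(i * 0.5);
        std::lock_guard<std::mutex> guard(m);
        front.push_back(row);
        if (front.size() >= 4096)          // threshold reached: wake the writer
            cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> guard(m);
        done = true;
    }
    cv.notify_one();
    writer.join();
    return 0;
}
```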

If your platform has DMA, you might be able to speed things up by having the DMA controller write the data for you, although I would expect a good driver to do this automatically.

I do use this technique on an embedded system, using a UART (RS-232 port) instead of a hard drive. With the buffering, I'm able to get about 80% efficiency.
(Loop unrolling may also help.)

Thomas Matthews
  • That would minimize the total time the drive is spinning, but you're wrong when you say "waiting for the hard drive.... wastes your time". These are writes. They get put into a write buffer and the program happily continues while the seeking happens. Writes happen asynchronously; whether that uses interrupt handlers or threads is not really important, since the OS handles it. – Ben Voigt Mar 27 '14 at 20:16
  • @BenVoigt: There is always a limit to all the buffers. At some point, the processor has to spend time copying data from a buffer to the output port, usually by swapping execution time from your program to the OS. Drives with faster seek and startup times will perform better and your program will run faster. Again, *something* has to monitor the output buffers. – Thomas Matthews Mar 27 '14 at 20:20
  • But time spent copying to the output port is completely unrelated to drive seek time. Either you're generating data faster than the disk can accept it, in which case you end up with large transfers quite naturally, or else you transmit a block into the disk's writeback buffer, it spins and seeks and does all the physical things needed to write data without bothering your CPU in any way, and when the CPU completes the next block, it finds an empty writeback buffer to fill. In the latter case the seek time doesn't matter. Of course the I/O library should be buffered and not write single bytes – Ben Voigt Mar 27 '14 at 20:40
-1

The easiest way is to redirect the output in the console with the `>` operator. On Linux:

./miProgram > myData.txt

That takes the output of the program and puts it in a file.
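For that to work, the program must print its rows to standard output; a minimal sketch:

```cpp
#include <cstdio>

int main() {
    // Rows go to stdout; the shell redirects them into myData.txt:
    //   ./miProgram > myData.txt
    for (int i = 0; i < 10; ++i)
        std::printf("%d,%.2f\n", i, i * 0.5);
    return 0;
}
```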

Sorry for the english :)