
I'm developing a C++ application whose output is a single big binary file (a couple of GBs, basically a large sequence of floats). The content of this file is generated asynchronously by parallel processes.

Each time a process finishes, its result has to be saved to its corresponding position inside the binary file on disk. The order in which processes finish does not necessarily correspond to the order in which their results are stored, and it takes about 5 processes to produce the full output.

What would be the best way to achieve this in C++? I have a couple of solutions that work, but maybe they can be improved in terms of minimizing disk usage:

  • Saving individual files for each finished process, then merging
  • Keeping an fstream open and positioning the put pointer for each save operation using seekp() (see the sketch below)
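
Here is a minimal sketch of the second approach, assuming the final file is pre-sized and each chunk's byte offset is known in advance (file name, offsets and sizes are placeholders):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <vector>

// Write one finished result at its known byte offset.
void write_chunk(std::fstream& out, std::uint64_t byte_offset,
                 const std::vector<float>& chunk)
{
    out.seekp(static_cast<std::streamoff>(byte_offset));
    out.write(reinterpret_cast<const char*>(chunk.data()),
              static_cast<std::streamsize>(chunk.size() * sizeof(float)));
}

int main()
{
    const std::uint64_t total_bytes = 1024 * sizeof(float);  // placeholder final size

    // Create the file and pre-size it once, since the final size is known.
    { std::ofstream create("output.bin", std::ios::binary); }
    std::filesystem::resize_file("output.bin", total_bytes);

    std::fstream out("output.bin",
                     std::ios::in | std::ios::out | std::ios::binary);

    std::vector<float> result(256, 1.0f);                    // one process's result
    write_chunk(out, /*byte_offset=*/256 * sizeof(float), result);
}
```
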
mma
  • A couple of GBs file? Why not first store it in RAM in a large `std::vector`, and after it's filled, produce the file? – Stack Danny Apr 10 '21 at 13:54
  • If it's simple to merge the files, then that's probably the way to go. Otherwise, you'll need to worry about synchronization. – asynts Apr 10 '21 at 14:00
  • Do you know the exact positions in the final file in advance or do you determine them once all processes are done? Are the chunks aligned to some boundary? – rustyx Apr 10 '21 at 14:12
  • Exact positions in the final file are known in advance, as well as the final size of the file – mma Apr 10 '21 at 14:14
  • This is operating system specific and file system specific. My recommendation (if on Linux) would be to generate a dozen smaller files (e.g. 100 MB each) or to consider using [sqlite](http://sqlite.org/) or [PostGreSQL](http://postgresql.org/)... And don't forget to *backup* that output (remotely, or on external media) – Basile Starynkevitch Apr 10 '21 at 14:16

1 Answer


I wouldn't recommend wasting time on writing to temp files and merging, if it can be avoided.

Serializing through a single process / single stream will probably be much faster. But make sure to seek and write in chunks of at least 64 KB to reduce overhead.

I wouldn't use fstreams at all, as they come with some overhead (and you're dependent on the quality of the implementation, as is evident in 1, 2, 3, 4). Better to just use fopen, disable buffering, and write in chunks of 64 KB+.
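
A sketch of what that could look like, assuming the file has already been created at its final size (file name, offset and chunk size are placeholders; for offsets beyond 2 GB on platforms with a 32-bit `long`, `fseek` would need to be swapped for `fseeko`/`_fseeki64`):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Seek to a known byte offset and write one result in a single large call.
bool write_at(std::FILE* f, long byte_offset, const float* data, std::size_t count)
{
    if (std::fseek(f, byte_offset, SEEK_SET) != 0)
        return false;
    return std::fwrite(data, sizeof(float), count, f) == count;
}

int main()
{
    std::FILE* f = std::fopen("output.bin", "r+b");  // file pre-created at its final size
    if (!f) return 1;
    std::setvbuf(f, nullptr, _IONBF, 0);             // disable stdio buffering

    // 64 KB worth of floats (16384 * 4 bytes) as one chunk.
    std::vector<float> chunk(16384, 0.5f);
    write_at(f, /*byte_offset=*/0, chunk.data(), chunk.size());

    std::fclose(f);
    return 0;
}
```
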

For even better performance you can use memory-mapped I/O, for example via Boost.Iostreams (example). You can memory-map from multiple processes, too.
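
A rough sketch with Boost.Iostreams (file name and sizes are placeholders):

```cpp
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <vector>

int main()
{
    namespace io = boost::iostreams;

    const std::size_t total_floats = 1024;            // final size known in advance

    io::mapped_file_params params;
    params.path          = "output.bin";
    params.new_file_size = total_floats * sizeof(float);
    params.flags         = io::mapped_file::readwrite;

    io::mapped_file file(params);                     // creates and maps the whole file

    std::vector<float> result(256, 2.0f);             // one process's result
    const std::size_t offset = 256 * sizeof(float);   // its known byte position
    std::memcpy(file.data() + offset, result.data(), result.size() * sizeof(float));

    // The mapping is flushed and unmapped when 'file' goes out of scope.
    return 0;
}
```
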

If the fragments generated by the separate processes are a multiple of 4 KB or more, on most OSes you can simply open the same file in each process, seek to the desired location and write (not very portable, but OK on Linux, BSD and Win32). On Win32 you just need to set the file share mode accordingly.
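
For example, on Linux/BSD a POSIX sketch of that could look like the following; `pwrite()` writes at an explicit offset without touching a shared file position, so no seeking or locking is needed (file name and offsets are placeholders, error handling trimmed):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

int main()
{
    // Each worker process opens the pre-sized output file independently.
    int fd = ::open("output.bin", O_WRONLY);
    if (fd < 0) return 1;

    std::vector<float> fragment(4096 / sizeof(float), 3.0f);  // one 4 KB-aligned fragment
    const off_t byte_offset = 4096;                           // this process's slot

    // Write the fragment at its offset; the call does not disturb other writers.
    ::pwrite(fd, fragment.data(), fragment.size() * sizeof(float), byte_offset);
    ::close(fd);
    return 0;
}
```
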

rustyx