1

Suppose a program has a caching mechanism where, at the end of some specific calculation, the program writes the output of that calculation to the disk to avoid re-computing it later, when the program is re-run. It does so for a large number of calculations, and saves each output to a separate file (one per calculation, with filenames determined by hashing the computation parameters). The data is written to the file with standard C++ streams:

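    void* data = /* result of computation */;
    std::size_t dataSize = /* size of the result in bytes */;
    std::string cacheFile = /* unique filename for this computation */;

    // Requires <cstddef>, <fstream> and <string>.
    std::ofstream out(cacheFile, std::ios::binary);
    // Store the size as raw bytes: a formatted `out << dataSize` would emit
    // undelimited decimal digits that a reader cannot separate from the payload.
    out.write(reinterpret_cast<const char *>(&dataSize), sizeof dataSize);
    out.write(static_cast<const char *>(data), static_cast<std::streamsize>(dataSize));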

The calculation is deterministic, hence the data written to a given file will always be the same.

Question: is it safe for multiple threads (or processes) to attempt this simultaneously, for the same calculation, and with the same output file? It does not matter if some threads or processes fail to write the file, as long as at least one succeeds, and as long as all programs are left in a valid state.

In the manual tests I ran, no program failure or data corruption occurred, and the file was always created with the correct content, but this may be platform-dependent. For reference, in our specific case, the size of the data ranges from 2 to 50 kilobytes.

  • What do you mean by safe? When two threads write the same data into a file, the data is written twice. If you write at the same time, the order is unpredictable. E.g. both threads write "123\n". The result could be "1123\n23\n" or "123\n123\n" or "123123\n\n" or .... – Thomas Sablik Apr 21 '20 at 09:17
  • why do you want to do this? I would rather make sure that one thread successfully writes and the others do not. Currently this sounds like a nightmare for further processing of the file when you have to first reconstruct its contents into something meaningful – 463035818_is_not_an_ai Apr 21 '20 at 09:21
  • @ThomasSablik That would only happen if the file was open with ``std::ios::app``, wouldn't it? – Corentin Schreiber Apr 21 '20 at 09:22
  • @idclev463035818 This is used to cache JIT-compiled code to disk, to improve runtime performance. As for making sure only one thread writes to the disk, this is possible within a single program with a mutex, but not if this is executed in multiple processes. – Corentin Schreiber Apr 21 '20 at 09:25
  • @CorentinSchreiber I tried it and you are right. It doesn't mix up the outputs without `std::ios::app` on my system (Linux, ext4, gcc) – Thomas Sablik Apr 21 '20 at 09:45
  • So, if some other thread has already computed and written out that which the current thread is about to compute, why not just reuse that prior result instead of computing/writing it out again? – 500 - Internal Server Error Apr 21 '20 at 11:30
  • @500-InternalServerError This is indeed what happens in most cases, but the concern here is the (rare but possible) case when two threads want that result at the same time and the result has not yet been saved on disk. The two threads will have no choice but to run the computation separately, and they may attempt to write down the result at the same time. – Corentin Schreiber Apr 21 '20 at 12:20

2 Answers

3

is it safe for multiple threads (or processes) to attempt this simultaneously, for the same calculation, and with the same output file?

Concurrent writes to the same file from multiple threads or processes are a race condition, so you may end up with a corrupted file. Nothing guarantees that ofstream::write is atomic; whether a write lands in one piece depends on the operating system and the particular filesystem.

The robust solution for your problem (works both with multiple threads and/or processes):

  1. Write into a temporary file with a unique name in the destination directory, so that the temporary and the final files are on the same filesystem and rename does not have to move data.
  2. rename the temporary file to its final name. This replaces the existing file if one is there, so readers see either the old complete file or the new complete file, never a partially written one. The non-portable (Linux-specific) renameat2 is more flexible.
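
The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not code from the answer: it assumes C++17 `std::filesystem`, the helper name `writeCacheAtomically` is hypothetical, and the temporary-name scheme (a random number plus a hash of the thread id) is an assumption that makes collisions unlikely rather than impossible. `std::filesystem::rename` is specified to behave like POSIX `rename`; how atomic the replacement is on non-POSIX platforms should be checked for the target system.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <ios>
#include <random>
#include <string>
#include <system_error>
#include <thread>

namespace fs = std::filesystem;

// Hypothetical helper: write to a temporary file, then rename it into place.
// Returns true if this caller's file ended up under the final name.
bool writeCacheAtomically(const fs::path& cacheFile, const void* data, std::size_t dataSize) {
    // Step 1: temporary file next to the final one (same directory, same filesystem).
    std::random_device rd;
    const auto tid = std::hash<std::thread::id>{}(std::this_thread::get_id());
    fs::path tmp = cacheFile;
    tmp += "." + std::to_string(rd()) + "." + std::to_string(tid) + ".tmp";

    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        // Assumed layout: raw size header followed by the payload.
        out.write(reinterpret_cast<const char*>(&dataSize), sizeof dataSize);
        out.write(static_cast<const char*>(data), static_cast<std::streamsize>(dataSize));
        out.flush();
        if (!out) {
            out.close();
            std::error_code ignored;
            fs::remove(tmp, ignored);  // best-effort cleanup
            return false;
        }
    }  // the stream is closed here, before the rename

    // Step 2: rename over the final name, replacing any existing file.
    std::error_code ec;
    fs::rename(tmp, cacheFile, ec);
    if (ec) {
        std::error_code ignored;
        fs::remove(tmp, ignored);
        return false;
    }
    return true;
}
```

Because the computation is deterministic, it does not matter which writer's rename happens last: every candidate file has the same content.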
Maxim Egorushkin
  • Thanks, this is probably the solution we will adopt as it only requires portable functions. Do you have references for your claims in the first sentence? – Corentin Schreiber Apr 22 '20 at 07:52
  • @CorentinSchreiber That's common knowledge. No standard requires filesystem writes to write the entire buffer and be atomic at the same time. Although that is a highly desirable feature. – Maxim Egorushkin Apr 22 '20 at 08:57
0

Threads within the same process can be synchronised (for example with a mutex) so that only one of them writes to a given file. Such in-process primitives do not work between different processes, however, and the C++ standard library offers nothing for inter-process synchronisation, so it is better to avoid having several processes write to the same file.
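
For the single-process case, a minimal sketch could look like the following. The names `cacheWriteMutex` and `writeCacheOnce` are hypothetical, and one process-wide mutex for all cache files is a simplifying assumption; it protects only threads of the same process, not other processes.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <ios>
#include <mutex>
#include <string>

// Hypothetical process-wide mutex guarding all cache writes.
std::mutex cacheWriteMutex;

// Returns true if this thread wrote the file, false if it already existed
// or the write failed.
bool writeCacheOnce(const std::string& cacheFile, const void* data, std::size_t dataSize) {
    std::lock_guard<std::mutex> lock(cacheWriteMutex);

    // Another thread of this process may have finished the same computation first.
    if (std::filesystem::exists(cacheFile)) {
        return false;
    }

    std::ofstream out(cacheFile, std::ios::binary);
    out.write(static_cast<const char*>(data), static_cast<std::streamsize>(dataSize));
    out.flush();
    return static_cast<bool>(out);
}
```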

Operating systems do provide special functions for locking files that are guaranteed to be atomic (like lockf on Linux or LockFile(Ex) on Windows). You might like to check them out.
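
As a rough, POSIX-only sketch of the locking approach with `lockf`: the helper `writeCacheLocked` and the `LockedWrite` enum are hypothetical names, and the policy of skipping the write when the lock is already held is an assumption that fits the question (it is enough that one writer succeeds). Note that `lockf` locks are advisory and belong to the process, so they do not exclude other threads of the same process, and a reader that ignores the lock can still observe a partially written file.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <unistd.h>

enum class LockedWrite { Written, Busy, Failed };

// Hypothetical helper: write the cache file only if no other process holds the lock.
LockedWrite writeCacheLocked(const std::string& cacheFile, const void* data, std::size_t dataSize) {
    // Open without O_TRUNC so the content is not destroyed before the lock is held.
    const int fd = ::open(cacheFile.c_str(), O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        return LockedWrite::Failed;
    }

    // F_TLOCK tries to lock without blocking; length 0 covers the whole file.
    if (::lockf(fd, F_TLOCK, 0) != 0) {
        const bool busy = (errno == EACCES || errno == EAGAIN);
        ::close(fd);
        return busy ? LockedWrite::Busy : LockedWrite::Failed;
    }

    // Lock held: discard old content, then write the payload in full.
    bool ok = (::ftruncate(fd, 0) == 0);
    const char* p = static_cast<const char*>(data);
    std::size_t left = dataSize;
    while (ok && left > 0) {
        const ssize_t n = ::write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR) {
                continue;  // interrupted by a signal, retry
            }
            ok = false;
            break;
        }
        p += n;
        left -= static_cast<std::size_t>(n);
    }

    ::close(fd);  // closing the descriptor also releases the lock
    return ok ? LockedWrite::Written : LockedWrite::Failed;
}
```

A caller that receives `LockedWrite::Busy` can simply carry on, since another process is already writing the same deterministic result.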

jignatius