
In Python, how can I safely overwrite a file without renaming it?

There is a question on SO where this topic is discussed, but the provided solutions can't help me, because in my case I have many hard links pointing to the file being overwritten.

Is there any other method that can guarantee an atomic change of the file (without renaming it)?

Thank you very much!

Igor Chubin

2 Answers


Python gives you access to underlying OS tools. Please review Atomic operations in UNIX.

Overall you have two requirements: atomicity and support for hard links. The referenced answer also mentions safety.

The first is very narrowly satisfiable, but only if you drop safety. Typically you'd use POSIX advisory locks: if every client takes these locks, you can have a very robust system; sqlite, for example, works this way.

Mandatory locking is available, but not commonly enabled. The main sticking point with mandatory locks is priority inversion: a non-privileged user can block a root process if they access the same file.
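To make the advisory-lock route concrete, here is a minimal sketch using the standard `fcntl` module (POSIX only). The `locked_overwrite` helper name is my own; the key point is that every cooperating writer must take the same lock, since advisory locks bind nobody who ignores them:

```python
import fcntl
import os

def locked_overwrite(path, data):
    """Hypothetical helper: overwrite `path` in place under an
    exclusive POSIX advisory lock (cooperating writers only)."""
    # Open read-write without truncating, so the inode, and every
    # hard link pointing at it, is preserved.
    fd = os.open(path, os.O_RDWR)
    try:
        # Blocks until we hold the lock; only meaningful if every
        # other writer takes the same lock (it is advisory).
        fcntl.flock(fd, fcntl.LOCK_EX)
        os.write(fd, data)           # fresh open, so offset is 0
        os.ftruncate(fd, len(data))  # drop any leftover old tail
        os.fsync(fd)
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```
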

Hard links imply you have to work at the inode level. Any function in the reference above that operates on a file descriptor will work.

Atomic but not safe

A single write system call is atomic up to a certain filesystem-dependent threshold. If you can afford to buffer your file data in memory (anonymous or mapped), you can atomically overwrite the file. For the sake of simplicity let's assume the file size is fixed.

Consider the code below: when two processes perform this action simultaneously, both writes start at offset 0, each runs in a single system call, and in the end only one write "wins".

#!/usr/bin/env python
import sys

# Buffer the whole source file in memory first.
with open(sys.argv[1], "rb") as fi:
    data = fi.read()

# Open the target read-write without truncating ("rb+"), so the
# inode, and therefore every hard link to it, is preserved.
with open(sys.argv[2], "rb+") as fo:
    fo.seek(0)
    fo.write(data)

While this is atomic, it is not inherently safe. The write could turn out to be partial (typically only if the disk is full), or the operating system could crash mid-write, leaving you with a target file that is neither source a nor source b. If that's acceptable because you made a backup, go ahead and use it :)
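Partial writes can at least be detected: `os.write` returns the number of bytes actually written. A small sketch (the `overwrite_or_raise` name is an assumption of mine, not a standard API):

```python
import os

def overwrite_or_raise(path, data):
    # Open without truncation so the inode and its hard links survive.
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.write(fd, data)  # starts at offset 0 on a fresh open
        # A short count means the update is torn, e.g. the disk filled up.
        if written != len(data):
            raise OSError("partial write: %d of %d bytes"
                          % (written, len(data)))
        os.fsync(fd)
    finally:
        os.close(fd)
```
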

P.S. If the file size is not fixed, adopt a file format whose header specifies the size of the data in the file.
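For instance, with an assumed format of an 8-byte little-endian length prefix followed by the payload, header and data can go out in one write call (helper names are illustrative):

```python
import os
import struct

HEADER = struct.Struct("<Q")  # assumed: 8-byte little-endian payload length

def write_record(fd, payload):
    # Header + payload in a single write call keeps the update atomic
    # up to the filesystem's threshold; readers ignore any stale tail.
    os.lseek(fd, 0, os.SEEK_SET)
    os.write(fd, HEADER.pack(len(payload)) + payload)

def read_record(fd):
    os.lseek(fd, 0, os.SEEK_SET)
    (size,) = HEADER.unpack(os.read(fd, HEADER.size))
    return os.read(fd, size)
```
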

P.P.S. Although the sendfile system call now works on regular files for both input and output, testing shows the operation is not atomic. Here one thread tried to send 1000M of zeros and another 1000M of ff's; the result is exactly 1000M, but the data gets interleaved. The return value of one of the sendfile calls shows a partial write, but the size is inconsistent with the number of zeros actually written:

(env33)[dima@bmg ~]$ hexdump oux 
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
03c0000 0000 0000 0000 0000 0000 0000 0000 0000
*
3e800000
Dima Tisnek
  • Using FFmpeg, I was trying to stream a continuously updating image file. However, this would fail because, as I learnt later, the image update operation had to be atomic. To make it so, following the general advice, on detecting an image update I had to copy the image to a temporary file and then rename it to the one being read by FFmpeg. I tried doing that with the following approaches: using `shutil.copyfile`+`os.rename`, using `shutil.move`, using the `atomic_writes` library... and none of them worked. By contrast, your simple solution of decomposing the operation worked brilliantly. Thank you! – Redoman Mar 26 '22 at 04:16

It is not possible in general. Atomic replacement of a file's contents is not a feature offered by most filesystems.

If you can design the file format, then you can arrange to make a series of changes to it such that, in case of failure, the code that reads the file knows how far you got before failing. It's fairly difficult to make this both correct and efficient. I know that database engines pull some advanced tricks to ensure safe writing of data, but I'm afraid I don't know the details.

As a crude first attempt:

  • put a header at the start of the file that says at what offset the "current" data starts, and its length.
  • to update the file, first append the new data, then flush and fsync, then update the header to indicate where the current data now is.

The update of the header itself is still not guaranteed atomic, but since it's a small amount of data, it's likely that nothing short of a power failure will prevent it happening all-or-nothing. And maybe not even that. I suppose you could use the filename to contain the necessary offsets, since that's something you can update atomically on reputable filesystems.

Of course the file grows indefinitely. So, when your new data is smaller than the offset at which the current data starts, you could re-use the portion of the file before the region marked "in use" in the header, instead of appending.
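The crude scheme above might look like this as code (the 16-byte offset/length header and the helper names are assumptions, not a standard format):

```python
import os
import struct

HDR = struct.Struct("<QQ")  # assumed header: (offset, length) of current data

def init_file(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.write(fd, HDR.pack(HDR.size, 0))  # empty "current" region
    return fd

def update(fd, data):
    # 1. Append the new version after everything written so far.
    off = os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, data)
    os.fsync(fd)  # the data must be durable before the header points at it
    # 2. Flip the small header in one write; a reader sees either the old
    #    (offset, length) pair or the new one, never a mix of data versions.
    os.lseek(fd, 0, os.SEEK_SET)
    os.write(fd, HDR.pack(off, len(data)))
    os.fsync(fd)

def read_current(fd):
    os.lseek(fd, 0, os.SEEK_SET)
    off, length = HDR.unpack(os.read(fd, HDR.size))
    os.lseek(fd, off, os.SEEK_SET)
    return os.read(fd, length)
```

As noted, the header write itself is small but still not guaranteed atomic by POSIX, so this is a sketch of the idea rather than a crash-proof implementation.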

Steve Jessop