It seems obvious that writing from multiple processes to the same file may corrupt the data if the write() calls are not somehow synchronized. See this related question: Python multiprocessing safely writing to a file.

However, while trying to reproduce this possible bug for testing purposes, I was not able to get the messages in the file mixed up. I wanted to do this to compare running with and without lock protection (a sketch of the locked variant follows the code below).

Without doing anything special, the file seems to be somehow protected.

import multiprocessing
import random

NUM_WORKERS = 10
LINE_SIZE = 10000
NUM_LINES = 10000

def writer(i):
    line = ("%d " % i) * LINE_SIZE + "\n"
    with open("file.txt", "a") as file:
        for _ in range(NUM_LINES):
            file.write(line)

def check(file):
    # Every line must consist of LINE_SIZE copies of a single worker id;
    # a line mixing different ids means two writers interleaved.
    for _ in range(NUM_LINES * NUM_WORKERS):
        values = next(file).strip().split()
        assert len(values) == LINE_SIZE
        assert len(set(values)) == 1

if __name__ == "__main__":
    processes = []

    for i in range(NUM_WORKERS):
        process = multiprocessing.Process(target=writer, args=(i, ))
        processes.append(process)

    for process in processes:
        process.start()

    for process in processes:
        process.join()

    with open("file.txt", "r") as file:
        check(file)
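
For reference, here is a minimal sketch of the locked variant I compare against (the locked_writer name, the shared multiprocessing.Lock, and the flush before releasing the lock are my additions, not part of the test above):

import multiprocessing

# Same constants as above.
NUM_WORKERS = 10
LINE_SIZE = 10000
NUM_LINES = 10000

def locked_writer(i, lock):
    line = ("%d " % i) * LINE_SIZE + "\n"
    with open("file.txt", "a") as file:
        for _ in range(NUM_LINES):
            with lock:            # serialize writes across processes
                file.write(line)
                file.flush()      # push the line out while still holding the lock

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    processes = [multiprocessing.Process(target=locked_writer, args=(i, lock))
                 for i in range(NUM_WORKERS)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()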

I'm using Linux, and I also know that file writes may be atomic depending on the buffer size: Is file append atomic in UNIX?
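
One way to see where Python's userspace buffer ends and the underlying write(2) call begins is to open the file unbuffered in binary mode, so each file.write() maps to a single syscall (a sketch; buffering=0 is only accepted in binary mode):

def writer(i):
    line = (("%d " % i) * LINE_SIZE + "\n").encode()
    # buffering=0 disables Python's userspace buffer; every write()
    # below goes straight to a write(2) on the O_APPEND descriptor.
    with open("file.txt", "ab", buffering=0) as file:
        for _ in range(NUM_LINES):
            file.write(line)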

I tried increasing the size of the messages, but that doesn't produce corrupted data either.

Do you know of any code sample I could use that produces corrupted files using multiprocessing on Linux?

Delgan
  • I managed to corrupt the file when I write char-by-char, not by lines: `for c in line: file.write(c)` – Andrej Kesely Jul 05 '19 at 09:57
  • @AndrejKesely In that case it's expected, I guess, as there are multiple explicit `write` operations. I thought even a single call to `file.write()` wasn't supposed to be safe. – Delgan Jul 05 '19 at 10:04
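
Andrej's char-by-char variant, spelled out (only the inner loop comes from the comment; the surrounding function is a reconstruction):

def writer(i):
    line = ("%d " % i) * LINE_SIZE + "\n"
    with open("file.txt", "a") as file:
        for _ in range(NUM_LINES):
            # One write() call per character instead of per line.
            for c in line:
                file.write(c)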

1 Answer


AFAIU, the locking is done by the kernel. The reason you see the effects of locking even though you didn't ask for it is that the O_NONBLOCK file status flag is unset by default (when opening the file, I guess).

Consult the section of the manual on file status flags; in particular, see operating modes and man 2 fcntl.
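
For instance, a quick way to inspect those flags on an open file (a sketch using fcntl.F_GETFL; the printed interpretations assume the defaults for mode "a"):

import fcntl
import os

with open("file.txt", "a") as file:
    flags = fcntl.fcntl(file.fileno(), fcntl.F_GETFL)
    print(bool(flags & os.O_APPEND))    # True: mode "a" sets O_APPEND
    print(bool(flags & os.O_NONBLOCK))  # False by default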

I patched your example as follows to see the effects of O_NONBLOCK (and indeed, the assertion does fail now):

--- 1.py.orig   2019-07-05 14:49:13.276289018 +0300
+++ 1.py        2019-07-05 14:51:11.674727731 +0300
@@ -1,5 +1,7 @@
 import multiprocessing
 import random
+import os
+import fcntl

 NUM_WORKERS = 10
 LINE_SIZE = 10000
@@ -8,6 +10,8 @@
 def writer(i):
     line = ("%d " % i) * LINE_SIZE + "\n"
     with open("file.txt", "a") as file:
+        flag = fcntl.fcntl(file.fileno(), fcntl.F_GETFL)
+        fcntl.fcntl(file.fileno(), fcntl.F_SETFL, flag | os.O_NONBLOCK)
         for _ in range(NUM_LINES):
             file.write(line)
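
For clarity, the full writer after the patch (using the imports added above):

def writer(i):
    line = ("%d " % i) * LINE_SIZE + "\n"
    with open("file.txt", "a") as file:
        # Read the current file status flags and set O_NONBLOCK on top of them.
        flag = fcntl.fcntl(file.fileno(), fcntl.F_GETFL)
        fcntl.fcntl(file.fileno(), fcntl.F_SETFL, flag | os.O_NONBLOCK)
        for _ in range(NUM_LINES):
            file.write(line)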

Credit: see e.g. this and this (and/or man 3p write).

Vladislav Ivanishin