5

It seems obvious that writing from multiple process to the same file may cause corrupted data if the write() calls are not somehow synchronized. See this other question: Python multiprocessing safely writing to a file.

However, while trying to reproduce this possible bug for testing purposes, I was not able to cause the file messages to be mixed up. I wanted to do this to effectively compare with and without the lock security.

Without doing anything, file seems somehow protected.

import multiprocessing
import random

NUM_WORKERS = 10
LINE_SIZE = 10000
NUM_LINES = 10000

def writer(i):
    line = ("%d " % i) * LINE_SIZE + "\n"
    with open("file.txt", "a") as file:
        for _ in range(NUM_LINES):
            file.write(line)

def check(file):
    for _ in range(NUM_LINES * NUM_WORKERS):
        values = next(file).strip().split()
        assert len(values) == LINE_SIZE
        assert len(set(values)) == 1

if __name__ == "__main__":
    processes = []

    for i in range(NUM_WORKERS):
        process = multiprocessing.Process(target=writer, args=(i, ))
        processes.append(process)

    for process in processes:
        process.start()

    for process in processes:
        process.join()

    with open("file.txt", "r") as file:
        check(file)

I'm using Linux and I also know that file-writing may be atomic depending on the buffer size: Is file append atomic in UNIX?.

I tried to increase the size of the messages, but it doesn't produce corrupted data.

Do you know of any code sample I could use that produce corrupted files using multiprocessing on Linux?

Delgan
  • 18,571
  • 11
  • 90
  • 141
  • I managed to corrupt the file when i write char-by-char, not by lines: `for c in line: file.write(c)` – Andrej Kesely Jul 05 '19 at 09:57
  • @AndrejKesely In such case this is expected, I guess, as there is multiple `write` operations explicitly. I thought one call to `file.write()` wasn't even supposed to be safe. – Delgan Jul 05 '19 at 10:04

1 Answers1

2

AFAIU, the locking is done by the kernel. The reason you see the effects of locking even though you didn't ask for it is that the O_NONBLOCK file status flag is unset by default (when opening the file, I guess).

Consult the section of the manual on file status flags, in particular, see operating modes and man 2 fcntl.

I patched your example thusly to see the effects of O_NONBLOCK (and indeed, the assertion does fail now):

--- 1.py.orig   2019-07-05 14:49:13.276289018 +0300
+++ 1.py        2019-07-05 14:51:11.674727731 +0300
@@ -1,5 +1,7 @@
 import multiprocessing
 import random
+import os
+import fcntl

 NUM_WORKERS = 10
 LINE_SIZE = 10000
@@ -8,6 +10,8 @@
 def writer(i):
     line = ("%d " % i) * LINE_SIZE + "\n"
     with open("file.txt", "a") as file:
+        flag = fcntl.fcntl(file.fileno(), fcntl.F_GETFD)
+        fcntl.fcntl(file.fileno(), fcntl.F_SETFL, flag | os.O_NONBLOCK)
         for _ in range(NUM_LINES):
             file.write(line)

Credit: see e.g. this and this (and/or man 3p write).

Vladislav Ivanishin
  • 2,092
  • 16
  • 22