
I am implementing a mutual exclusion mechanism based on a file lock. Other instances of my script know that they are not supposed to run when they come across a specific file that is locked.

In order to achieve this, I have created and locked the file using fcntl.flock. When I release the lock, I also want to clean up the file, so that it doesn't sit there indicating a stale PID when no process is actually running.

My question is: when and how should I clean up the file, and in particular, at what point can I safely delete it? Basically I see two options:

  • truncate and delete the file before the lock is released
  • truncate and delete the file after the lock is released

From my understanding, each one exposes my application to slightly different race conditions. What is best practice, and what have I missed?

Here's an (overly simplified) example:

import fcntl
import os
import sys
import time

# open file for read/write, create if necessary
with open('my_lock_file.pid', 'a+') as f:
    # acquire lock, raises an exception if the lock is held by another process
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        # 'a+' positions the file at the end, rewind before reading the other PID
        f.seek(0)
        print('other process running:', f.readline())
        sys.exit()

    try:
        # do something
        f.write('%d\n' % os.getpid())
        f.flush()
        # more stuff here ...
        time.sleep(5)
    finally:
        # clean up here?
        # release lock
        fcntl.flock(f, fcntl.LOCK_UN)
        # clean up here?
# clean up here?
moooeeeep

2 Answers


I found this related question, which gives some suggestions about how to handle this case:

It also made me aware of another possible race condition, which occurs when another process deletes the file just after it has been opened by the current process. This would cause the current process to lock a file that no longer exists on the filesystem, and thus fail to block the next process, which would create it anew.

There I found the suggestion to use the O_EXCL open flag for atomic, exclusive file creation, which is exposed via the os.open() function for low-level file operations. I then implemented the following example accordingly:

import os
import sys
import time

# acquire: open file for writing, create it; O_EXCL guarantees this fails if the file already exists
fname = 'my_lock_file.pid'
try:
    fd = os.open(fname, os.O_CREAT|os.O_WRONLY|os.O_EXCL)
except OSError:
    # failed to open, another process is running
    with open(fname) as f:
        print "other process running:", f.readline()
        sys.exit()

try:
    os.write(fd, ('%d\n' % os.getpid()).encode())
    os.fsync(fd)
    # do something
    time.sleep(5)
finally:
    os.close(fd)
    # release: delete file
    os.remove(fname)

After implementing this, I found out that this is exactly the same approach the lockfile module uses for its pid files.
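For comparison, here is a rough sketch of the same flow using the lockfile package's PIDLockFile, based on my reading of its API (same file name as above):

import sys
import time

from lockfile import AlreadyLocked, LockTimeout
from lockfile.pidlockfile import PIDLockFile  # pip install lockfile

lock = PIDLockFile('my_lock_file.pid')
try:
    # timeout <= 0: fail immediately instead of waiting for the lock
    lock.acquire(timeout=0)
except (AlreadyLocked, LockTimeout):
    print('other process running:', lock.read_pid())
    sys.exit()

try:
    # do something; acquire() has already written our pid to the file
    time.sleep(5)
finally:
    # releases the lock and removes the pid file
    lock.release()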

moooeeeep
  • If the script exits abnormally during the "do something" step, the lock file won't be removed. All future attempts to run the script will print the "other process running" message because `os.O_EXCL` causes an error to occur if the file already exists. For this reason, other methods like the process lock provided in the [fasteners](http://fasteners.readthedocs.io/en/latest/api/process_lock.html) package do not clean up the lock file. If you want to allow for the lock file to be removed, I think you need to stat the file after acquiring the lock as was suggested in the answer you linked to. – ws_e_c421 Jun 26 '17 at 03:51
  • @ws_e_c421 Indeed, I noticed that the `lockfile` module was deprecated in the meantime. `fasteners` just uses the approach given in the OP (based on `fcntl.flock()`) and doesn't bother to clean up (truncate and delete) the file. – moooeeeep Mar 10 '20 at 08:31
  • Reading it again after two years, it took me a while to understand my comment. By "exit abnormally", I meant hitting a seg fault in a Python C extension, so the `finally` block never gets to remove the lock file. This is a situation I ran into personally. Also, my one point about cleanup is that `fasteners` does not try to remove the lockfile, it just releases the lock with `fcntl.flock()` and closes the file descriptor (see the usage sketch below). – ws_e_c421 Mar 15 '20 at 18:39
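For illustration, a minimal usage sketch of fasteners.InterProcessLock as discussed in the comments above (the lock file path is arbitrary; by design the lock file itself is left in place on release):

import sys

import fasteners  # pip install fasteners

lock = fasteners.InterProcessLock('/tmp/my_script.lock')
if not lock.acquire(blocking=False):
    # another process holds the flock() on the file
    print('other process running')
    sys.exit()
try:
    # do something
    pass
finally:
    # releases the flock() and closes the fd, but does not delete the file
    lock.release()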
  1. On Unix it is possible to delete a file while it is open - the inode is kept alive until every process that still has it open has closed it or exited.
  2. On Unix it is possible to check whether a file has been removed from all directories by checking its link count, which then becomes zero.

When the lock file gets created anew each time, this may be a valid solution under Unix:

import fcntl
import os
import time

# wrapped in a function so the retry loop can report success or failure
def run_locked(timeout):
    for attempt in range(timeout):
        # open file for read/write, create if necessary
        with open('my_lock_file.pid', 'a+') as f:
            # acquire lock, raises an exception if the lock is held by another process
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                st = os.fstat(f.fileno())
                if not st.st_nlink:
                    print('lock file got deleted')
                    continue  # try again
            except BlockingIOError:
                f.seek(0)
                print('other process running:', f.readline())
                time.sleep(1)
                continue  # try again

            try:
                # do something
                f.write('%d\n' % os.getpid())
                f.flush()
                # more stuff here ...
                time.sleep(5)
            finally:
                # clean up here, while the lock is still held
                os.unlink('my_lock_file.pid')
                # release lock
                fcntl.flock(f, fcntl.LOCK_UN)
        return True
    return False

Homework: explain why the traditional Unix name for removing a file is "unlink".

Guido U. Draheim
  • This is going to work when the file gets removed, but not when it gets renamed. In order to account for that, after flock succeeds, stat the file _by name_ again and compare its inode number with the inode number obtained from the existing fstat on the flocked file handle. If the numbers are the same, success; if not, try the whole thing again (see the sketch below). – oᴉɹǝɥɔ Dec 02 '20 at 05:37
  • I think time.sleep(1) is unnecessary, as next time flock will do the wait – leavez Mar 13 '21 at 16:11
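To make the suggestion in the first comment concrete, here is a minimal, untested sketch of that inode comparison (function name and path are illustrative):

import fcntl
import os

def acquire_pid_lock(path='my_lock_file.pid'):
    """Retry until we hold a lock on a file that is still linked at `path`."""
    while True:
        f = open(path, 'a+')
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is ours
        try:
            # stat the path again: if the file was deleted or renamed after we
            # opened it, we locked a stale inode and must start over
            if os.stat(path).st_ino == os.fstat(f.fileno()).st_ino:
                return f  # caller keeps f open; closing it releases the lock
        except FileNotFoundError:
            pass  # the path is gone entirely, retry
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()

The caller would then write its PID, do its work, unlink the path, and release the lock, as in the answer above.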