
I am looking at a way to allow concurrent file object seeking.

As a test case of file seeking going awry:

#!/usr/bin/env python2
import time, random, os
s = 'The quick brown fox jumps over the lazy dog'

# create some file, just for testing
f = open('file.txt', 'w')
f.write(s)
f.close()

# the actual code...
f = open('file.txt', 'rb')
def fn():
    out = ''
    for i in xrange(10):
        k = random.randint(0, len(s)-1)
        f.seek(k)
        time.sleep(random.randint(1, 4)/10.)
        out += s[k] + ' ' + f.read(1) + '\n'
    return out

import multiprocessing
p = multiprocessing.Pool()
n = 3
res = [p.apply_async(fn) for _ in xrange(n)]
for r in res:
    print r.get()
f.close()

I have worker processes that seek to a random position within the file, sleep, and then read one character. I compare each character read against the character at that position in the original string. I do not print right away to avoid concurrency issues with printing.

You can see that when n=1 it all goes well, but everything goes astray when n>1, because the processes share the same file descriptor and therefore the same file offset.

I have tried to duplicate the file descriptor within fn():

def fn():
    fd = os.dup(f.fileno())
    f2 = os.fdopen(fd, 'rb')

And then I use f2. But it does not seem to help.

How can I do seeking concurrently, i.e. from multiple processes? (In this case, I could just open the file within fn(), but this is a MWE. In my actual case, it is harder to do that.)

Ricardo Magalhães Cruz

1 Answer


You cannot - Python I/O builds on C's I/O, and there is only one "current file position" per open file in C. That's inherently shared.

What you can do is perform your seek+read under protection of an interprocess lock.

For example, define:

def process_init(lock):
    global seek_lock
    seek_lock = lock

and in the main process add this to the Pool constructor:

initializer=process_init, initargs=(multiprocessing.Lock(),)

Then whenever you want to seek and read, do it under the protection of that lock:

with seek_lock:
    f.seek(k)
    char = f.read(1)

As with any lock, you want to do as little as logically necessary while it's held. It won't allow concurrent seeking, but it will prevent seeks in one process from interfering with the seeks in other processes.

It would, of course, be better to open the file in each process, so that each process has its own notion of file position - but you already said you can't. Rethink that ;-)

Tim Peters
  • Ah - I see from the [C `dup()` manual](http://man7.org/linux/man-pages/man2/dup.2.html)... "They refer to the same open file description and thus share file offset and file status flags" – Ricardo Magalhães Cruz Jun 06 '16 at 22:14
  • Not good for multiprocessing ;-) http://stackoverflow.com/questions/11635219/dup2-dup-why-would-i-need-to-duplicate-a-file-descriptor – Tim Peters Jun 06 '16 at 23:11
  • Okay, I remember using `dup2()` in college when building a shell, and I guess I have used `dup()` as well. It seems mostly used as a kind of reference counting system to avoid the actual file channel being closed. – Ricardo Magalhães Cruz Jun 07 '16 at 09:27
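The shared-offset behaviour quoted from the `dup()` man page is easy to verify: an `lseek()` through one of the duplicated descriptors moves the offset seen by the other. A quick check (`demo.txt` is just a throwaway test file):

```python
import os

# create a small test file
with open('demo.txt', 'w') as fh:
    fh.write('abcdef')

fd1 = os.open('demo.txt', os.O_RDONLY)
fd2 = os.dup(fd1)                # same open file description, same offset

os.lseek(fd1, 3, os.SEEK_SET)    # seek through one descriptor...
data = os.read(fd2, 1)           # ...and the other one reads from there
# data == b'd' (position 3 of 'abcdef'), not b'a'

os.close(fd1)
os.close(fd2)
```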