
I have written this small example:

import multiprocessing
from functools import partial

def foo(x, fp):
    print str(x) + " "+ str(fp.closed)
    return

def main():
    with open("test.txt", 'r') as file:
        pool = multiprocessing.Pool(multiprocessing.cpu_count())
        partial_foo = partial(foo, fp=file)
        print file.closed
        pool.map(partial_foo, [1,2,3,4])
        pool.close()
        pool.join()
        print file.closed
        print "done"

if __name__=='__main__':
    main()

Which will print:

False
2 True
3 True 
1 True
4 True
False
done

My question is: why are the file handles closed in the child processes, and how can I keep them open so that every process can work with the file?

Since it was asked in the comments:

$ uname -a && python2.7 -V
Linux X220 3.17.6-1-ARCH #1 SMP PREEMPT Sun Dec 7 23:43:32 UTC 2014 x86_64 GNU/Linux
Python 2.7.9
ap0
  • What OS are you using? What version of Python? – dano Dec 17 '14 at 17:40
  • @dano, Linux. Arch Linux to be exact running Python2.7. – ap0 Dec 17 '14 at 17:41
  • Related question: http://stackoverflow.com/questions/1075443/share-objects-with-file-handle-attribute-between-processes. In general, you probably *don't* want to do this, though there are platform-specific ways you can do it. What do you actually want each child to do with the fd? – dano Dec 17 '14 at 17:57
  • @dano, I'll try to explain it briefly. The iterable given to the pool.map function is a list of blocks/clusters which I want the function to analyze. So I thought this approach would be better than opening and closing the file in every running process. – ap0 Dec 17 '14 at 17:59
  • @dano how can multiprocessing specifically prevent the duplication of file descriptors in its implicit `fork`s? – Reut Sharabani Dec 17 '14 at 18:09
  • I wasn't sure whether to delete my question or flag it as a duplicate of the one dano showed me, so I just flagged it. It can be deleted if that would be better. I will look for another solution. – ap0 Dec 17 '14 at 18:28
  • @ReutSharabani In Python 3.4+ you can use the [context](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) feature to use a method other than `fork` to create the child processes. Prior to that, all you can do is try to keep fd's out of global state, or explicitly close open fd's in the child process after the duplication has occurred. – dano Dec 17 '14 at 18:32
  • @ap0 Your best option is probably to just explicitly open the file in each child. – dano Dec 17 '14 at 18:33
  • @dano, I was thinking about using a Queue. Wouldn't this be better? I let the parent process read from the file, feed it into the Queue, and let the child processes work with that. Or is opening and closing the file in every child better in your opinion? – ap0 Dec 17 '14 at 18:35
  • @ap0 Yes, you can do that, too. You pay the IPC cost of sending the data from the file between processes doing it that way, but that may end up being cheaper than having all the child processes trying to read different parts of the same file simultaneously. – dano Dec 17 '14 at 18:37
  • @dano, oh, so there is no problem when having 4 handles open to the same file? – ap0 Dec 17 '14 at 18:38
  • 1
    @ap0 It will work, but it may not perform very well, because all the processes will be trying to read from the disk at the same time. Your HDD can only read from one location at once, so it will end up skipping back and forth between different locations of the disk to read data for each process. That might end up being slower than just reading the file sequentially one time, and then sending that data to the children via `Queues`. I would probably only favor opening the file in each child if you're sending huge amounts of data through the Queues, since the IPC cost there will be very large. – dano Dec 17 '14 at 18:42
  • What I want to do is read 512 bytes (or more, depending on the cluster size of that hard drive image) and process that. What would be your final recommendation? Queues, or a file handle for every child process? – ap0 Dec 17 '14 at 18:46
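
To make the "explicitly open the file in each child" suggestion from the comments concrete, here is a minimal sketch; it is not code from the thread, and `BLOCK_SIZE`, `IMAGE_PATH`, and `analyze_block` are made-up names (512 bytes is just the cluster size mentioned above). Each worker opens its own handle and seeks to the offset it was given, so no file object ever has to cross the process boundary:

import multiprocessing

BLOCK_SIZE = 512          # assumed cluster size, made up for this sketch
IMAGE_PATH = "test.txt"   # assumed path, made up for this sketch

def analyze_block(offset):
    # each worker opens its own handle; only the integer offset is sent to it
    with open(IMAGE_PATH, 'rb') as f:
        f.seek(offset)
        data = f.read(BLOCK_SIZE)
    return offset, len(data)  # stand-in for the real analysis

if __name__ == '__main__':
    offsets = [i * BLOCK_SIZE for i in range(4)]
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    print pool.map(analyze_block, offsets)
    pool.close()
    pool.join()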

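As an alternative that stays closer to the Queue idea discussed in the comments, the parent can read the blocks sequentially and hand the raw bytes to the workers through `pool.map`, which delivers them over the pool's internal task queue. Again a sketch with made-up names, not code from the thread:

import multiprocessing

BLOCK_SIZE = 512          # assumed cluster size, made up for this sketch
IMAGE_PATH = "test.txt"   # assumed path, made up for this sketch

def analyze_data(args):
    index, data = args
    return index, len(data)  # stand-in for the real analysis

if __name__ == '__main__':
    # the parent reads sequentially; the bytes travel to the workers
    # over the pool's internal task queue
    blocks = []
    with open(IMAGE_PATH, 'rb') as f:
        index = 0
        data = f.read(BLOCK_SIZE)
        while data:
            blocks.append((index, data))
            index += 1
            data = f.read(BLOCK_SIZE)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    print pool.map(analyze_data, blocks)
    pool.close()
    pool.join()
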
1 Answer


It has to do with passing `file` as an argument. I changed `fp.closed` to `file.closed` inside `foo` and moved the code out of `main()`:

import multiprocessing
from functools import partial

def foo(x, fp):
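    # note: this uses the module-level 'file' object (inherited when the
    # Pool forks), not the 'fp' argument passed in through pool.map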
    print str(x) + " "+ str(file.closed)
    return

if __name__=='__main__':
    with open("test.txt", 'r') as file:
        pool = multiprocessing.Pool(multiprocessing.cpu_count())
        partial_foo = partial(foo, fp=file)
        print file.closed
        pool.map(partial_foo, [1,2,3,4])
        pool.close()
        pool.join()
        print file.closed
        print "done"

Output:

False
1 False
2 False
3 False
4 False
False
done
Robert Jacobs
  • What? Why does that work? Shouldn't I get an error saying that `file` in function `foo` is unknown? But you are right, it works. – ap0 Dec 17 '14 at 18:01
  • But where? At least I didn't create a global file object. Or did I? Sorry, if this question is stupid. – ap0 Dec 17 '14 at 18:03
  • It's because on Linux, `fork` is used to spawn the child processes in the `Pool`, so `file` object defined inside the `if __name__ == "__main__"` block gets inherited in each child. This code would break if you opened the file after creating the `Pool` instance. – dano Dec 17 '14 at 18:03
  • This answer makes no sense. You can remove `fp` as far as I can tell... All you did was to take `file` from `main`'s scope. – Reut Sharabani Dec 17 '14 at 18:04
  • 2
    So to state it another way, you're not actually sending the open fd between process in the `pool.map` call. The open fd is inherited from the global state by each child when `fork` is called in the `multiprocessing.Pool(...)` call, which means you can access it in the worker process. – dano Dec 17 '14 at 18:06
  • Well, the problem is that I unintentionally made the file object global. Robert's solution works, but not if I work it into my actual program -- where the file object is not created globally. I didn't think about that. – ap0 Dec 17 '14 at 18:06
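
The accepted fix relies on `file` being visible at module level in `__main__` when the `Pool` forks. If the file object is not created globally (as in the last comment), one alternative, sketched here with made-up names (`init_worker`, `worker_fp`) rather than taken from the thread, is to give the `Pool` an initializer that opens the file once in every worker process:

import multiprocessing

def init_worker(path):
    # runs once in each worker process; every worker gets its own handle
    global worker_fp
    worker_fp = open(path, 'rb')

def foo(x):
    print str(x) + " " + str(worker_fp.closed)
    return

if __name__ == '__main__':
    pool = multiprocessing.Pool(multiprocessing.cpu_count(),
                                initializer=init_worker,
                                initargs=("test.txt",))
    pool.map(foo, [1, 2, 3, 4])
    pool.close()
    pool.join()
    print "done"

This does not rely on `fork` inheritance, so it also works when the pool is created somewhere the file object is not in scope.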