
I've already solved my problem by moving the import to the top-level declarations, but it left me wondering: why can't I use a module that was imported in '__main__' in functions that are the targets of multiprocessing?

For example:

import os
import multiprocessing as mp

def run(in_file, out_dir, out_q):
    arcpy.RasterToPolygon_conversion(in_file, out_dir, "NO_SIMPLIFY", "Value")
    status = "Done with " + os.path.basename(in_file)
    out_q.put(status, block=False)

if __name__ == '__main__':
    raw_input("Program may hang, press Enter to import ArcPy...")
    import arcpy

    q = mp.Queue()
    _file = "path/to/file"
    _dir = "path/to/dir"
    # There are actually lots of files in a loop to build
    # processes but I just do one for context here
    p = mp.Process(target=run, args=(_file, _dir, q))
    p.start()

# I do stuff with the Queue below to report status to the user

When you run this in IDLE it doesn't error at all... it just keeps doing a Queue check (which is good, so not the problem). The problem is that when you run it from the CMD terminal (either the OS prompt or the Python one) it produces a NameError saying that arcpy is not defined!

Just a curious topic.
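
For reference, the working layout I ended up with is roughly the following (the only real change is that the arcpy import has moved up to the top-level declarations, so the spawned child process sees it when it re-imports this file):

import os
import multiprocessing as mp

import arcpy  # imported at module level now, so child processes get it too

def run(in_file, out_dir, out_q):
    arcpy.RasterToPolygon_conversion(in_file, out_dir, "NO_SIMPLIFY", "Value")
    out_q.put("Done with " + os.path.basename(in_file), block=False)

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=run, args=("path/to/file", "path/to/dir", q))
    p.start()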

  • Are you running on linux or windows? – tdelaney Apr 21 '17 at 14:26
  • @tdelaney Windows, that's why I am using the `if __name__` statement. –  Apr 21 '17 at 14:40
  • On Windows, `multiprocessing` effectively `import`s the main script into each Python subprocess it spawns, so the `if __name__ == '__main__'` will be `False` in those cases. In your script, that means that the module `arcpy` won't have been imported when `run()` is executed, because the process it's in executes in a completely separate memory space. – martineau Apr 21 '17 at 15:08

1 Answer


The situation is different on unix-like systems and on Windows. On the unixy systems, multiprocessing uses fork to create child processes that share a copy-on-write view of the parent's memory space. The child sees the imports from the parent, including anything the parent imported under if __name__ == "__main__":.

On Windows there is no fork, so a new process has to be started. But simply rerunning the parent program doesn't work - it would run the whole program again. Instead, multiprocessing runs its own Python program that imports the parent's main script and then pickles/unpickles a view of the parent object space that is, hopefully, sufficient for the child process.

That program is the __main__ of the child process, and the __main__ of the parent script doesn't run there: the main script is just imported like any other module. The reason is simple: running the parent's __main__ would run the full parent program again, which mp must avoid.

Here is a test to show what is going on: a main module called testmp.py and a second module test2.py that is imported by the first.

testmp.py

import os
import multiprocessing as mp

print("importing test2")
import test2

def worker():
    print('worker pid: {}, module name: {}, file name: {}'.format(os.getpid(), 
        __name__, __file__))

if __name__ == "__main__":
    print('main pid: {}, module name: {}, file name: {}'.format(os.getpid(), 
        __name__, __file__))
    print("running process")
    proc = mp.Process(target=worker)
    proc.start()
    proc.join()

test2.py

import os

print('test2 pid: {}, module name: {}, file name: {}'.format(os.getpid(),
        __name__, __file__))

When run on Linux, test2 is imported once and the worker runs in the main module.

importing test2
test2 pid: 17840, module name: test2, file name: /media/td/USB20FD/tmp/test2.py
main pid: 17840, module name: __main__, file name: testmp.py
running process
worker pid: 17841, module name: __main__, file name: testmp.py

Under Windows, notice that "importing test2" is printed twice - testmp.py was run two times. But "main pid" was only printed once - its __main__ wasn't run. That's because multiprocessing changed the module name to __mp_main__ during the import.

E:\tmp>py testmp.py
importing test2
test2 pid: 7536, module name: test2, file name: E:\tmp\test2.py
main pid: 7536, module name: __main__, file name: testmp.py
running process
importing test2
test2 pid: 7544, module name: test2, file name: E:\tmp\test2.py
worker pid: 7544, module name: __mp_main__, file name: E:\tmp\testmp.py
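
As an aside (this assumes Python 3.4+ and is a sketch rather than part of the demo above): the Windows behaviour can be reproduced on Linux by forcing the spawn start method, which makes the child re-import the main module as __mp_main__ in the same way:

import os
import multiprocessing as mp

def worker():
    print('worker pid: {}, module name: {}'.format(os.getpid(), __name__))

if __name__ == "__main__":
    # 'spawn' is the Windows default; on Linux it is opt-in (fork is the default)
    mp.set_start_method('spawn')
    print('main pid: {}, module name: {}'.format(os.getpid(), __name__))
    proc = mp.Process(target=worker)
    proc.start()
    proc.join()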
tdelaney
  • My `mp.Process()` is not rerunning `__main__` each time, is it? I just want each child process to run the code in `def run()`. –  Apr 21 '17 at 14:39
  • No, but it is re-importing your main module and calling it `__mp_main__`. That's why you hide stuff you don't want rerun under `if __name__ == "__main__":`. I've updated the answer with a demo. – tdelaney Apr 21 '17 at 15:02
  • Child startup is significantly more expensive on Windows - a new copy of Python is executed and modules are imported. – tdelaney Apr 21 '17 at 15:04
  • Excellent explanation! Unfortunately, arcpy cant be multithreaded and the processes can only run in parallel with multiprocessing. It is very expensive, do you have any suggestions to make it less so? –  Apr 21 '17 at 16:04
  • Start long-lived subprocesses early and keep them around a long time. Maybe a `Pool` and use `apply` when running a work item. Modules not needed by the parent can be imported in the worker process itself. Passing large datasets from parent to child is expensive also... have the child read the original files from disk if possible (see the sketch after these comments). – tdelaney Apr 21 '17 at 16:09
  • Alternately, write a completely separate child process that communicates with some sort of RPC, maybe Python's xml-rpc or (more to my liking) `zeromq`. Once again, keeping the payload between parent and child lean really helps. – tdelaney Apr 21 '17 at 16:10
  • Thanks for that great explanation of the functions [`_fixup_main_from_name`](https://github.com/python/cpython/blob/v3.10.2/Lib/multiprocessing/spawn.py#L240-L262) and [`_fixup_main_from_path`](https://github.com/python/cpython/blob/v3.10.2/Lib/multiprocessing/spawn.py#L265-L290) from the module `multiprocessing.spawn`. – Géry Ogam Feb 09 '22 at 18:02
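
A rough sketch of the pattern suggested in those comments (the pool size and the helper names here are illustrative placeholders, not something from the discussion above): create a long-lived Pool once, import arcpy only inside the worker processes, and feed work items to it with apply_async:

import multiprocessing as mp

def init_worker():
    # Runs once in each worker process; the parent never pays the arcpy import cost
    global arcpy
    import arcpy

def process_raster(in_file, out_dir):
    arcpy.RasterToPolygon_conversion(in_file, out_dir, "NO_SIMPLIFY", "Value")
    return "Done with " + in_file

if __name__ == '__main__':
    pool = mp.Pool(processes=4, initializer=init_worker)
    try:
        # apply_async hands work to the already-running workers instead of
        # starting a new process (and re-importing arcpy) per file
        results = [pool.apply_async(process_raster, (f, "path/to/dir"))
                   for f in ["raster1", "raster2"]]
        for r in results:
            print(r.get())
    finally:
        pool.close()
        pool.join()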