I have the following snippet from a main.py script:

------------------------------------------------------------------------------------------------

a bunch of code above that has a write-to-disk process, meant to run only once
------------------------------------------------------------------------------------------------

import concurrent.futures
from multiprocessing import get_context

import util  # provides init_func and multi_process_job

context = get_context('spawn')
someInitData1 = context.Value('i', -1)
someInitData2 = context.Value('i', 0)

with concurrent.futures.ProcessPoolExecutor(max_workers=4,
                                            mp_context=context,
                                            initializer=util.init_func,
                                            initargs=(someInitData1, someInitData2)
                                            ) as executor:
    multiProcessResults = [x for x in executor.map(util.multi_process_job,
                                                   someArguments1,
                                                   someArguments2,
                                                   )]

I intend only for util.multi_process_job to be parallelized with multiprocessing. For some reason, though, with this snippet all of the code in my main.py gets re-run from the beginning, in parallel, by each new worker process.

What is strange to me is that this same snippet works fine for my needs when I run it in a Jupyter notebook: only the specified function runs. The problem only occurs when I convert the .ipynb file to a .py file and run it as a regular Python script on a Linux machine.

RabbitBadger
  • Because each process has to import your module (with some multiprocessing backends; the `fork()`-based ones don't have this issue). While it's _possible_ to avoid this by changing how you configure `multiprocessing`, the _better_ answer is the one that also makes your code compatible with a wider array of static analysis tools: you shouldn't have your main code unconditionally at the top level; put it in a function, and call that function only if `__name__ == '__main__'`. – Charles Duffy Aug 19 '22 at 21:48
  • Not closing this as a duplicate because some of the details of that question are Windows-specific and don't apply to you in entirety, but [Python multiprocessing on Windows, `if __name__ == "__main__":`](https://stackoverflow.com/questions/20222534/python-multiprocessing-on-windows-if-name-main) is _very_ relevant (on Windows, `fork()` isn't an option as it is on Linux, so the configuration that avoids the issue without fixing your code isn't possible... but you should still fix your code anyhow). – Charles Duffy Aug 19 '22 at 21:52
  • This problem seems to pop up fairly regularly. – Roland Smith Aug 19 '22 at 22:00
  • Thank you for the answers! To quickly test my script, I wrapped all the code in my main.py in the suggested if condition. Sure enough, the script is working now. I am interested in what you said with "While it's possible to avoid this by changing how you configure multiprocessing". Do you have any references for this? – RabbitBadger Aug 19 '22 at 22:08

1 Answer

The problem is here:

context = get_context('spawn')

...wherein you're forcing a mode that's compatible with Windows, but creates each new process as a completely new copy of Python. Consequently, those new processes need to import your module separately; thus they rerun any code that's invoked on import.
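To see this concretely, here's a minimal sketch (the file name, job function, and worker count are illustrative, not from the question). The module-level print fires once in the parent and once more in each spawned worker as it re-imports the file:

# demo.py -- illustrative sketch, not the asker's code
import concurrent.futures
from multiprocessing import get_context

print('module-level code running')  # runs in the parent AND in every spawned worker

def job(x):
    return x * x

if __name__ == '__main__':
    ctx = get_context('spawn')
    with concurrent.futures.ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        print(list(executor.map(job, range(4))))  # [0, 1, 4, 9]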

To avoid this, use get_context('fork') to make each new process be a copy of your existing Python process, with all the prior state (like the modules already loaded and cached in memory) available.
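That change is a one-liner, though note that 'fork' only exists on POSIX platforms. A sketch, with a hedged fallback in case the script ever has to run on Windows:

import sys
from multiprocessing import get_context

# 'fork' is unavailable on Windows (get_context raises ValueError there),
# so fall back to 'spawn' when portability matters
start_method = 'fork' if sys.platform != 'win32' else 'spawn'
context = get_context(start_method)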


Alternately, you can put all your top-level code inside an if __name__ == '__main__': gate, so it only runs when your script is executed, but not when it's imported.
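Applied to the snippet from the question, that restructuring looks roughly like this (a sketch: the once-only write-to-disk step and the someArguments iterables are assumed to exist as described above):

import concurrent.futures
from multiprocessing import get_context

import util  # provides init_func and multi_process_job, per the question

def main():
    # the once-only work (e.g. the write-to-disk step) belongs in here,
    # not at module level, so re-importing workers never repeat it
    context = get_context('spawn')
    someInitData1 = context.Value('i', -1)
    someInitData2 = context.Value('i', 0)

    with concurrent.futures.ProcessPoolExecutor(max_workers=4,
                                                mp_context=context,
                                                initializer=util.init_func,
                                                initargs=(someInitData1, someInitData2)
                                                ) as executor:
        # someArguments1/someArguments2 assumed built earlier inside main()
        return list(executor.map(util.multi_process_job,
                                 someArguments1,
                                 someArguments2))

if __name__ == '__main__':
    multiProcessResults = main()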

Charles Duffy