6

1) Does the multiprocessing module support using a Python script file to start a second process instead of a function?

Currently I use multiprocessing.Process, which takes a function, but I would like to execute foo.py instead. I could use subprocess.Popen, but the benefit of multiprocessing.Process is that I can pass objects (even if they just get pickled).

2) When I use multiprocessing.Process, why is my_module imported in the child process but print("foo") is not executed? How is my_module available although the main scope is not executed?

import multiprocessing
import my_module

print("foo")

def worker():
    print("bar")
    my_module.foo()
    return

p = multiprocessing.Process(target=worker)
p.start()
p.join()
Daniel Stephens

2 Answers

4

There is no fundamental difference between a Python function and a routine you want to run in another process; a function is just a named procedure.

Say the script file you wish to run in another process (foo.py in this context) contains the following:

# for demonstration only
from stuff import do_things

a = 'foo'
b = 1
do_things(a, b) # it doesn't matter what this does

You could refactor foo.py this way:

from stuff import do_things

def foo():
    a = 'foo'
    b = 1
    do_things(a, b)

And in the module where you spawn the process:

import multiprocessing
from foo import foo

p = multiprocessing.Process(target=foo)
# ...
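
If you also want to pass objects to the child process (the reason given for preferring multiprocessing.Process over subprocess.Popen), you can hand picklable arguments to the target via args. A minimal sketch, assuming foo is further refactored to accept a parameter:

import multiprocessing
from foo import foo

payload = {'a': 'foo', 'b': 1}  # any picklable object can be passed this way
p = multiprocessing.Process(target=foo, args=(payload,))
p.start()
p.join()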

The Process API requires that a "callable" be provided as the target. If, say, you tried to provide the module foo as the target (where foo.py is the first version above, without a function foo):

import foo
from multiprocessing import Process

p = Process(target=foo)
p.start()

You will get TypeError: 'module' object is not callable, and for a good reason. Also note that the moment you import the foo module, its module-level code executes eagerly, because it is not wrapped inside a function/procedure, i.e. a callable. Try inserting a print statement in a module file and importing it: module-level statements are evaluated right away.
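
For instance, a tiny sketch with hypothetical file names:

# mod_with_print.py
print('this runs as soon as mod_with_print is imported')

# main.py
import mod_with_print  # the print above executes here, immediately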

This answers question number 2:

When you imported my_module at the top level, it was imported once when the module was loaded, even though worker had not been executed. my_module is available to worker because the worker procedure refers to the module-level name my_module (it closes over it). When you pass a subroutine like worker to a concurrent process, there is no guarantee when it will be called, or even whether it will be called at all.

You can import a module anywhere in a Python module, including inside a function/subroutine, but doing so in this case might not be optimal or necessary.
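
For completeness, a sketch of what a function-local import would look like in the question's worker:

def worker():
    import my_module  # imported lazily, only when worker actually runs
    print("bar")
    my_module.foo()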

Pandemonium
  • Awesome, that answers my questions! Thanks – Daniel Stephens Dec 07 '18 at 19:25
  • Great answer, but regarding number 2: the implementation and its behaviour differ completely between Windows and Linux/Unix. Worth mentioning that `foo` indeed gets printed on Windows 5 times, since the process is not forked but rather restarted – HelloWorld Dec 07 '18 at 19:26
  • @seb-mtl could you elaborate? Why did it print exactly 5 times? – Pandemonium Dec 07 '18 at 19:40
  • multiprocessing forks the process, so you get a clone, but the processes execute different paths afterwards. On Windows there is no real `fork`, so Python simulates the "closest" behaviour possible, which means Python executes the entire module again and then executes the function – HelloWorld Dec 07 '18 at 21:16
  • To explain that in detail, a comment is not the right place, but I found a great answer which covers exactly that; look for the accepted answer: https://stackoverflow.com/questions/38236211/why-multiprocessing-process-behave-differently-on-windows-and-linux-for-global-o – HelloWorld Dec 07 '18 at 21:16
1

You can use multiprocessing.Pool() and pass the function you want to execute to one of its methods. I have personally used it because you can split the data into multiple parts and you also have the flexibility to choose how many CPUs to use.
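
For example, a minimal self-contained sketch (the square function and the data are made up for illustration):

from multiprocessing import Pool, cpu_count

def square(x):
    # stand-in for whatever work you want to parallelise
    return x * x

if __name__ == '__main__':
    data = range(10)
    with Pool(processes=cpu_count()) as pool:
        # the iterable is split across the worker processes
        results = pool.map(square, data)
    print(results)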

Vikika
  • Could you add an example? – Sean Pianka Dec 07 '18 at 19:13
  • `from multiprocessing import Pool, cpu_count; pool = Pool(processes=cpu_count()); objs = pool.imap_unordered(Inter_method, np.array_split(Interchange_df, 5, axis=0))`. In this case I am using the Pool class of multiprocessing and the imap_unordered method of that class. Inter_method is the function I wanted to use, and I am splitting the dataframe into 5 parts. You can tweak this according to your requirements – Vikika Dec 07 '18 at 19:17