
As we all know, we need to protect the main module with if __name__ == '__main__' when running code with multiprocessing in Python.

I understand that this is necessary in some cases to give access to functions defined in the main module, but I do not understand why it is necessary in this case:

file2.py

import numpy as np
from multiprocessing import Pool
class Something(object):
    def get_image(self):
        return np.random.rand(64,64)

    def mp(self):
        image = self.get_image()
        p = Pool(2)
        res1 = p.apply_async(np.sum, (image,))
        res2 = p.apply_async(np.mean, (image,))
        print(res1.get())
        print(res2.get())
        p.close()
        p.join()

main.py

from file2 import Something
s = Something()
s.mp()

All of the functions and imports necessary for Something to work are part of file2.py. Why does the subprocess need to re-run main.py?

I think the __name__ solution is not very nice, as it prevents me from distributing the code of file2.py: I can't make sure that users protect their main module. Isn't there a workaround for Windows? How do packages solve this? (I have never run into this problem with any package despite not protecting my main; are they just not using multiprocessing?)

edit: I know that this happens because fork() is not implemented on Windows. I was just asking whether there is a hack to let the interpreter start at file2.py instead of main.py, as I can be sure that file2.py is self-sufficient.

skjerns
    The `if __name__ == '__main__'` hack is only necessary on Windows, since that platform does not have `fork()`. If you choose _any_ other operating system, you won't need it. – Sven Marnach Jul 14 '17 at 19:30
  • OP, just to confirm, you _are_ on Windows, correct? – cs95 Jul 14 '17 at 19:37
    If I understand you correctly, you're writing `file2.py` as a library, and you want to support user code like `main.py` (which may be written by some other person in the future). Unfortunately, I don't think there's any way to protect your users from the requirements of `multiprocessing`. You probably just need to document that your module requires script code to be put within `if __name__ == "__main__"` blocks so that nothing gets run if the module is imported. – Blckknght Jul 15 '17 at 10:32
  • @Blckknght Thanks, that was exactly the answer I was looking for! (Although not the answer I was hoping for ;) ) – skjerns Jul 18 '17 at 10:50
  • It is not clear why the new process needs to re-import `main.py` rather than just re-importing `file2.py`. Is this to guarantee that all variables (definitions, imports) are available? In other words, in this example it wouldn't be needed, but if you passed some variable into `mp` you might run into problems if that variable was defined in a module imported by `main.py`? – Jimbo Jul 23 '22 at 18:43
  • And what if you ran the contents of `file2.py` from the console? – Jimbo Jul 23 '22 at 18:44

4 Answers


When using the "spawn" start method, new processes are Python interpreters that are started from scratch. It's not possible for the new Python interpreters in the subprocesses to figure out what modules need to be imported, so they import the main module again, which in turn will import everything else. This means it must be possible to import the main module without any side effects.

If you are on a different platform than Windows, you can use the "fork" start method instead, and you won't have this problem.
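
For example, on Linux or macOS you can pin the start method explicitly. The following is a minimal sketch of my own (not part of the original answer, assuming Python 3.4+ where multiprocessing.set_start_method is available), reusing file2.py from the question:

main.py

import multiprocessing
from file2 import Something

# With "fork" the workers inherit the parent's memory, so main.py is never
# re-imported and no __main__ guard is strictly required.
# Note: this raises ValueError on Windows, where "fork" is unavailable.
multiprocessing.set_start_method('fork')
s = Something()
s.mp()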

That said, what's wrong with using if __name__ == "__main__":? It has a lot of additional benefits: documentation tools will be able to process your main module, unit testing is easier, and so on, so you should use it in any case.

Sven Marnach

As others have mentioned, the "spawn" start method used on Windows will re-import the code in each new instance of the interpreter. This import will execute your code again in the child process (which, without the guard, would make it create its own child, and so on).

A workaround is to pull the multiprocessing script into a separate file and then use subprocess to launch it from the main script.

I pass variables into the script by pickling them in a temporary directory, and I pass the temporary directory into the subprocess with argparse.

I then pickle the results into the temporary directory, where the main script retrieves them.

Here is an example file_hasher() function that I wrote:

main_program.py

import os, pickle, shutil, subprocess, sys, tempfile

def file_hasher(filenames):
    try:
        # Write the input arguments into a temporary directory so the
        # separately launched script can pick them up.
        subprocess_directory = tempfile.mkdtemp()
        input_arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
        with open(input_arguments_file, 'wb') as func_inputs:
            pickle.dump(filenames, func_inputs)
        # Launch file_hasher.py with the same interpreter and hand it the
        # temporary directory on the command line.
        current_path = os.path.dirname(os.path.realpath(__file__))
        file_hasher_script = os.path.join(current_path, 'file_hasher.py')
        python_interpreter = sys.executable
        subprocess.call([python_interpreter, file_hasher_script, subprocess_directory],
                        timeout=60)
        # Read back the pickled results.
        output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
        with open(output_file, 'rb') as func_outputs:
            hashlist = pickle.load(func_outputs)
    finally:
        shutil.rmtree(subprocess_directory)
    return hashlist

file_hasher.py

#! /usr/bin/env python
import argparse, hashlib, os, pickle
from multiprocessing import Pool

def file_hasher(input_file):
    with open(input_file, 'rb') as f:
        data = f.read()
        md5_hash = hashlib.md5(data)
    hashval = md5_hash.hexdigest()
    return hashval

if __name__ == '__main__':
    # The temporary directory is passed on the command line by main_program.py.
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('subprocess_directory', type=str)
    subprocess_directory = argument_parser.parse_args().subprocess_directory

    # Unpickle the input arguments written by main_program.py.
    arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
    with open(arguments_file, 'rb') as func_inputs:
        filenames = pickle.load(func_inputs)

    # Hash the files in parallel; the Pool is safe here because this script
    # is only ever executed directly, never imported.
    hashlist = []
    p = Pool()
    for r in p.imap(file_hasher, filenames):
        hashlist.append(r)
    p.close()
    p.join()

    # Pickle the results for main_program.py to read back.
    output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
    with open(output_file, 'wb') as func_outputs:
        pickle.dump(hashlist, func_outputs)
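
Calling it then looks roughly like this (my own sketch, with made-up file names). The caller needs no __main__ protection, because the Pool lives entirely inside the separately launched file_hasher.py:

from main_program import file_hasher

# the file names are purely illustrative
hashes = file_hasher(['part1.bin', 'part2.bin'])
print(hashes)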

There must be a better way...

Chris Hubley
  • No problem. I had the same issue, and it's the best that I could come up with. It's not pretty, but it allows you to run a multiprocessing script from within a module across multiple platforms. There's a bit of overhead, though, so if you knew that your code would never be run on Windows, I think it's better to just neglect the if __name__=='__main__' protection. – Chris Hubley Dec 12 '18 at 11:39

The if __name__ == '__main__' guard is needed on Windows since Windows doesn't have a "fork" option for processes.

On Linux, for example, you can fork the process, so the parent process is copied and the copy becomes the child process (and it has access to all the code you had already imported in the parent process).

Since you can't fork on Windows, Python simply imports, in the child process, all the code that was imported by the parent process. This creates a similar effect, but if you don't use the __name__ trick, this import will execute your code again in the child process (and that child would create its own child, and so on).

So even in your example, main.py will be imported again (since all the files are imported again). Python can't guess which specific script the child process should import.
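
You can see this re-import in action. Here is an illustrative sketch of my own (not from the original answer), adding a module-level print to the main.py from the question; it behaves this way under the "spawn" start method used on Windows:

main.py

import multiprocessing
from file2 import Something

# This line runs once in the parent and once more in every worker process,
# because main.py is re-imported there. In the workers __name__ is not
# '__main__', so the guarded block below is skipped.
print(__name__, multiprocessing.current_process().name)

if __name__ == '__main__':
    s = Something()
    s.mp()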

FYI, there are other limitations you should be aware of, such as the use of globals. You can read about them here: https://docs.python.org/2/library/multiprocessing.html#windows

DorElias

The main module is imported (but with __name__ != '__main__', because Windows is trying to simulate fork-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side effect in main (e.g. imports, configuration calls with persistent side effects, etc.) might not be properly performed in the child processes.
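
As an illustration (my own sketch, not from the answer): if your main module configures something at import time, such as logging, the re-import is exactly what makes that configuration take effect in the workers as well:

import logging
from file2 import Something

# Module-level side effect: because spawned children re-import the main
# module, the same logging configuration is applied in every worker too.
logging.basicConfig(level=logging.INFO)

if __name__ == '__main__':
    Something().mp()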

As such, if they're not protecting their __main__, the code is not multiprocessing safe (nor is it unittest safe, import safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring multiprocessing-safe main module protection.

ShadowRanger