
I have what should be an "embarrassingly parallel" task: I'm trying to parse a number of log files in a CPU-heavy manner. I don't care about the order they're done in, and the processes don't need to share any resources or threads.

I'm on a Windows machine.

My setup is something like:

main.py

import parse_file
import multiprocessing

...

files_list = [r'c:\file1.log', r'c:\file2.log']

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)

    for this_file in files_list:
        r = pool.apply_async(parse_file.parse, (this_file, parser_config))
        results = r.get()

...

#Code to do stuff with the results

parse_file is basically an entirely self-contained module that doesn't access any shared resources - the results are returned as a list.
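For context, the parse function is something like this (heavily simplified; the real parsing logic is far more involved, and the body here is just a placeholder):

parse_file.py

def parse(file_path, config):
    # CPU-heavy, line-by-line parsing (placeholder logic)
    results = []
    with open(file_path) as f:
        for line in f:
            results.append(line.strip())
    return results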

This all runs absolutely fine without multiprocessing, but as soon as I enable it I get a huge wall of errors indicating that the source module (the one the code above is in) is what's being run in parallel. (The error is a database locking error for something that exists only in the source script (not the parse_file module), and at a point before the multiprocessing stuff!)

I don't pretend to understand the multiprocessing module, and I worked from other examples here, but none of them include anything that indicates this is normal or explains why it's happening.

What am I doing wrong? How do I multi-process this task? Thanks!


Easily reproducible using this: test.py

import multiprocessing
import test_victim

files_list = [r'c:\file1.log', r'c:\file2.log']

print("Hello World")

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    results = []
    for this_file in files_list:
        r = pool.map_async(test_victim.calculate, range(10), callback=results.append)
        results = r.get()

    print(results)

test_victim.py:

def calculate(value):
    return value * 10

The output when you run test.py should be:

Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

But in reality it is:

Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Hello World
Hello World

(The actual number of extra "Hello World"s changes every time I run it, between 1 and 4; there should be none.)

GIS-Jonathan
  • Make sure the `for-loop` is indented so as to be inside the `if __name__ ...` statement. Otherwise the code will import-bomb on Windows. – unutbu Dec 25 '13 at 17:02
  • @unutbu - Thanks. Per your suggestion I've just done that but it makes absolutely no difference I'm afraid. :-( (Example updated to reflect this.) – GIS-Jonathan Dec 25 '13 at 17:04
  • Please post the stack trace, at least the first few and last few lines. – unutbu Dec 25 '13 at 17:04
  • In the stack trace, what is the last line that begins with `File` which refers to the path to your script (`main.py`)? and what is the line that follows it? – unutbu Dec 25 '13 at 17:12
  • I don't know what the problem is, but it looks to me like we would need to see the structure of how you are using sqlite3. A runnable example which reproduces the error would be terrific. – unutbu Dec 25 '13 at 17:33
  • @unutbu - Simple reproducible case appended to the question. No SQLite necessary. :-) – GIS-Jonathan Dec 25 '13 at 17:49

1 Answer


On Windows, when Python executes

pool = multiprocessing.Pool(None) 

new Python processes are spawned. Because Windows does not have os.fork, these new Python processes re-import the calling module. Thus, anything not inside

if __name__ == '__main__': 

gets executed once for each process spawned. That is why you are seeing multiple Hello Worlds.
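You can watch this happen by printing the current process name at module level. In a sketch like the following (the square function and pool size are arbitrary, purely for illustration), the unguarded print runs once in the parent and once in every spawned worker:

import multiprocessing

# On Windows this line runs in the parent AND in each spawned worker,
# because every worker re-imports this module.
print("imported by", multiprocessing.current_process().name)

def square(x):
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    print(pool.map(square, range(5)))  # only the parent reaches this line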

Be sure to read the "Safe importing of main module" warning in the docs.


So to fix, put all the code that needs to run only once inside the

if __name__ == '__main__': 

statement.


For example, your runnable example would be fixed by placing

print("Hello World")

inside the if __name__ == '__main__' statement:

import multiprocessing
import test_victim

# Raw strings, so '\f' is not read as a form-feed character
files_list = [r'c:\file1.log', r'c:\file2.log']

def main():
    print("Hello World")
    pool = multiprocessing.Pool(None)
    results = []
    for this_file in files_list:
        r = pool.map_async(test_victim.calculate, range(10))
        results = r.get()  # blocks until this batch of jobs finishes

    print(results)

if __name__ == '__main__':
    main()

yields

Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

Especially on Windows, scripts that use multiprocessing must be both runnable (as a script) and importable. An easy way to make a script importable is to structure it as is shown above. Place everything that the script should execute inside a function called main, and then just use

if __name__ == '__main__':
    main()

at the end of the script. The stuff before main should just be import statements and the definition of global constants.
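Applied to your original main.py, the same pattern might look roughly like this (parser_config and the result handling are placeholders based on your description):

import multiprocessing
import parse_file

files_list = [r'c:\file1.log', r'c:\file2.log']

def main():
    parser_config = {}  # placeholder; build your real parser config here
    pool = multiprocessing.Pool(None)
    # Submit every job before collecting any results, so the files
    # are parsed in parallel rather than one at a time.
    async_results = [pool.apply_async(parse_file.parse, (f, parser_config))
                     for f in files_list]
    pool.close()
    results = [r.get() for r in async_results]  # one result list per file
    # ... code to do stuff with the results ...
    print(results)

if __name__ == '__main__':
    main()

Note that calling r.get() immediately inside the loop, as in your original snippet, would block on each job in turn and effectively serialize the work; collecting the AsyncResult objects first keeps all the workers busy.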

unutbu
  • Thanks. I actually tried reading that document before posting this question, but it assumes a level of knowledge I simply don't have (i.e., it made no sense). main.py is too complex to add the __name__ thing to, so I created a separate file with just the multiprocessing stuff and am calling that from main.py, but I still can't get it to work (anything inside the __name__ if doesn't get run *ever*, and if it's left out I get all manner of randomness). Can you provide an example with test.py please? – GIS-Jonathan Dec 25 '13 at 18:09
  • Thanks! In the end I just stuck `if __name__ == '__main__':` as the very first line of my main.py script. Everything is now ugly-indented but oh well. Thanks again. – GIS-Jonathan Dec 25 '13 at 18:23