I have what should be an "embarrassingly parallel" task: I'm trying to parse a number of log files in a CPU-heavy manner. I don't care about the order they're done in, and the processes don't need to share any resources or threads.
I'm on a Windows machine.
My setup is something like:
main.py:
import parse_file
import multiprocessing
...
files_list = ['c:\file1.log','c:\file2.log']
if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    for this_file in files_list:
        r = pool.apply_async(parse_file.parse, (this_file, parser_config))
        results = r.get()
    ...
    #Code to do stuff with the results
parse_file is basically an entirely self-contained module that doesn't access any shared resources; the results are returned as a list.
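A side note on the loop in main.py: if r.get() is called inside the loop, right after apply_async, it blocks until that one file is finished, so the files are still parsed one at a time. Below is a minimal sketch of submitting everything first and collecting afterwards. The parse function here is a hypothetical stand-in for parse_file.parse, parse_all is a name made up for illustration, and parser_config is an empty placeholder because the real one is not shown. The paths are written as raw strings because 'c:\file1.log' contains \f, which Python reads as a form feed.

```python
import multiprocessing


def parse(path, config):
    # Hypothetical stand-in for parse_file.parse: returns a list of results.
    return [path, config]


def parse_all(files, config):
    with multiprocessing.Pool(None) as pool:
        # Submit every file first so the workers can run concurrently.
        pending = [pool.apply_async(parse, (f, config)) for f in files]
        # Only then block on the results, in submission order.
        return [p.get() for p in pending]


if __name__ == '__main__':
    files_list = [r'c:\file1.log', r'c:\file2.log']
    parser_config = {}  # placeholder; the real config is not shown here
    print(parse_all(files_list, parser_config))
```

Collecting the AsyncResult objects in a list keeps each file's results separate, which matches the "results are returned as a list" behaviour described above.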
This all runs absolutely fine when I run it without multiprocessing, but when I enable it, I get a huge wall of errors indicating that the source module (main.py, the one that creates the pool) is itself being run in parallel. (The error is a database locking error for something that exists only in the source script, not in the parse_file module, and at a point before the multiprocessing code!)
I don't pretend to understand the multiprocessing module, and I worked from other examples here, but none of them include anything that indicates this is normal or why it's happening.
What am I doing wrong? How do I multi-process this task? Thanks!
Easily replicable using this:
test.py:
import multiprocessing
import test_victim
files_list = ['c:\file1.log','c:\file2.log']
print("Hello World")
if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    results = []
    for this_file in files_list:
        r = pool.map_async(test_victim.calculate, range(10), callback=results.append)
    results = r.get()
    print(results)
test_victim.py:
def calculate(value):
    return value * 10
The output when you run test.py should be:
Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
But in reality it is:
Hello World
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Hello World
Hello World
(The actual number of extra "Hello World"s changes every time I run it, between 1 and 4; there should be none.)
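For what it's worth, the symptom is consistent with how multiprocessing starts workers on Windows: there is no fork, so each worker is a fresh interpreter that imports the main module again, and anything at module level outside the `if __name__ == '__main__':` guard runs once per worker. Below is a minimal sketch of test.py with the print moved behind the guard, assuming that is the cause. The name main is introduced here for illustration, and calculate is inlined (instead of imported from test_victim) only to keep the example in one file.

```python
import multiprocessing


def calculate(value):
    # Stand-in for test_victim.calculate, inlined so the sketch is one file.
    return value * 10


def main():
    # Side effects live here, so re-importing this module in a worker
    # process does not repeat them.
    print("Hello World")
    with multiprocessing.Pool(None) as pool:
        results = pool.map(calculate, range(10))
    print(results)
    return results


if __name__ == '__main__':
    main()
```

The same idea would apply to the database code in the real script: anything that should run once belongs inside the guard (or in a function called from it), leaving only imports and definitions at module level.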