2

I have a function which reads in a file, compares a record in that file to a record in another file, and, depending on a rule, appends the record to one of two lists.

I have an empty list for adding matched results to:

match = []

I have a list of restrictions (other_links in the code below) that I want to compare the records in a series of files against.

I have a function that reads in each file and checks whether it contains any matches. If there is a match, I append the record to the match list.

def link_match(file):
    # match and other_links are globals defined elsewhere in the script
    with open(file) as f:
        links = json.load(f)
    for link in links:
        found = False
        for other_link in other_links:
            if link['data'] == other_link['data']:
                match.append(link)
                found = True
        if not found:
            print("not found")

I have numerous files that I wish to compare, and I thus want to use the multiprocessing library.

I create a list of file names to act as function arguments:

list_files=[]
for file in glob.glob("/path/*.json"):
    list_files.append(file)

I then use the map feature to call the function with the different input files:

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=6)
    pool.map(link_match, list_files)
    pool.close()
    pool.join()

CPU use goes through the roof, and by adding a print statement to the function loop I can see that matches are being found and that the function is behaving correctly.

However, the match results list remains empty. What am I doing wrong?

martineau
LearningSlowly
  • Modify `link_match()` to return a list of matches found instead of directly modifying a single global list, and then modify your master script to combine all the returned lists into one big list. – John Gordon Aug 29 '16 at 17:07

4 Answers

2

multiprocessing runs a new instance of Python for each process in the pool: the context is empty (if you use spawn as a start method) or copied (if you use fork), plus copies of any arguments you pass in (either way), and from there the processes are all separate. If you want to pass data between them, there are a few other ways to do it.

  1. Instead of writing to an internal list, write to a file and read from it later when you're done. The largest potential problem here is that only one thing can write to a file at a time, so either you make a lot of separate files (and have to read all of them afterwards) or they all block each other.
  2. Continue with multiprocessing, but use a multiprocessing.Queue instead of a list. This is an object provided specifically for your current use-case: using multiple processes and needing to pass data between them. Assuming that you should indeed be using multiprocessing (that your situation wouldn't be better served by threading, see below), this is probably your best option; a minimal sketch follows this list.
  3. Instead of multiprocessing, use threading. Separate threads all share a single environment. The biggest problem here is that Python only lets one thread actually run Python code at a time per process; this is called the Global Interpreter Lock (GIL). threading is thus useful when the threads will be waiting on external processes (other programs, user input, reading or writing files), but if most of the time is spent in Python code it actually takes longer (because it takes a little time to switch threads, and you're not doing anything to save time). threading has its own queue module; you should probably use that rather than a plain list, otherwise two threads accessing the list at the same time may interfere with each other if the interpreter switches threads at the wrong time. A threading sketch follows the next paragraph.
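
A minimal sketch of option 2, using multiprocessing.Process with a Queue; the worker function and the data here are hypothetical stand-ins for the matching logic:

import multiprocessing

def worker(queue, numbers):
    # stand-in for the per-file matching; put this worker's whole
    # result list on the shared queue in one go
    queue.put([n for n in numbers if n % 2 == 0])

if __name__ == '__main__':
    chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(queue, chunk))
             for chunk in chunks]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # drain before joining
    for p in procs:
        p.join()
    print(results)  # e.g. [[2], [4, 6], [8]] - order may vary

Note that a plain multiprocessing.Queue can't be passed as a pool.map() argument (it must be inherited, or replaced with a Manager().Queue()), which is why this sketch uses Process objects directly.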

Oh, by the way: if you do use threading, Python 3.2 and later has an improved implementation of the GIL, which seems to have at least a good chance of helping. A lot of threading performance is very dependent on your hardware (number of CPU cores) and the exact tasks you're doing, though - it's probably best to try several ways and see what works for you.
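
And a sketch of option 3 using threading with its queue module (Python 3 names; on Python 2 the module is spelled Queue), again with hypothetical stand-in work:

import queue
import threading

def worker(q, numbers):
    # stand-in for the per-file matching; all threads share one queue
    for n in numbers:
        if n % 2 == 0:
            q.put(n)

if __name__ == '__main__':
    q = queue.Queue()
    chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    threads = [threading.Thread(target=worker, args=(q, chunk))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    results = []
    while not q.empty():  # safe here: every thread has finished
        results.append(q.get())
    print(sorted(results))  # [2, 4, 6, 8]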

Vivian
  • Nit-pick: The current context isn't actually copied to each new process. Each one imports the main module so it will start out with any context set up by doing that. This is why it's important to have the `if __name__ == '__main__':` statement in the main module to define the part of the main script that _doesn't_ get executed every time it's imported. See the **Safe importing of main module** subsection of the [Windows](https://docs.python.org/2/library/multiprocessing.html#windows) section of the `multiprocessing` _Programming Guidelines_. – martineau Aug 30 '16 at 16:35
  • @martineau Oops. Correcting. – Vivian Aug 30 '16 at 17:07
  • @martineau Checked - looks to me like it possibly goes either way (see [Contexts and Start Methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)) depending on OS and certain settings. Adjusted accordingly. – Vivian Aug 30 '16 at 17:20
  • It's unclear what version of Python and OS the OP is using. Regardless, it's important to write code that works on multiple OSs whenever feasible, as it usually is since that's one of Python's greatest strengths, IMO. Besides, even on other OSs, the `mp.set_start_method()` call is placed after an `if __name__ == '__main__':` statement in the examples to put it into effect before any other processes are started. – martineau Aug 30 '16 at 17:37
1

Multiprocessing creates multiple processes. The context of your `match` variable will now be in each child process, not in the parent Python process that kicked the processing off.

Try writing the list results out to a file in your function to see what I mean.
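
A sketch of that diagnostic, with a hypothetical pid-based filename so the child processes don't overwrite each other's output:

import os

def link_match(file):
    # ... the matching logic from the question ...
    # each worker process appends to its own pid-named file, showing
    # that `match` grows in the children while the parent's stays empty
    with open("matches_%d.log" % os.getpid(), "a") as out:
        out.write("%s: %d matches so far\n" % (file, len(match)))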

cthrall
1

To expand on cthrall's answer, you need to return something from your function in order to pass the info back to the main process, e.g.

def link_match(file):
    match = []
    # ... the matching logic from the question, appending to this local list ...
    return match

# in the main process:
all_matches = pool.map(link_match, list_files)

A match list is returned from each worker process, and map will return a list of lists in this case. You can then flatten it again to get the final output, as shown below.
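
For instance, the flattening step could use itertools.chain (a sketch continuing from the snippet above):

from itertools import chain

all_matches = pool.map(link_match, list_files)  # one list per input file
flat_matches = list(chain.from_iterable(all_matches))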


Alternatively, you can use a shared list, but in my opinion this will just add more headaches.

Maximilian Peters
1

When multiprocessing, each subprocess gets its own copy of any global variables in the main module defined before the if __name__ == '__main__': statement. This means that the link_match() function in each one of the processes will be accessing a different match list in your code.

One workaround is to use a shared list, which in turn requires a SyncManager (created by calling multiprocessing.Manager()) to synchronize access to the shared resource among the processes. The manager is then used to create the list to store the results (which I have named matches instead of match) in the code below.

I also had to use functools.partial() to create a single-argument callable out of the revised link_match() function, which now takes two arguments rather than the one that pool.map() expects.

from functools import partial
import glob
import json
import multiprocessing

def link_match(matches, file):  # note: added results list argument
    with open(file) as f:
        links = json.load(f)
    for link in links:
        found = False
        for other_link in other_links:
            if link['data'] == other_link['data']:
                matches.append(link)
                found = True
        if not found:
            print("not found")

if __name__ == '__main__':
    manager = multiprocessing.Manager()  # create SyncManager
    matches = manager.list()  # create a shared list here
    link_matches = partial(link_match, matches)  # create one arg callable to
                                                 # pass to pool.map()
    pool = multiprocessing.Pool(processes=6)
    list_files = glob.glob("/path/*.json")  # only used here
    pool.map(link_matches, list_files)  # apply partial to files list
    pool.close()
    pool.join()
    print(matches)
martineau
  • As always, many thanks @martineau. This is very useful. I have one final obstacle however. I am unable to `dump` the resulting list. Error message `TypeError: is not JSON serializable`. I then tried to print the list before dumping and got a `ListProxy` object. I have never come across a ListProxy object before! In order to convert it I tried `manager.dict()` in place of `manager.list()` and also converting the final list to a `dict`, but to no avail. – LearningSlowly Aug 29 '16 at 20:09
  • I was able to print its contents, as indicated by the last line of my answer. Try using `list(matches)` to convert it to a regular list. – martineau Aug 29 '16 at 21:13
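
To illustrate that last suggestion, the conversion before dumping would look like this (the output filename is hypothetical):

import json

with open("matches_out.json", "w") as out:
    json.dump(list(matches), out)  # list() turns the ListProxy into a plain list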