
I am working on a project that requires me to extract a lot of information from some files. The format and most of the details of the project don't matter for this question. What I mostly don't understand is how to share this dictionary with all the processes in the process pool.

Here is my code (I changed variable names and trimmed it down to just the need-to-know parts):

import json

import multiprocessing
from multiprocessing import Pool, Lock, Manager

import glob
import os

def record(thing, map):

    with mutex:
        if(thing in map):
            map[thing] += 1
        else:
            map[thing] = 1


def getThing(file, n, map):
    #do stuff
    thing = file.read()
    record(thing, map)


def init(l):
    global mutex
    mutex = l

def main():

    #create a manager to manage shared dictionaries
    manager = Manager()

    #get the list of filenames to be analyzed
    fileSet1=glob.glob("filesSet1/*")
    fileSet2=glob.glob("fileSet2/*")

    #create a global mutex for the processes to share
    l = Lock()   

    map = manager.dict()
    #create a process pool, give it the global mutex, and max cpu count-1 (manager is its own process)
    with Pool(processes=multiprocessing.cpu_count()-1, initializer=init, initargs=(l,)) as pool:
        pool.map(lambda file: getThing(file, 2, map), fileSet1) #This line is what i need help with

main()

From what I understand, that lambda function should work. The line I need help with is: pool.map(lambda file: getThing(file, 2, map), fileSet1). It gives me an error there. The error is "AttributeError: Can't pickle local object 'main.<locals>.<lambda>'".

Any help would be appreciated!

Paul

1 Answer


In order to execute the tasks in parallel, multiprocessing "pickles" the task function. In your case, this "task function" is lambda file: getThing(file, 2, map).

Unfortunately for you, lambda functions cannot be pickled in Python by default (see also this stackoverflow post). Let me illustrate the problem with a minimal bit of code:

import multiprocessing

l = range(12)

def not_a_lambda(e):
    print(e)

def main():
    with multiprocessing.Pool() as pool:
        pool.map(not_a_lambda, l)        # Case (A)
        pool.map(lambda e: print(e), l)  # Case (B)

main()

In Case A we have a proper, free function which can be pickled, and thus the pool.map operation will work. In Case B we have a lambda function and a crash will occur.

One possible solution is to use a proper module-scope function (like my not_a_lambda). Another solution is to rely on a third-party module, like dill, to extend the pickling functionality. In the latter case, you'd use for example pathos as a replacement for the regular multiprocessing module. Finally, you could create a Worker class which collects your shared state as members. This could look something like this:

import multiprocessing

class Worker:
    def __init__(self, mutex, map):
        self.mutex = mutex
        self.map = map

    def __call__(self, e):
        print("Hello from Worker e=%r" % (e, ))
        with self.mutex:
            k, v = e
            self.map[k] = v
        print("Goodbye from Worker e=%r" % (e, ))

def main():
    manager = multiprocessing.Manager()
    mutex = manager.Lock()
    map = manager.dict()

    # the Worker instance is pickled and sent to each process, but its
    # mutex and map members are Manager proxies, so every process talks
    # to the same underlying lock and dictionary; always lock the mutex
    # before modifying the shared state.
    worker = Worker(mutex, map)

    data = {"a": 1, "b": 2, "c": 3}  # some example key/value pairs

    with multiprocessing.Pool() as pool:
        pool.map(worker, data.items())

main()
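The first option (a module-scope function) deserves a quick sketch as well. Since you mentioned the whole point of the lambda was to pass in n and the shared dictionary: functools.partial can bind those extra arguments to a module-level function, and the resulting callable pickles cleanly because the wrapped function lives at module scope. (The file names below are made up for the demo; in your code you'd pass your glob results instead.)

```python
import multiprocessing
from functools import partial

def get_thing(shared, n, name):
    # module-level function: picklable by the default pickler
    shared[name] = n

def main():
    manager = multiprocessing.Manager()
    shared = manager.dict()

    names = ["a.txt", "b.txt", "c.txt"]  # hypothetical file names

    # partial binds the shared dict and n up front; the resulting object
    # pickles fine because get_thing is defined at module scope
    task = partial(get_thing, shared, 2)

    with multiprocessing.Pool(processes=2) as pool:
        pool.map(task, names)

    print(dict(shared))  # all three names now map to 2

if __name__ == "__main__":
    main()
```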
elemakil
  • Thank you! Very helpful. The entire point of me using the lambda is to pass in the n and the shared dictionary. Is there a better solution to this? It seems the only way I can have a shared dictionary between processes is by using a manager and then passing the dictionary between functions. As far as I know, making it global (like the lock) will not work – Paul Mar 02 '19 at 18:39
  • You can have a proper function which also gets the shared dictionary and other parameters passed in by using the `pool.starmap` function, see for example [this](stackoverflow) post. – elemakil Mar 02 '19 at 23:28
  • Can't edit my previous post, seems I screwed up the link, I meant [this](https://stackoverflow.com/questions/5442910/python-multiprocessing-pool-map-for-multiple-arguments) post. – elemakil Mar 03 '19 at 11:13