
Just getting started with using the multiprocessing library in my code base to parallelise a simple for loop. Previously, in a serial for loop, I would import a custom configuration .py file and pass it to a function to be run.

However, I'm having issues with passing the configuration module into the parallelised function.

NB. There are multiple custom configuration.py files which I want to pass into the different processes.

Example:

import importlib
import multiprocessing as mp

def get_custom_config():
    config_list = []
    for project_config in configs:
        config = importlib.import_module("config.%s.%s" % (prefix, project_config))
        config_list.append(config)
    return config_list

def print_config(config):
    print config.something_in_config_file

if __name__ == "__main__":
    config_list = get_custom_config()

    pool = mp.Pool(processes=2)
    pool.map(print_config, config_list)

Returns:

  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
cPickle.PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed

What is the best way of passing a module to a parallel process?

Rekovni

1 Answer


I do have a possible solution for you, but I don't like the approach you are taking:

config = importlib.import_module("config.%s.%s" % (prefix, project_config))

You should try to have config as a dictionary of key-value pairs instead of a module, or import it that way.
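For instance, a minimal sketch of that conversion (Python 3 syntax; `module_to_dict` and the synthetic `fake_config` module are illustrative, not part of the question's code):

```python
import types

def module_to_dict(module):
    # Keep only plain config attributes; drop dunders and anything
    # that is itself a module, so the result is safely picklable
    return {name: value for name, value in vars(module).items()
            if not name.startswith("_")
            and not isinstance(value, types.ModuleType)}

# A synthetic module stands in for an imported config file
cfg = types.ModuleType("fake_config")
cfg.something_in_config_file = "hello"
cfg.retries = 3

print(module_to_dict(cfg))  # → {'something_in_config_file': 'hello', 'retries': 3}
```

Plain dictionaries pickle without any special handling, so they can be passed straight to `pool.map`.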

The issue is that module objects are not picklable by default, in Python 2.7 or in Python 3.x (some things that fail to pickle in 2.7, such as bound methods, do pickle in Python 3, but modules are still not picklable).
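You can reproduce the failure without multiprocessing at all, since `pool.map` just pickles its arguments (Python 3 sketch; under Python 2.7 the same attempt raises the `cPickle.PicklingError` shown in the question):

```python
import pickle
import types

# A synthetic module object stands in for an imported config module
mod = types.ModuleType("demo_config")

try:
    pickle.dumps(mod)
except Exception as exc:
    # Python 3 raises TypeError; Python 2.7 raises cPickle.PicklingError
    print(type(exc).__name__)
```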

import importlib
import multiprocessing as mp

import copy_reg
import types

configs = ["abc", "def"]


def _pickle_module(module):
    # Reduce a module to its name; it gets re-imported on unpickling
    module_name = module.__name__
    print("pickling " + module_name)
    path = getattr(module, "__file__", None)
    return _unpickle_module, (module_name, path)


def _unpickle_module(module_name, path):
    return importlib.import_module(module_name)


# Register the custom reducer for all module objects
copy_reg.pickle(types.ModuleType, _pickle_module, _unpickle_module)


def get_custom_config():
    config_list = []
    for project_config in configs:
        config = importlib.import_module("config.%s" % (project_config))
        config_list.append(config)
    return config_list


def print_config(config):
    print (vars(config))


if __name__ == "__main__":
    config_list = get_custom_config()

    pool = mp.Pool(processes=2)
    pool.map(print_config, config_list)

This basically re-imports the module in the other process, so remember that you are not sharing data between the processes. This works well for read-only variables.
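The same trick ports to Python 3, where `copy_reg` was renamed `copyreg` and the constructor argument is no longer needed. A minimal sketch, using the stdlib `math` module as a stand-in for a config module:

```python
import copyreg
import importlib
import pickle
import types

def _pickle_module(module):
    # Reduce a module to "re-import it by name on the other side"
    return importlib.import_module, (module.__name__,)

# Register the reducer for all module objects
copyreg.pickle(types.ModuleType, _pickle_module)

import math
restored = pickle.loads(pickle.dumps(math))
print(restored.sqrt(9.0))  # → 3.0
```

Because `importlib.import_module` returns the cached module from `sys.modules`, unpickling in the same process hands back the very same module object.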

But as I mentioned, passing modules to a different process makes little sense. Try to fix your approach instead of using the code I posted.

PS: Solution inspired from Can't pickle <type 'cv2.BRISK'>: attribute lookup cv2.BRISK failed

Tarun Lalwani
  • If I'm using the module as read-only, would using this method be okay? Or is this just bad practice, and instead I should do what you suggested and import the module into a dictionary instead as better practice? (Thanks for the working code as well!) – Rekovni Apr 30 '18 at 15:35
  • 1
    You can import configs at run-time, no issues with that. But when you are sending `config_list` to the pool, it should not contain modules, either send actual configs or pass just the module names and let each process load its own copy of the config – Tarun Lalwani Apr 30 '18 at 15:37
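The "pass just the module names" suggestion from the comment above can be sketched like this (Python 3; stdlib module names stand in for the real `config.<name>` modules, and `load_and_report` is an illustrative helper):

```python
import importlib
import multiprocessing as mp

def load_and_report(module_name):
    # Runs inside the worker process: import the module there and
    # return only picklable data, never the module object itself
    config = importlib.import_module(module_name)
    return module_name, len(vars(config))

if __name__ == "__main__":
    # Stand-ins for the real "config.<name>" module names
    names = ["json", "math"]
    pool = mp.Pool(processes=2)
    print(pool.map(load_and_report, names))
    pool.close()
    pool.join()
```

Strings pickle trivially, so only plain module names cross the process boundary; each worker does its own import.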