
I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.

Here is an abstraction of what I am trying to do:

from multiprocessing import Pool

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify()  # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [ARGS ARRAY GOES HERE]
output = poolVar.map(myFunction, argsArray)

The function f(x) is contained in a *.so file, i.e., it is calling a C function.

The problem I am having is that the value of the output variable is different each time I run my program (even though the function myObject.f() is a deterministic function). (If I only have one process then the output variable is the same each time I run the program.)

I have tried creating the object rather than storing it as a global variable:

def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)

However, in my program the object creation is expensive, and thus, it is a lot easier to create myObject once and then modify it each time I call myFunction2(). Thus, I would like to not have to create the object each time.

Do you have any tips? I am very new to parallel programming so I could be going about this all wrong. I decided to use the Pool class since I wanted to start with something simple. But I am willing to try a better way of doing it.

Hugh Medal
  • Could you fix this program to be one that runs? Declaring the functions after you try to use them won't work in Python (and could be relevant to your problem) – Thomas Sep 13 '13 at 05:24
  • Is `myObject.modify()` idempotent? That is, can you call it an arbitrary number of times without changing what it does (such as, a `reset()` function)? If so, your code should work. If not, you'll have issues because the different processes will each modify their own copies of the object separately from each other, and so you may get duplicated values across processes. – Blckknght Sep 14 '13 at 04:51
  • Yes, myObject.modify() is idempotent. – Hugh Medal Sep 16 '13 at 11:45

2 Answers


> I am using the Pool class from python's multiprocessing library to do some shared memory processing on an HPC cluster.

Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory: each worker gets its own copy of the global variables, so changes made in a worker do not affect the values in the original process.
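To make the copying visible, here is a minimal sketch. The names `counter`, `bump` and `run_demo` are made up for illustration, and the explicit "fork" start method is an assumption that holds on Unix-like systems such as a typical HPC node.

```python
import multiprocessing as mp

counter = 0  # module-level global; every worker process gets its own copy


def bump(_):
    """Increment this process's copy of the global and return the new value."""
    global counter
    counter += 1
    return counter


def run_demo(n_tasks=8, n_workers=2):
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    with ctx.Pool(n_workers) as pool:
        seen_in_workers = pool.map(bump, range(n_tasks))
    # The workers only changed their own copies, never the parent's.
    return seen_in_workers, counter


if __name__ == "__main__":
    seen, parent_value = run_demo()
    print("values seen in workers:", seen)
    print("value in the parent:", parent_value)
```

The values returned by the workers depend on how the tasks happen to be distributed, but the parent's `counter` stays at its initial value.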

If you want to use shared memory between processes, then you must use multiprocessing's data types, such as Value or Array, or use a Manager to create shared lists etc.
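As a rough sketch of the shared-memory route, the following uses a `Value` that all workers update. The helper names (`init_worker`, `add_to_total`, `run_demo`) are hypothetical, and the "fork" start method is again an assumption about the platform. The `Value` is handed to the workers through the pool's initializer, because synchronized objects cannot be sent as ordinary task arguments.

```python
import multiprocessing as mp


def init_worker(shared_total):
    # Runs once in each worker: keep a reference to the shared Value.
    global total
    total = shared_total


def add_to_total(x):
    # The lock makes the read-modify-write safe across processes.
    with total.get_lock():
        total.value += x


def run_demo(values=range(10), n_workers=4):
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    shared_total = ctx.Value("i", 0)  # a C int living in shared memory
    with ctx.Pool(n_workers, initializer=init_worker,
                  initargs=(shared_total,)) as pool:
        pool.map(add_to_total, values)
    return shared_total.value


if __name__ == "__main__":
    print(run_demo())
```

Unlike an ordinary global, the updates made in the workers are visible in the parent, since every process refers to the same block of shared memory.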

In particular, you might be interested in the Manager.register method, which allows the Manager to create shared custom objects (although they must be picklable).
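A possible shape for this, under stated assumptions: `Counter`, `CounterManager`, `work` and `run_demo` are illustrative names, not part of the library, and the "fork" start method is assumed. The custom class is registered on a `BaseManager` subclass; the object itself lives in the manager process and the workers talk to it through a proxy.

```python
import multiprocessing as mp
import threading
from functools import partial
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical shared object; it lives inside the manager process."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()  # the manager serves clients in threads

    def increment(self, amount=1):
        with self._lock:
            self._count += amount

    def get(self):
        return self._count


class CounterManager(BaseManager):
    pass


# Make manager.Counter() available; it returns a proxy to a Counter.
CounterManager.register("Counter", Counter)


def work(counter_proxy, x):
    counter_proxy.increment(x)  # forwarded to the manager process
    return x * x


def run_demo(n_workers=2):
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    with CounterManager(ctx=ctx) as manager:
        counter = manager.Counter()
        with ctx.Pool(n_workers) as pool:
            squares = pool.map(partial(work, counter), range(5))
        total = counter.get()
    return squares, total


if __name__ == "__main__":
    print(run_demo())
```

Every method call on the proxy is a round trip to the manager process, so this is convenient rather than fast.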

However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.

Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.

For example, in its simplest form, to create a global variable in the worker process:

def initializer():
    global data
    data = createObject()

Used as:

pool = Pool(4, initializer, ())

Then the worker functions can use the data global variable without worries.
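Put together, the initializer approach might look like the sketch below. `ExpensiveObject` is a hypothetical stand-in for the question's object, with a `modify()` that always resets the same state; `init_worker`, `task` and `run_demo` are also made-up names, and the "fork" start method is an assumption about the platform. Each worker builds its object once and then reuses it for every task it receives.

```python
import multiprocessing as mp


class ExpensiveObject:
    """Hypothetical stand-in for the question's myObject."""

    def __init__(self):
        self.offset = None  # imagine costly setup here

    def modify(self):
        # Idempotent: always puts the object into the same state.
        self.offset = 10

    def f(self, x):
        return x + self.offset


def init_worker():
    # Runs once per worker process, so the object is created once per worker.
    global worker_object
    worker_object = ExpensiveObject()


def task(x):
    worker_object.modify()  # changes only this worker's private copy
    return worker_object.f(x)


def run_demo(args=(1, 2, 3, 4), n_workers=2):
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    with ctx.Pool(n_workers, initializer=init_worker) as pool:
        return pool.map(task, args)


if __name__ == "__main__":
    print(run_demo())
```

Because `modify()` resets the state before each call to `f`, the result does not depend on which worker handles which argument.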


Style note: Never use the name of a built-in for your variables/modules. In your case object is a built-in. Otherwise you'll end up with unexpected errors which may be obscure and hard to track down.

Bakuriu
  • Thanks! Actually, I do want each worker to have its own copy of the global variable and be able to modify it. (I changed my question to reflect this.) I will check out your answer. – Hugh Medal Sep 13 '13 at 12:09
  • I tried your solution above but it is not working. I still have the same problem. – Hugh Medal Sep 24 '13 at 15:06
  • Thank you so much, I was using processes for opening multiple webpages in parallel instead of threads and wondering why the global variables are not working as expected. – Vissu Nov 26 '19 at 18:16
  • What happens if global data is being modified by worker processes? Will the result reflect in the global data? – CKM Apr 29 '20 at 14:05
  • @chandresh If you want data to be shared by multiple processes you need to use special objects from the `multiprocessing` module, see [the documentation](https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes). Changes to **these** objects will be reflected among processes. – Bakuriu Apr 30 '20 at 07:02
  • @Bakuriu actually, I've looked at the docs, especially `Value, Array, Manager`, but none of them worked for me. I tried your `initializer` approach but it did not work. I'm gonna open a new question [here](https://stackoverflow.com/questions/61518970/sharing-mutable-global-variable-in-python-multiprocessing-pool) – CKM Apr 30 '20 at 08:22
  • If you do not edit the global variable it should not be copied, at least not in Linux. – Radio Controlled Oct 15 '20 at 09:39
  • I tried `initializer`, but the `data` variable is not available to the task function when it is submitted to the pool; the task fails with `NameError: name 'data' is not defined` – Dee Dec 31 '22 at 07:46
  • @Dee You can open a new question, but I believe you are doing something different. I just tried with Python 3.10.6 and it works fine. Create a file `test_pool.py` containing `from multiprocessing import Pool` and the definition of `initializer` above (maybe replace `createObject()` with `{'a':1}` as an example), then define the `Pool` as described in the answer, define a function `def print_data(): print(data)`, and finally call `pool.apply(print_data)`. Run it with `python3 test_pool.py` and you will see `{'a':1}` printed. – Bakuriu Dec 31 '22 at 09:26

The global keyword only affects the module (file) in which it is used. Another way is to set the value dynamically on a separate module in the pool process initialiser; somefile.py can just be an empty file:

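import importlib
from multiprocessing import Pool

def pool_process_init():
    # import_module takes the module name, without the ".py" extension
    m = importlib.import_module("somefile")
    m.my_global_var = "some value"

pool = Pool(4, initializer=pool_process_init)

How to use the variable in a task:

def my_coroutine():
    # returns the module object already loaded in this worker
    m = importlib.import_module("somefile")
    print(m.my_global_var)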
Dee
  • This is false. You are probably missing some detail. The initializer can set new globals and they are available to the functions. Open a question showing the actual code you are trying and we will see what the problem is, but my answer works exactly as described up to Python 3.10. – Bakuriu Dec 31 '22 at 09:29
  • Oh, but it shows name not found for me – Dee Dec 31 '22 at 10:10
  • I corrected my answer – Dee Dec 31 '22 at 10:11
  • Possibly my case is that the init func and task func are on different files – Dee Dec 31 '22 at 10:15