
I'm using a Pool of workers and want each of them to be initialized with a specific object. More precisely, the initialization cannot be parallelized, so I plan to prepare the objects in the main process before creating the workers and to give each worker one of these objects.

Here is my attempt:

import multiprocessing
import random
import time

class Foo:
    def __init__(self, param):
        # NO WAY TO PARALLELIZE THIS !!
        print(f"Creating Foo with {param}")
        self.param = param

    def __call__(self, x):
        time.sleep(1)
        print("Do the computation", self)
        return self.param + str(x)

def initializer():
    global myfoo

    param = random.choice(["a", "b", "c", "d", "e"])
    myfoo = Foo(param)

def compute(x):
    return myfoo(x)

if __name__ == "__main__":
    multiple_results = []
    with multiprocessing.Pool(2, initializer, ()) as pool:
        for i in range(1, 10):
            work = pool.apply_async(compute, (i,))
            multiple_results.append(work)

        print([res.get(timeout=2) for res in multiple_results])

Here is a possible output:

Creating Foo with b
Creating Foo with a
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
Do the computation <__main__.Foo object at 0x7f8d70aa7fd0>
['b1', 'a2', 'b3', 'a4', 'b5', 'a6', 'b7', 'a8', 'b9']

What is puzzling me is that the address of the Foo object is always the same, while the actual Foo objects are different, as can be seen from the output: "b1", "a2".

My problem is that the two calls to initializer run in parallel, while I do not want the construction of Foo to be parallelized.

I want some magical method add_worker to do something like this:

pool = multiprocessing.Pool()
for i in range(0, 2):
    foo = Foo()
    pool.add_worker(initializer, (foo,))

Any ideas?

EDIT: I solved my real-life problem by moving the import of Keras's VGGNet inside the process instead of at the top of the file. See this answer. Out of curiosity, I remain interested in an answer.

Laurent Claessens
  • Interesting. On Windows I get two separate addresses... Maybe it has to do with a difference in memory management between the two systems. Maybe on Linux the addresses are relative to each process's memory area, so you get the same address... – Tomerikoo Jan 07 '21 at 08:35
  • This is an interesting issue, but I have one question. Why does it bother you? As long as the results are ok... – Tomerikoo Jan 07 '21 at 08:39
  • @Tomerikoo Because the initializer function just doesn't finish when parallelized. In my real-life setting, the Foo object creates a `keras.applications.vgg16`; maybe I should open a specific question "how to parallelize vgg16"? – Laurent Claessens Jan 07 '21 at 08:47
  • The documentation says: `If initializer is not None then each worker process will call initializer(*initargs) when it starts.` If you don't want to run it for each worker, then create `Foo()` before `Pool`. But maybe you should rather select one `param = random.choice(["a", "b", "c", "d", "e"])` before `Pool`, send it to the workers with `apply_async(compute, (param, i,))`, and then every worker would create its own `Foo` with the same `param`. – furas Jan 07 '21 at 09:08
  • @furas This is more or less my idea in the last code snippet with the hypothetical "add_worker" method. But how do I pass the *different* instances of `Foo` to each worker initializer? The syntax `Pool(...)` does not seem to allow that. – Laurent Claessens Jan 07 '21 at 09:12
  • If you want different instances in each worker, then create them in `compute`. – furas Jan 07 '21 at 09:15
  • @furas If you create a new instance of `Foo` in `compute`, every execution of `compute` will have a different `Foo` instance, not every worker. So if you have 10 calls of `compute` with 2 workers, you will get 10 different instances of `Foo`, while @Laurent Claessens seems to want only two of them (one for each worker). If having a different instance of `Foo` for each execution of `compute` is fine, then your approach is right. – PieCot Jan 07 '21 at 09:26
  • @PieCot OK, now I better understand this question. – furas Jan 07 '21 at 09:35
  • @PieCot @furas The approach of furas is reasonable. In `initializer` I do `global foo; foo = None`, then in `compute` I do `if not foo: foo = Foo()`. It turns out that this does not change the result: the initialization does not finish when parallelized. – Laurent Claessens Jan 07 '21 at 09:39

1 Answer


As you said, you can see that instances of Foo are actually different for each worker, even if they apparently have the same address.

As explained in answers to this question and this question, you should not rely on addresses to distinguish between instances on different processes: different instances can show the same address.

PieCot
  • I was also thinking about addresses. Older CPUs (Intel/AMD) could use addresses relative to the process instead of absolute addresses, and it seems something similar happens here. – furas Jan 07 '21 at 09:53
  • Thanks for your answer which clarifies one point. But this does not answer the main question. – Laurent Claessens Jan 07 '21 at 10:38