
I have written a simple example to illustrate exactly what I'm banging my head against. There is probably some very simple explanation that I'm just missing.

import time
import multiprocessing as mp
import os


class SomeOtherClass:
    def __init__(self):
        self.a = 'b'


class SomeProcessor(mp.Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        soc = SomeOtherClass()
        print("PID: ", os.getpid())
        print(soc)

if __name__ == "__main__":
    queue = mp.Queue()

    for n in range(10):
        queue.put(n)

    processes = []

    for proc in range(mp.cpu_count()):
        p = SomeProcessor(queue)
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

Result is:

PID: 11853
<__main__.SomeOtherClass object at 0x7fa637d3f588>
PID: 11854
<__main__.SomeOtherClass object at 0x7fa637d3f588>
PID: 11855
<__main__.SomeOtherClass object at 0x7fa637d3f588>
PID: 11856
<__main__.SomeOtherClass object at 0x7fa637d3f588>

The object address is the same for all of them, even though every initialization happened in a new process. Can anyone point out what the problem is? Thanks.

I also wonder about this behaviour: when I first initialize the object in the main process, cache some values on it, and then initialize the same class in every process, the processes inherit the main process's object.

import time
import multiprocessing as mp
import os
import random

class SomeOtherClass:

    c = {}

    def get(self, a):
        if a in self.c:
            print('Retrieved cached value ...')
            return self.c[a]

        b = random.randint(1,999)

        self.c[a] = b

        return b


class SomeProcessor(mp.Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        pid = os.getpid()
        soc = SomeOtherClass()
        val = soc.get('new')
        print("Value from process {0} is {1}".format(pid, val))

if __name__ == "__main__":
    queue = mp.Queue()

    for n in range(10):
        queue.put(n)

    pid = os.getpid()
    soc = SomeOtherClass()
    val = soc.get('new')
    print("Value from main process {0} is {1}".format(pid, val))

    processes = []

    for proc in range(mp.cpu_count()):
        p = SomeProcessor(queue)
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

Output here is:

Value from main process 13052 is 676
Retrieved cached value ...
Value from process 13054 is 676
Retrieved cached value ...
Value from process 13056 is 676
Retrieved cached value ...
Value from process 13057 is 676
Retrieved cached value ...
Value from process 13055 is 676
Mario Kirov
  • Sorry for not mentioning that. The exact version of Python outputting this result is 3.6.9 – Mario Kirov Sep 07 '21 at 11:26
  • 1
    why do you think it is a problem ? – balderman Sep 07 '21 at 11:28
  • 1
    There is no problem here. The instances are in different processes and don't share state. – AKX Sep 07 '21 at 11:33
  • 1
    @AKX I can prove you wrong. Exactly that's why I'm asking of this behaviour. – Mario Kirov Sep 07 '21 at 11:34
  • I wish people when posting questions tagged with **multiprocessing** would also tag the question with the platform, such as **linux** as they are supposed to. I bet you are running under Linux. – Booboo Sep 07 '21 at 11:34
  • @Booboo Yes, Linux, to be exact xUbuntu 18.04 – Mario Kirov Sep 07 '21 at 11:39
  • @MarioKirov Can you? How? `multiprocessing.Queue`s are special objects that _are_ shared between processes. Also, if you're on Linux and you're using the `fork` spawn method, any object state that exists before you spawn new processes _is_ shared (for reading; writing will not affect the other processes, unless it's one of the special multiprocessing objects). – AKX Sep 07 '21 at 12:39
  • @AKX Well true, but look at the new example I posted in the question. Thanks – Mario Kirov Sep 07 '21 at 12:43
  • 1
    @MarioKirov You declare `SomeOtherClass.c` as a class-level variable. It will be shared between all `SomeOtherClass` instances in the same process too. (If you want it to be instance-level, you'll need to do `self.c = {}` in `__init__`.) By virtue of forking, the same value will be in the child processes too. – AKX Sep 07 '21 at 13:16
  • @AKX That's it, that explained it enough, very helpful. Thank you very much. – Mario Kirov Sep 07 '21 at 13:18

3 Answers


To expand on the comments and discussion:

  • On Linux, multiprocessing defaults to the fork start method. Forking a process means child processes will share a copy-on-write version of the parent process's data. This is why the globally created objects have the same address in the subprocesses.
    • On macOS and Windows, the default start method is spawn – no objects are shared in that case.
  • The subprocesses will have their unique copies of the objects as soon as they write to them (and internally in CPython, in fact, when they even read them, due to the reference counter being in the object header).
  • A variable defined as
    class SomeClass:
        container = {}
    
    is class-level, not instance-level and will be shared between all instances of SomeClass. That is,
    a = SomeClass()
    b = SomeClass()
    print(a is b)  # False
    print(a.container is b.container is SomeClass.container)  # True
    a.container["x"] = True
    print("x" in b.container)  # True
    print("x" in SomeClass.container)  # True
    
    By virtue of the class's state being forked into the subprocess, the shared container also seems shared. However, writing into the container in a subprocess will not appear in the parent or sibling processes. Only certain special multiprocessing types (and certain lower-level primitives) can span process boundaries.
  • To correctly separate that container between instances and processes, it will need to be instance-level:
    class SomeClass:
        def __init__(self):
            self.container = {}
    
    (However, of course, if a SomeClass is globally instantiated, and a process is forked, its state at the time of the fork will be available in subprocesses.)
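If the goal were a cache that is genuinely shared across processes, one option (a sketch, not part of the question's code; `worker` and the other names here are illustrative) is a manager-backed dict, whose proxy reads and writes do cross process boundaries:

```python
import multiprocessing as mp
import os
import random


def worker(cache, lock):
    # Check the shared cache before computing; the manager proxy makes
    # reads and writes visible to every process.
    with lock:
        if 'new' not in cache:
            cache['new'] = random.randint(1, 999)
        val = cache['new']
    print("Value from process {0} is {1}".format(os.getpid(), val))


if __name__ == "__main__":
    with mp.Manager() as manager:
        cache = manager.dict()  # proxy to a dict living in the manager process
        lock = manager.Lock()   # guard the check-then-set against races
        procs = [mp.Process(target=worker, args=(cache, lock)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Unlike the class-level dict, the value written by the first
        # process is visible in the parent as well.
        print("Value in main process is", cache['new'])
```

Note the lock: with a plain dict the check-then-set race doesn't matter much, but with a truly shared cache two processes could otherwise both miss and both write.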
AKX
  • 1
    Very well explained. Thank you one more time for the deep and simple explanations. – Mario Kirov Sep 07 '21 at 13:43
  • I am not sure it *fully* explains what the OP sees. I agree at the time of the *fork* all objects in the forked processes will have the same address as the main process. But in the OP's case each process is now continuing to run and each process is creating *new* objects. It is these new objects in each address space that have the same address. Now you can argue that each forked process is proceeding identically to one another and *that* is why these objects will have the same address. But you did not argue that. If memory allocation used a random number generator, you would *not* see this. – Booboo Sep 08 '21 at 21:44

tl;dr: They're actually not the same instance, so don't worry about that.

Well that's interesting. Their memory reference is exactly the same, but the instances are definitely different. If we modify the code like this:

import time
import multiprocessing as mp
import os


class SomeOtherClass:
    def __init__(self, num):
        self.a = num  # <-- Let's identify the instance with the pid
    
    def __str__(self):
        return f"I'm number {self.a}"  # <-- Better representation of the instance


class SomeProcessor(mp.Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        soc = SomeOtherClass(os.getpid())  # <-- Use the PID to instantiate different objects
        print("PID: ", os.getpid())
        print(soc)
        time.sleep(1)
        print(soc)  # <-- Give it a second and print again

if __name__ == "__main__":
    queue = mp.Queue()

    for n in range(10):
        queue.put(n)

    processes = []

    for proc in range(mp.cpu_count()):
        p = SomeProcessor(queue)
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

We can see that the instances are definitely different and that they aren't being modified, because after the time.sleep() their attributes are still unchanged:

PID:  668424
I'm number 668424
PID:  668425
I'm number 668425
PID:  668426
I'm number 668426
...
I'm number 668435
I'm number 668424
I'm number 668426
...

Yet, if we remove the __str__ method, I still see the same memory reference:

<__main__.SomeOtherClass object at 0x7f3e08d83bb0>
PID:  669008
<__main__.SomeOtherClass object at 0x7f3e08d83bb0>
PID:  669009
<__main__.SomeOtherClass object at 0x7f3e08d83bb0>
PID:  669010
...
<__main__.SomeOtherClass object at 0x7f3e08d83bb0>
<__main__.SomeOtherClass object at 0x7f3e08d83bb0>
<__main__.SomeOtherClass object at 0x7f3e08d83bb0>
...

To be honest, I don't fully know the reason why this happens, so other people may be able to help you more. As the user Booboo has said, you're seeing this because Linux uses fork to start a new process. I ran this on a Linux machine too; if Windows had been used, the memory references would be different.
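One way to convince yourself of this (a sketch; `Holder` and `mutate` are illustrative names, and the matching addresses assume the fork start method on Linux) is to mutate the object in the child and then check the parent afterwards:

```python
import multiprocessing as mp
import os


class Holder:
    def __init__(self):
        self.value = 'parent'


def mutate(holder, queue):
    # Runs in the child process: with fork, the object sits at the same
    # virtual address, but on a private copy-on-write page.
    holder.value = 'child'
    queue.put((hex(id(holder)), holder.value))


if __name__ == "__main__":
    holder = Holder()
    queue = mp.Queue()
    p = mp.Process(target=mutate, args=(holder, queue))
    p.start()
    child_addr, child_value = queue.get()
    p.join()
    print("child saw:       ", child_addr, child_value)
    print("parent still has:", hex(id(holder)), holder.value)
    # Under fork the two addresses match, yet the parent's attribute is
    # untouched: same virtual address, separate address spaces.
```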

Shinra tensei
  • The processes are parallel alright. The GIL does not span processes, and during a `fork` there's no pausing or pickling involved. – AKX Sep 07 '21 at 13:39
  • @AKX no, not during the fork, but during the change of process. afaik, `multiprocessing` uses `pickle` in order to save the states of the objects in the process, so I thought it was possible that the instances that were being unpickled ended up in the same memory address as the ones in the other processes. However, it's true that `multiprocessing` side-steps the GIL – Shinra tensei Sep 07 '21 at 14:00
  • Forking _is_ "the change of process". Multiprocessing only serializes (e.g. pickles) objects when you're explicitly sending them across processes, e.g. with queues. – AKX Sep 07 '21 at 14:21
  • All right, I understand what you're saying and it's a good explanation. There's only one thing we both might be misunderstanding: 'Forking is "the change of process"'. I might not be expressing myself correctly, when I say "the change of process" I'm not talking about the creation of a new process, I already know that's what forking is, I'm rather talking about the context switching. There's no forking involved there. – Shinra tensei Sep 08 '21 at 10:46

Look at the modified code below, which shows that every SomeOtherClass instance is different.

import time
import multiprocessing as mp
import os


class SomeOtherClass:

    def __new__(cls, *args, **kwargs):
        print('-- inside __new__ --')
        return super(SomeOtherClass, cls).__new__(cls, *args, **kwargs)

    def __init__(self):
        self.a = os.getpid()

    def __str__(self):
        return f'{self.a}'


class SomeProcessor(mp.Process):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def run(self):
        soc = SomeOtherClass()
        print("PID: ", os.getpid())
        print(soc)

if __name__ == "__main__":
    queue = mp.Queue()

    for n in range(10):
        queue.put(n)

    processes = []

    for proc in range(mp.cpu_count()):
        p = SomeProcessor(queue)
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

Output:

-- inside __new__ --
PID:  25054
25054
-- inside __new__ --
PID:  25055
25055
-- inside __new__ --
PID:  25056
25056
-- inside __new__ --
PID:  25057
25057
-- inside __new__ --
PID:  25058
25058
-- inside __new__ --
PID:  25059
25059
-- inside __new__ --
PID:  25060
25060
-- inside __new__ --
PID:  25061
25061
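To make the same point without relying on print order, each child can report its PID and the object's `id()` back through a queue (a sketch; `report` is an illustrative name). The PIDs always differ, even when the ids coincide, because `id()` is only unique within a single address space:

```python
import multiprocessing as mp
import os


class SomeOtherClass:
    pass


def report(queue):
    # Each child creates its own instance and reports where it lives.
    soc = SomeOtherClass()
    queue.put((os.getpid(), id(soc)))


if __name__ == "__main__":
    queue = mp.Queue()
    procs = [mp.Process(target=report, args=(queue,)) for _ in range(4)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    print("distinct PIDs:", len({pid for pid, _ in results}))  # one per process
    print("ids:", {obj_id for _, obj_id in results})  # may collapse to one value under fork
```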
balderman
  • True and not true. They are in different processes, indeed. But what I want to accomplish is having a new instance, not the same instance, in a different process. – Mario Kirov Sep 07 '21 at 11:33
  • `same instance on a different process` - this is not the situation, since we have clear evidence that `__init__` was called N times. Isn't it? – balderman Sep 07 '21 at 11:36
  • 3
    @MarioKirov They *are* different instances by dint of being in different address spaces. They just happened to have the same address. I conjectured that you are running under Linux. If you were running under Windows you would probably not see this. Just guessing. When I run this under Linux I get the same addresses but not under Windows. Linux creates processes using *fork* vs. *spawn* for Windows. – Booboo Sep 07 '21 at 11:40
  • I have added `__new__` implementation in order to convince you that it is NOT the same instance. Look at the code. – balderman Sep 07 '21 at 11:40
  • @balderman Ok in this case you are right, they are different, my bad, just because I cut down the full example. Now added it in the question below the first one. Thanks – Mario Kirov Sep 07 '21 at 12:14
  • @MarioKirov Great. Feel free to accept the answer. – balderman Sep 07 '21 at 12:41
  • @balderman Can you take a look at new example in the question. Thanks – Mario Kirov Sep 07 '21 at 12:42
  • @MarioKirov I believe this is something totally different from the original code. I think overloading a post with 2 questions is not the right thing to do. You can accept and close this question and create a new one. – balderman Sep 07 '21 at 12:47