
Let's say I have the following class defined:

class Animal:
    def __init__(self):
        self.isAlive = True

Along with the following function:

def Kill_Animal(animal):
    animal.isAlive = False

Now, if I create a List of animals, as follows:

AnimalsList = [Animal() for i in range(0,5)]

If the function is applied to any instance of the Animal class inside the list, the isAlive attribute gets changed to False. However, if I wanted to apply this function to the list and change its contents via the multiprocessing library, what would be the correct way to do it?

I have tried the following:

from multiprocessing import Process, Pool

pool = Pool()
pool.map(Kill_Animal, AnimalsList[0:3])

However, if I try checking the attribute for all the elements inside the list, the result is as follows:

[print(animal.isAlive) for animal in AnimalsList]

Output: True True True True True

Additionally, if I check the ID of the object that is passed to the Kill_Animal function at runtime via pool.map, it does not match the object's own ID. I am familiar with Python's call-by-object reference, but what is happening here?

fabio.avigo
    `multiprocessing` does not share state. It is literally multiple different python processes. – juanpa.arrivillaga Sep 24 '18 at 20:38
  • @juanpa.arrivillaga I see. So what would be the correct way to do this, if I wanted to modify an instance of a class (not replace it) with multiprocessing? – fabio.avigo Sep 24 '18 at 20:44
  • The ideal way is to refactor your code *not* to require shared state. I would read through the [documentation](https://docs.python.org/3.7/library/multiprocessing.html#sharing-state-between-processes) to see what options you do have for sharing state. – juanpa.arrivillaga Sep 24 '18 at 20:58
  • @juanpa.arrivillaga Thank you. Yes, I have a class with a large number of modules for selenium web-parsing, and the serial execution works like a charm. I have been trying to add parallelism to it for improving the performance by running multiple browsers at once, but perhaps I've been looking at it through a wrong angle. – fabio.avigo Sep 24 '18 at 21:07
  • If it's for selenium, then threading might work. Not familiar with the library or the python bindings though. – juanpa.arrivillaga Sep 24 '18 at 21:20
  • 1
    You *can* share state across processes using various methods including multiprocessing's queues and managers, but as far as Selenium, you'll probably want to send job details to your processes and have them instantiate their own resources independently rather than attempting to pass objects around like this. Of course, be careful about this, since too many headless browsers is an easy way to invoke the ire of the OOM killer. – kungphu Sep 25 '18 at 00:53
  • @kungphu Thank you, that could probably work. How would I pass class instances to a queue and fetch the result back, once it's finished? The underlying problem here is that I truly don't understand how Python is handling things. Why does it operate on a different instance of my class than the one that I'm passing it? My selenium processes receive basic details and manage their resources independently, but I'm just not able to fetch the data that the instance of the class collected (and the instance in my main process remains unchanged). – fabio.avigo Sep 25 '18 at 12:26
  • @fabio.avigo You really want to minimize the data you're passing around. The [multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html) contains a lot of good examples and explains the restrictions and built-in utilities much better than I could. – kungphu Sep 25 '18 at 23:40

1 Answer


After studying the multiprocessing documentation, I understood where I had misinterpreted the concept.

With multiprocessing, even if an instance of a class is passed as an argument, it makes sense that its ID differs from the one in the calling method: the work now happens in a different process altogether, so the object is a copy of the original and does not occupy the same place in memory. Because of that, any changes made to the copy have no effect on the original instance.

In order to use parallelism and share state, a different approach must be applied: multithreading, as described in the thread-based parallelism documentation. The difference between multithreading and multiprocessing has been thoroughly discussed here: Multiprocessing vs Threading Python

Returning to the original question, there are two easy ways to loop through the list and apply the function:

1. Using multiprocessing.dummy:

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.

So the answer could be written as:

import multiprocessing.dummy as mp
p = mp.Pool(3) # With 3 being the number of threads.
p.map(Kill_Animal, AnimalsList)
p.close()
p.join()

[print(animal.isAlive) for animal in AnimalsList]

Output: False False False False False

2. Using a Queue:

from queue import Queue
from threading import Thread

# Worker: pulls animals from the queue and applies Kill_Animal.
def hunter():
    while True:
        animal = q.get()
        Kill_Animal(animal)
        q.task_done()

num_hunter_threads = 3
q = Queue()

# Start the daemon hunter threads.
for i in range(num_hunter_threads):
    t = Thread(target=hunter)
    t.daemon = True
    t.start()

# Add each animal in the list to the queue.
for animal in AnimalsList:
    q.put(animal)

# Block until every job in the queue has been processed.
q.join()

[print(animal.isAlive) for animal in AnimalsList]

Output: False False False False False

fabio.avigo