
I am working on a research project and want to apply parallelization to improve execution speed. I have worked with the multiprocessing library before, but only for number crunching. I will briefly describe the setting and my goal; I am primarily hoping for ideas from folks more experienced with multiprocessing concepts.

The project:

The project is a multi-echelon supply chain simulation (a multi-level distribution network) where reorder decisions are made periodically at each location, based on the incoming demand. A toy example looks as follows:

  Level 3               Level 2                 Level 1             Level 0

                                         --- Local Warehouse 1
                                        |
             --- Central Warehouse 1 --
            |                           |
            |                            --- Local Warehouse 2
            |
Supplier --                                                        Customer
            |                            --- Local Warehouse 3
            |                           |
             --- Central Warehouse 2 --
                                        |
                                         --- Local Warehouse 4

The simulation objects (simplified) look as follows:

class Simulation:
    def __init__(self):
        self.locations = dict()     # dict of all locations
        self.customer = Customer()  # object periodically ordering at the deepest-level (local) warehouses
        self.levels = {0: [], 1: []}  # locations grouped by depth in the network graph

    def run(self):
        for period in range(simulation_length):
            for level in self.levels.values():
                for location in level:
                    location.review()  # review received orders and issue an order if required

class Location:
    def __init__(self):
        self.orders = []  # list of received orders

    def review(self):
        # decides, based on the received orders, whether a reorder is required
        ...

    def order(self, order, other_location):
        simulation.locations[other_location].orders.append(order)

So the procedure looks as follows:

  1. Customer (Level 0) issues orders to Local Warehouses (Level 1)
  2. Local Warehouses (Level 1) review orders and issue orders to Central Warehouses (Level 2)
  3. And so on, until Supplier
  4. Next Period

My Problem / Idea

For now I have a dict of all warehouses belonging to a particular level of the supply chain, and each period I iterate over the warehouses level by level, in order, so that dependencies are met. The number of levels is quite limited, but the number of warehouses per level is quite large and the review logic can be computationally intensive. My plan is therefore to review all warehouses belonging to the same level in parallel.

However, since a location uses the function order(self, order, other_location), which accesses an attribute of another object within the simulation object, I would need to share the whole simulation object between the processes.

Ideas and approaches:

  1. Put the simulation object in shared memory and use a Lock on the object whenever an order is placed (all other actions within the review are purely read operations)
  2. Instead of placing the orders directly, put them in a Queue to the main process and, once all warehouses within a level have returned, let the main process execute the order functions (which are computationally inexpensive)
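
To make idea (2) more concrete, here is a rough sketch of what I have in mind, building on the simplified classes above (review() is assumed here to return the orders to place instead of sending them itself, which is not how my real code works yet):

from multiprocessing import Pool

def review_location(location):
    # review() is assumed to return the orders this location wants to place,
    # as (destination_name, order) tuples, instead of appending to other
    # locations directly; changes to the location's own state would also
    # have to be returned and written back
    return location.review()

if __name__ == "__main__":
    simulation = Simulation()
    simulation_length = 10  # example horizon
    with Pool() as pool:
        for period in range(simulation_length):
            for level in simulation.levels.values():
                # review all warehouses of one level in parallel;
                # each location is pickled and sent to a worker
                per_location_orders = pool.map(review_location, level)
                # apply the (computationally cheap) order placements
                # sequentially in the main process
                for placed_orders in per_location_orders:
                    for destination, order in placed_orders:
                        simulation.locations[destination].orders.append(order)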

Problem with (1):

From all my research, it seems that only ctypes objects (Value and Array) can be put in shared memory, and I couldn't figure out how to do that for my objects. The only other thing I read about was the multiprocessing Manager, but another Stack Overflow question Link said that it does not work with nested objects.
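
For illustration, this is the nested-object behaviour I am worried about, as far as I understand it (toy classes, not my real ones): mutating an object stored inside a managed dict only changes a local copy, so the append is silently lost.

from multiprocessing import Manager, Process

class Location:
    def __init__(self):
        self.orders = []

def worker(shared):
    # shared["wh1"] hands back an unpickled *copy* of the stored Location;
    # appending to that copy never reaches the manager process
    shared["wh1"].orders.append("order-1")

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()
        shared["wh1"] = Location()
        p = Process(target=worker, args=(shared,))
        p.start()
        p.join()
        print(shared["wh1"].orders)  # prints [] - the nested append is lost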

Problem with (2):

As each warehouse object changes between the periods (orders arrive, inventory changes, ...), I would have to hand the warehouse object over to the process each period in order for it to be up to date, which would create a large overhead (at least I think it would).

Conclusion

I hope it's clear what I want to achieve. Any hint, clarification or correction of a misunderstanding on my side would be great!

Edit with regard to the answer by @Roy2012:

Thanks for the answer. I will definitely take a look at Dask, as the ultimate goal is to utilize a cluster. With regard to the first hint, two implementations come to my mind, on which I would appreciate your advice. My Location objects need to both receive and send order objects; the sending part is controlled by the object itself, the receiving is not. Therefore my two options are:

  1. At the beginning of a period, spawn processes with the up-to-date location objects, do the calculations, do not send the orders directly but put them in a queue, and close the processes. When a whole level is done, the main process distributes the orders and spawns the processes for the next level, and so on. That leads to regularly spawning and closing processes, and depending on the simulation length, the location objects become rather large.

  2. Statically map locations to processes at the beginning, give each an incoming queue and an outgoing queue, and have the main process do the distribution of the orders, e.g. Process 1 (Location 1) sending an order to Process 2 (Location 2) would go Process 1 -> Main Process -> Process 2. In this case each process needs to be given a signal every time it should process its orders and run the routine (read queue -> recalculate -> send orders to queue).

(2) seems more sophisticated to me, but I have no feeling for the downsides, other than that collecting the results at the end has to be programmed (see the sketch below for the structure I have in mind). If it is important: the order objects are ~40 bytes in size, and a location object (warehouse) grows to about 15 MB throughout the run.
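
Rough sketch of option 2, building on the simplified classes above (the worker loop and the "stop" message are placeholders of mine, not existing code):

from multiprocessing import Process, Queue

def location_worker(location, inbox, outbox):
    # long-running process owning one location for the whole simulation
    while True:
        message = inbox.get()            # signal from the main process
        if message == "stop":
            break
        location.orders.extend(message)  # orders routed to this location this period
        placed = location.review()       # assumed to return (destination, order) tuples
        outbox.put(placed)               # the main process forwards them to the right inbox

if __name__ == "__main__":
    simulation = Simulation()
    outbox = Queue()
    inboxes = {name: Queue() for name in simulation.locations}
    workers = [Process(target=location_worker, args=(loc, inboxes[name], outbox))
               for name, loc in simulation.locations.items()]
    for w in workers:
        w.start()
    # per period and per level, the main process would put the routed orders
    # into each location's inbox, then read the placed orders from 'outbox'
    # and route them to the destination inboxes for the next level / period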

FloLie

1 Answer


A nice use case. Some thoughts/proposals:

  • Don't use shared memory. It's considered bad practice these days. People used to use shared memory for concurrency back in the day, but the modern approach is to avoid it as much as possible. The Go language, for example, provides some nice alternatives (see https://blog.golang.org/codelab-share). Another downside of shared memory is that you can't distribute your work across multiple machines.
  • Using Queues is generally much better. If the data you're moving back and forth between processes isn't huge - many (many) megabytes - the overhead would be negligible.
  • For your use case, you may want to consider using a distributed computation framework such as Dask. It provides easy means to collect results from sub-tasks, and only then start working on the next level in a hierarchy. Furthermore, it will allow you to distribute your work across an entire cluster, and not just a single machine.
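
For illustration, a minimal sketch of how the level-by-level pattern could look with dask.distributed, assuming a review() that returns the orders to place rather than applying them directly (adapt the names to your actual classes):

from dask.distributed import Client

def review_location(location):
    # heavy per-warehouse work; assumed to return the orders the location
    # wants to place as (destination, order) tuples
    return location.review()

if __name__ == "__main__":
    client = Client()  # local cluster by default; can point at a real cluster later
    simulation = Simulation()
    simulation_length = 10  # example horizon
    for period in range(simulation_length):
        for level in simulation.levels.values():
            futures = client.map(review_location, level)  # one task per warehouse
            for placed_orders in client.gather(futures):  # block until the level is done
                for destination, order in placed_orders:
                    simulation.locations[destination].orders.append(order)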

Hope this helps.

Update following some scale data:

The question states that the size of a location is 15MB, and the size of an order is ~40 bytes (significantly smaller).

Given that, it's obvious that if we optimize for low network traffic, we'll go for model #2, in which each location is a process that lives throughout the simulation, and communicates with other locations via queues and messages.

But - and it's a big but - running all the communication over queues seems like a more complex implementation. Creating a process with 15MB of data should take less than a second. If the computation at each location is nontrivial, it probably requires much more time than the process creation itself. For that reason, I would probably start with the simpler implementation (spawn a new process for every location).

In other words, it seems that building the entire system around queues is somewhat of a premature optimization.

One final note: there's a simulation package for Python called SimPy. I don't know how scalable it is, but it's probably worth having a look at.
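
Just to give a flavor of SimPy's process-based style, a tiny toy example (the warehouse process and the one-period delay are made up for illustration, not tailored to your model):

import simpy

def warehouse(env, name):
    # toy process that "reviews" once per period
    while True:
        print(f"period {env.now}: {name} reviews its orders")
        yield env.timeout(1)  # wait one simulated period

env = simpy.Environment()
env.process(warehouse(env, "Local Warehouse 1"))
env.process(warehouse(env, "Central Warehouse 1"))
env.run(until=3)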

Roy2012
  • Thank you for the answer. I updated the question with a comment to your answer and would appreciate your opinion on it – FloLie May 14 '20 at 08:21