
Problem

I am writing a piece of software in which I would like to share an object from a certain module. This object should be modifiable from different modules, and from within different processes. Consider the following (simplified) version of the problem:

Modules

module_shared.py

# Example class with simplified behaviour
class Shared:

    def __init__(self):
        self.shared = dict()

    def set(self, **kwargs):
        for key, value in kwargs.items():
            self.shared[key] = value

    def get(self, *args):
        return {key: self.shared[key] for key in args} if args else self.shared

# Module-scope instance of the Shared class
shared = Shared()

module_a.py

from multiprocessing import Process
from time import sleep
import module_shared as ms

def run():
    Process(target=run_process).start()

def run_process():
    i = 0
    while True:
        sleep(3)
        ms.shared.set(module_a=i)
        i += 1
        print("Shared from within module_a", ms.shared.get())

module_b.py

from multiprocessing import Process
from time import sleep
import module_shared as ms


def run():
    Process(target=run_process).start()

def run_process():
    i = 0
    while True:
        sleep(2)
        ms.shared.set(module_b=i)
        i -= 1
        print("Shared from within module_b", ms.shared.get())

module_main.py

import module_a
import module_b
import module_shared as ms
from time import sleep

if __name__ == '__main__':
    module_a.run()
    module_b.run()
    while True:
        sleep(5)
        print("Shared from within module_main", ms.shared.get())

Output

The output of running module_main is as follows:

Shared from within module_b {'module_b': 0}
Shared from within module_a {'module_a': 0}
Shared from within module_b {'module_b': -1}
Shared from within module_main {}
Shared from within module_a {'module_a': 1}
Shared from within module_b {'module_b': -2}
...

Expected output is as follows:

Shared from within module_b {'module_b': 0}
Shared from within module_a {'module_a': 0, 'module_b': 0}
Shared from within module_b {'module_a': 0, 'module_b': -1}
Shared from within module_main {'module_a': 0, 'module_b': -1}
Shared from within module_a {'module_a': 1, 'module_b': -1}
Shared from within module_b {'module_a': 1, 'module_b': -2}
...

Further explanation

The shared instance is not modified globally because each Process has its own memory space. Initially I tried to fix this using Manager from the multiprocessing module; however, I failed to set it up, presumably due to errors in when and how the import statements are executed. Here is the error message raised when calling Manager() in Shared's __init__:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

At the moment the best working solution is to use threading, but I would prefer to use processes instead. Naturally, if any simpler (or better) solutions exist, I would be very happy to consider them.

EDIT:

I have realised I made a typo in my previous attempt with threading; using multiple threads actually works perfectly fine. A great lesson in reading your code twice...
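
For reference, here is a minimal, self-contained sketch of the threading variant (names and the bounded loop are illustrative; the original code loops forever). Because threads share one address space, both workers see the same module-level shared instance:

```python
import threading
import time

# Same Shared class as in module_shared.py
class Shared:
    def __init__(self):
        self.shared = dict()

    def set(self, **kwargs):
        for key, value in kwargs.items():
            self.shared[key] = value

    def get(self, *args):
        return {key: self.shared[key] for key in args} if args else self.shared

shared = Shared()

def run_worker(name, step, iterations=3):
    # Bounded loop instead of `while True` so the sketch terminates
    i = 0
    for _ in range(iterations):
        time.sleep(0.01)
        shared.set(**{name: i})
        i += step

threads = [
    threading.Thread(target=run_worker, args=('module_a', 1)),
    threading.Thread(target=run_worker, args=('module_b', -1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared.get())  # both keys are visible: threads share memory
```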

Kacperito
  • Without specifics to the details of what happened when you tried to use a `Manager`, it's hard to help, but you might look at: https://stackoverflow.com/questions/11951750/sharing-object-class-instance-in-python-using-managers, https://stackoverflow.com/questions/20892977/shared-memory-complex-writable-data-structures, https://stackoverflow.com/questions/2227169/are-python-built-in-containers-thread-safe – mostsquares Mar 05 '19 at 19:48
  • A solution that I gravitate to for things like this is to have one thread that manages the data structure and broadcasts it back to the others using inter-process communication. I like the library ZeroMQ for that: https://pyzmq.readthedocs.io/en/latest/api/zmq.html – mostsquares Mar 05 '19 at 19:51
  • Another alternative would be running a little `sqlite` database server on your computer and having the processes talk to that. – mostsquares Mar 05 '19 at 19:52
  • Thank you @CharlieWindolf. I'll have a look at the resources you have suggested tomorrow. I don't think SQL-type of data management (or any other file-related solution) would work because of performance issues. I believe the errors related to using the `Manager` were with its initialisation within a module scope, due to some issues with the process execution (I can recreate them later and post complete trace). – Kacperito Mar 05 '19 at 20:13
  • @CharlieWindolf I have included the error thrown when calling `Manager()`. I also realised that threading actually worked from the start, but my test code had a typo, so I presume the ZeroMQ library will not be needed (unless it offers a big performance boost compared to `threading` and `multiprocessing` libraries?). If possible, could you please include a working example with the `Manager`? I don't think I understand how is it supposed to work in the end. – Kacperito Mar 06 '19 at 18:56
  • Did you try following the advice that the error gives? Namely, put `import multiprocessing; multiprocessing.freeze_support()` right after `if __name__ == '__main__':` in `module_main.py`? – mostsquares Mar 06 '19 at 19:00
  • zmq could help you improve performance of your code *if* the performance problem is the inter-process communication, but I would make sure that that's actually the problem before looking at it. zeromq is used alongside `multiprocessing`/whatever parallel processing lib you end up with. (Don't use `threading`) – mostsquares Mar 06 '19 at 19:03
  • @CharlieWindolf I have, I tried using it in both `module_main.py` and in `module_shared.py` (within module scope) but I think it will always keep throwing this error because of importing `module_shared.py` in different modules. – Kacperito Mar 06 '19 at 19:04
  • The complete trace suggests that first main imports `module_a`, which imports `module_shared` and creates the `shared` object. Just after that, when the `Manager`'s `start()` is called, `_check_not_importing_main()` fails (because the import is not forking the `Manager` process?). – Kacperito Mar 06 '19 at 19:10

2 Answers


One approach would be to use one of the various caching/persistence modules: diskcache, shelve, and of course pickle all offer the ability to persist objects.

For example, using the diskcache library, you could take this approach, replacing your module_shared.py with:

### DISKCACHE Example ###
from diskcache import Cache

cache = Cache('test_cache.cache')

# Example class with simplified behaviour
class Shared:

    def __init__(self, cache):
        self.cache = cache
        self.cache.clear()

    def set(self, **kwargs):
        for key, value in kwargs.items():
            self.cache.set(key, value)

    def get(self, *args):
        if args:
            return {key: self.cache.get(key) for key in args}
        # NB: the no-args case returns a set of (key, value) tuples, not a dict
        return {(key, self.cache.get(key)) for key in self.cache.iterkeys()}


# Module-scope instance of the Shared class
shared = Shared(cache)

Output:

Shared from within module_b {('module_b', 0)}
Shared from within module_a {('module_a', 0), ('module_b', 0)}
Shared from within module_b {('module_a', 0), ('module_b', -1)}
Shared from within module_main {('module_a', 0), ('module_b', -1)}
Shared from within module_a {('module_b', -1), ('module_a', 1)}
Shared from within module_b {('module_b', -2), ('module_a', 1)}

In the above example, module_shared.py is the only file that changes.

Each of the various persistence libraries/approaches has its own quirks and capabilities. If you absolutely need to persist the entire class instance object, that's in there. :) Performance depends on your implementation and choice of caching mechanism. diskcache has proven quite capable for me.

I've implemented diskcache very simply here to demonstrate the functionality. Be sure to read the docs, which are clear and concise, for a better understanding.

Also, my output presents an unordered set of (key, value) pairs rather than a dict. You could easily sort that to match your expected output, with module_a consistently first. I left that bit out for simplicity.
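
For illustration, here is one way to rebuild a sorted dict from that set-of-tuples form (the sample data is hypothetical):

```python
# Hypothetical sample of the set-of-(key, value)-tuples output shown above
pairs = {('module_b', -1), ('module_a', 0)}

# sorted() orders tuples by their first element (the key),
# and dict() rebuilds a dictionary in that order
result = dict(sorted(pairs))
print(result)  # {'module_a': 0, 'module_b': -1}
```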

Chris Larson
  • Thank you for the solution and a link to such a great and simple to use tool. I'm a bit worried as the docs mention that "want read:write at 10:1 or higher" would be the most suitable, whereas my code will be performing a significant number of writes compared to reads. – Kacperito Mar 06 '19 at 19:00
  • Regarding `diskcache`, that's a good question. I've asked on the github repository if that comment is still valid 2 years down the road. There are, as I mentioned, several caching/persistence solutions in and for python, and I simply used `diskcache` to illustrate how they could offer your solution. :) There may be a better solution for your specific needs. It may be that a bit of benchmarking/testing will show that it solves the problem nicely, or suggest a better option. It's a good question. Thanks for posing it. Cheers! – Chris Larson Mar 06 '19 at 21:53
  • Here's the response from the author: https://github.com/grantjenks/python-diskcache/issues/104 – Chris Larson Mar 07 '19 at 03:03
  • I have tested it today and it works really well, thank you! `FanoutCache` is awesome for handling a number of parallel writes, and the tool supports standard dictionary indexing so I only had to change like 1 or 2 lines of code to make it work. – Kacperito Mar 07 '19 at 22:52
  • Excellent! Glad I could help. Bonus: The author of `diskcache` is super-responsive if you have questions. Cheers! – Chris Larson Mar 08 '19 at 01:32

Looking at the documentation for custom Manager objects, here's an idea.

Add these lines to module_shared.py:

from multiprocessing.managers import BaseManager

class SharedManager(BaseManager):
    pass

SharedManager.register('Shared', Shared)
manager = SharedManager()
manager.start()
shared = manager.Shared()

(Get rid of the old definition of shared)

Running this on my computer produced

$ python module_main.py 
Shared from within module_b {'module_b': 0}
Shared from within module_a {'module_b': 0, 'module_a': 0}
Shared from within module_b {'module_b': -1, 'module_a': 0}
Shared from within module_main {'module_b': -1, 'module_a': 0}
Shared from within module_a {'module_b': -1, 'module_a': 1}
Shared from within module_b {'module_b': -2, 'module_a': 1}
Shared from within module_b {'module_b': -3, 'module_a': 1}
Shared from within module_a {'module_b': -3, 'module_a': 2}
Shared from within module_main {'module_b': -3, 'module_a': 2}
Shared from within module_b {'module_b': -4, 'module_a': 2}
...etc

which looks to me like the expected result.

It's a little weird that module_shared.py starts a process (the manager.start() line), since we don't typically expect modules to do anything on import, but within the constraints of the question I think this is the only way to do it. If I were writing this for myself, I'd create the manager in module_main instead of module_shared (perhaps using the context manager described in the documentation link above instead of the .start() method) and pass it as a function argument to the run methods of module_a and module_b.
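
A single-file sketch of that alternative, assuming the Shared class from the question (the run_worker function and key/value arguments are illustrative stand-ins for module_a.run and module_b.run):

```python
from multiprocessing import Process
from multiprocessing.managers import BaseManager

# The Shared class from module_shared.py
class Shared:
    def __init__(self):
        self.shared = dict()

    def set(self, **kwargs):
        for key, value in kwargs.items():
            self.shared[key] = value

    def get(self, *args):
        return {key: self.shared[key] for key in args} if args else self.shared

class SharedManager(BaseManager):
    pass

SharedManager.register('Shared', Shared)

def run_worker(shared, key, value):
    # Stands in for module_a.run / module_b.run: the proxy arrives as an argument
    shared.set(**{key: value})

def demo():
    # BaseManager supports the context-manager protocol:
    # start() on enter, shutdown() on exit
    with SharedManager() as manager:
        shared = manager.Shared()
        procs = [
            Process(target=run_worker, args=(shared, 'module_a', 0)),
            Process(target=run_worker, args=(shared, 'module_b', -1)),
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return shared.get()  # copies the dict out of the manager process

if __name__ == '__main__':
    print(demo())
```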

You might also be interested in SyncManager, a subclass of BaseManager that comes with many basic types already registered, including dict, which basically covers the functionality needed here.
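
For example, a minimal sketch using the manager-backed dict that multiprocessing.Manager() (a started SyncManager) provides, with an illustrative worker function:

```python
from multiprocessing import Manager, Process

def set_value(shared, key, value):
    # Runs in a child process; writes go to the dict in the manager process
    shared[key] = value

def demo():
    with Manager() as manager:
        shared = manager.dict()
        procs = [
            Process(target=set_value, args=(shared, 'module_a', 0)),
            Process(target=set_value, args=(shared, 'module_b', -1)),
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(shared)  # copy out before the manager shuts down

if __name__ == '__main__':
    print(demo())
```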

mostsquares
  • I am avoiding the reference passing idea due to the overall messiness of the code. In such case, whenever trying to use the shared object in a function, a reference to it would have to be passed as an argument. In case of multiple objects and extensive usage of them, each function would have multiple, additional arguments that are only used to store the reference information. The module imports are far cleaner and easier to understand. – Kacperito Mar 06 '19 at 21:02
  • No, still getting the same issue. `manager.start()` is causing the error, and trying to start it in main won't work because the `manager.Shared()` call requires a started manager. – Kacperito Mar 07 '19 at 21:15
  • Really, that's weird -- this worked on my machine. You have `manager.start()` in `module_shared.py` like described here? I'm running python3, are you? Hmmm.... – mostsquares Mar 07 '19 at 22:35