18

I'm trying to figure out a way to share memory between python processes. Basically there are objects that exist that multiple python processes need to be able to READ (only read) and use (no mutation). Right now this is implemented using redis + strings + cPickle, but cPickle takes up precious CPU time so I'd like to not have to use that. Most of the python shared memory implementations I've seen on the internets seem to require files and pickles, which is basically what I'm doing already and exactly what I'm trying to avoid.

What I'm wondering is whether there'd be a way to write, basically, an in-memory python object database/server and a corresponding C module to interface with the database?

Basically the C module would ask the server for an address to write an object to, the server would respond with an address, then the module would write the object and notify the server that an object with a given key was written at the specified location. Then when any of the processes wanted to retrieve an object with a given key, they would ask the db for its memory location, the server would respond with the location, and the module would know how to load that region of memory and transfer the python object back to the python process.

Is that wholly unreasonable or just really damn hard to implement? Am I chasing after something that's impossible? Any suggestions would be welcome. Thank you internet.

nickneedsaname
    Exactly how precious is your CPU time that it's worth dumping a working solution that's much less fiddly to keep synchronised than what you're proposing? What you're asking for can be done but it will be a huge pain in the ass to do *correctly*. – millimoose Jul 02 '12 at 23:50
  • CPU time is the most precious. Basically unpickling objects can take anywhere from 20 ms (for a small one) to 60 ms (for a big one). I personally feel that both of these times are too long. EDIT: Too long in the sense that there has to be a better way, not that i think cPickle isn't trying hard enough. – nickneedsaname Jul 03 '12 at 00:29
  • Sharing memory would be doable, but sharing objects will be seriously hard... A related question can be found here: http://stackoverflow.com/questions/1268252/python-possible-to-share-in-memory-data-between-2-separate-processes (there's a nice writeup by Alex Martelli explaining why this is hard). – ChristopheD Jul 03 '12 at 06:02
  • @f34r Is pickling known to be the primary or at least a significant bottleneck in your current codebase? If not, your CPU time is, in fact, in ample supply. You might be able to make a case for dumping Redis if your data isn't really persistent and replacing it with just sending pickled values directly. But my gut feeling is that you can't have a "share-nothing" distributed architecture without serialised messages, and they're much easier to reason about than shared memory systems. – millimoose Jul 03 '12 at 09:48
  • I suppose you could also look into the [sharing mechanisms of the `multiprocessing` module](http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes). (The first sentence of that paragraph recommends against doing so, of course.) – millimoose Jul 03 '12 at 09:54

4 Answers

10

From Python 3.8 onwards you can use multiprocessing.shared_memory.SharedMemory.
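For illustration, a minimal sketch (not part of the original answer) of one process publishing a NumPy array under a chosen name and a second process attaching to the same block; the block name "my_block" and the array contents are made up:

import numpy as np
from multiprocessing import shared_memory

# "Writer" side: create a named block and copy the data into it.
data = np.arange(10, dtype=np.int64)
shm = shared_memory.SharedMemory(name="my_block", create=True, size=data.nbytes)
buf = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
buf[:] = data[:]

# "Reader" side (normally another process): attach by name -- no pickling,
# just a view over the same memory.
existing = shared_memory.SharedMemory(name="my_block")
view = np.ndarray((10,), dtype=np.int64, buffer=existing.buf)
print(view.sum())

existing.close()
shm.close()
shm.unlink()  # free the block once every process is done with it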

Vinay Sharma
6

Not unreasonable.

IPC can be done with a memory mapped file. Python has functionality built in:

http://docs.python.org/library/mmap.html

Just mmap the file in both processes and hey-presto you have a shared file. Of course you'll have to poll it in both processes to see what changes. And you'll have to coordinate writes between the two. And decide what format you want to put your data in. But it's a common solution to your problem.
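A rough sketch of the idea; the file path and region size are arbitrary, and you'd still have to pick a byte layout for your data:

import mmap

SIZE = 1024  # arbitrary region size for this sketch

# Writer: create a file of the right size, map it, and write into the mapping.
with open("/tmp/shared.dat", "wb") as f:
    f.write(b"\x00" * SIZE)

with open("/tmp/shared.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), SIZE)
    mm[:5] = b"hello"   # other processes mapping the same file see this
    mm.flush()
    mm.close()

# Reader (typically another process): map the same file read-only.
with open("/tmp/shared.dat", "rb") as f:
    ro = mmap.mmap(f.fileno(), SIZE, access=mmap.ACCESS_READ)
    print(ro[:5])       # b'hello'
    ro.close()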

Joe
    But this would still require serializing to bytes, yes? The OP said he was trying to avoid that. – Ned Batchelder Jul 02 '12 at 23:49
  • This would require some kind of serialisation, yes. Perhaps custom serialisation would do a better job if the object type is known. Alternatively include a hash code to avoid deserialising the same object twice. However this is done, serialisation is required. – Joe Jul 02 '12 at 23:54
  • Are you sure? Network + redis would be comparatively expensive. Why not profile it? – Joe Jul 03 '12 at 00:28
  • Redis is running on the machine, it's not a network redis server – nickneedsaname Jul 03 '12 at 01:04
  • Still, you're either doing it over a domain socket or a local TCP socket, right? – Joe Jul 03 '12 at 06:11
3

If you don't want pickling, multiprocessing.sharedctypes might fit. It's a bit low-level, though; you get single values or arrays of specified types.
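A minimal sketch of that, assuming the shared data fits into plain C types (the values here are made up); the child reads the values out of shared memory rather than receiving a pickled copy of them:

from multiprocessing import Process
from multiprocessing.sharedctypes import Value, Array

def reader(n, a):
    # Reads straight from the shared block created by the parent.
    print(n.value, list(a))

if __name__ == '__main__':
    n = Value('i', 42)                 # one shared C int
    a = Array('d', [0.25, 0.5, 0.75])  # a shared array of C doubles
    p = Process(target=reader, args=(n, a))
    p.start()
    p.join()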

Another way to distribute data to child processes (one way) is multiprocessing.Pipe. That can handle Python objects, and it's implemented in C, so I cannot tell you whether it uses pickling or not.
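A small Pipe sketch for reference (the payload dict is arbitrary); for what it's worth, the multiprocessing docs note that an object passed to Connection.send() must be picklable, so serialisation still happens under the hood:

from multiprocessing import Process, Pipe

def child(conn):
    obj = conn.recv()          # receives a full Python object
    print('child got:', obj)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send({'key': [1, 2, 3]})   # object is serialised and sent one way
    p.join()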

Roland Smith
1

Python does NOT support shared memory between independent processes. You can implement your own in C, or use SharedArray if you are working with libsvm, numpy.ndarray, or scipy.sparse.

pip install SharedArray

import re
import SharedArray
import numpy as np


class Sarr (np.ndarray):
    def __new__ (cls, name, getData):
        # getData is either a callable that produces the array, or the array itself
        if getData is None:
            return None
        cls.orig_name = name
        shm_name = 'shm://' + re.sub(r'[./]', '_', name)
        try:
            # Attach to the block if another process already created it.
            shm = SharedArray.attach(shm_name)
            print('[done] reuse shared memory:', name)
            return shm
        except Exception:
            # First process to get here: build the data and publish it.
            cls._unlink(shm_name)
            data = getData() if callable(getData) else getData
            shm = SharedArray.create(shm_name, data.size)
            shm[:] = data[:]
            print('[done] loaded data to shared memory:', name)
            return shm

    @staticmethod
    def _unlink (name):
        try:
            SharedArray.delete(name[len('shm://'):])
            print('deleted shared memory:', name)
        except Exception:
            pass


def test ():
    def generateArray ():
        print('generating')
        from time import sleep
        sleep(3)
        return np.ones(1000)

    a = Sarr('test/1', generateArray)

    # b and c reuse the same memory as a; this also works from a new process.
    b = Sarr('test/1', generateArray)
    c = Sarr('test/1', generateArray)


if __name__ == '__main__':
    test()
Yin