I have a dictionary of Python pandas dataframes. The total size of this dictionary is about 2 GB. However, when I share it across 16 multiprocessing workers (the subprocesses only read the data in the dict without modifying it), it takes 32 GB of RAM. So I would like to ask whether it is possible to share this dictionary across the processes without copying it. I tried converting it to manager.dict(), but that seems to take too long. What would be the most standard way to achieve this? Thank you.
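For reference, the manager.dict() attempt looks roughly like this (a simplified sketch; a toy dict stands in for the real 2 GB of dataframes):

from functools import partial
import multiprocessing as mp
import pandas as pd

def worker(shared, key):
    df = shared[key]            # every lookup round-trips through the manager process
    return len(df)

if __name__ == '__main__':
    # Toy stand-in for the real ~2 GB dictionary of dataframes
    dataframes = {i: pd.DataFrame({'x': range(1000)}) for i in range(100)}
    mgr = mp.Manager()
    shared = mgr.dict()
    for k, df in dataframes.items():
        shared[k] = df          # each assignment pickles the whole dataframe -> very slow at 2 GB
    with mp.Pool(16) as pool:
        results = pool.map(partial(worker, shared), list(shared.keys()))
    print(results[:5])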
-
This is an important question when starting to do larger-scale data science with Python, pandas and MPI. – user40780 Jan 29 '18 at 16:54
-
See if this helps - https://stackoverflow.com/q/6832554/2650427 – TrigonaMinima Jan 29 '18 at 18:57
-
Maybe you can use threads instead of processes. – TrigonaMinima Jan 29 '18 at 18:58
-
+1 You may be interested in this question here too [share-a-dict-with-multiple-python-scripts](https://stackoverflow.com/questions/48409610/share-a-dict-with-multiple-python-scripts/48494892#48494892) and the answer there [when-is-copy-on-write-invoked-for-python-multiprocessing-across-class-methods](https://stackoverflow.com/questions/38084401/when-is-copy-on-write-invoked-for-python-multiprocessing-across-class-methods) – Darkonaut Jan 31 '18 at 05:50
-
...and there [multiprocessing-module-showing-memory-for-each-child-process-same-as-main-proces](https://stackoverflow.com/questions/10369219/multiprocessing-module-showing-memory-for-each-child-process-same-as-main-proces) – Darkonaut Jan 31 '18 at 05:51
-
Not a direct solution for your problem, but you may find an alternative approach here: http://dask.pydata.org/ Dask promises to enable parallel computation on dataframe-like objects larger than memory. – fpbhb Feb 01 '18 at 18:02
-
For which OS are you searching a solution? – Darkonaut Feb 02 '18 at 16:15
-
@Darkonaut, macOS and Linux – user40780 Feb 02 '18 at 16:15
-
When you say it takes too long, do you mean that retrieval is too slow and your multiprocessing has to wait for each retrieval, or that it takes too long to set up? I'm also curious how big a slice (i.e., what percent of the overall size) of the dictionary you need to take at one time, and how often you need to grab the data (i.e., almost continuously, or is there significant processing taking place after you get the data). – bivouac0 Feb 02 '18 at 17:54
1 Answer
The best solution I've found (and it only works for some types of problems) is to use a client/server setup with Python's BaseManager and SyncManager classes. To do this you first set up a server that serves up a proxy class for the data.
DataServer.py
#!/usr/bin/python
from multiprocessing.managers import SyncManager
import numpy

# Global for storing the data to be served
gData = {}

# Proxy class to be shared with different processes.
# Don't put big data in here since that will force it to be piped to the
# other process when instantiated there; instead just return a portion of
# the global data when requested.
class DataProxy(object):
    def __init__(self):
        pass

    def getData(self, key, default=None):
        global gData
        return gData.get(key, default)

if __name__ == '__main__':
    port = 5000
    print('Simulate loading some data')
    for i in range(1000):
        gData[i] = numpy.random.rand(1000)
    # Start the server on address (host, port)
    print('Serving data. Press <ctrl>-c to stop.')
    class myManager(SyncManager): pass
    myManager.register('DataProxy', DataProxy)
    mgr = myManager(address=('', port), authkey=b'DataProxy01')
    server = mgr.get_server()
    server.serve_forever()
Run the above once and leave it running. Below is the client class you use to access the data.
DataClient.py
from multiprocessing.managers import BaseManager
import psutil   # 3rd party module for process info (not strictly required)

# Grab the shared proxy class. All methods in that class will be available here.
class DataClient(object):
    def __init__(self, port):
        assert self._checkForProcess('DataServer.py'), 'Must have DataServer running'
        class myManager(BaseManager): pass
        myManager.register('DataProxy')
        self.mgr = myManager(address=('localhost', port), authkey=b'DataProxy01')
        self.mgr.connect()
        self.proxy = self.mgr.DataProxy()

    # Verify the server is running (not required)
    @staticmethod
    def _checkForProcess(name):
        for proc in psutil.process_iter():
            if proc.name() == name:
                return True
        return False
Below is the test code to try this with multiprocessing.
TestMP.py
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy
from DataClient import *

# Confusing, but the "proxy" will be global to each subprocess;
# it's not shared across all processes.
gProxy = None
gMode = None
gDummy = None

def init(port, mode):
    global gProxy, gMode, gDummy
    gProxy = DataClient(port).proxy
    gMode = mode
    gDummy = numpy.random.rand(1000)  # Same size as the arrays stored in the server
    #print('Init proxy ', id(gProxy), 'in ', mp.current_process())

def worker(key):
    global gProxy, gMode, gDummy
    if 0 == gMode:    # get from proxy
        array = gProxy.getData(key)
    elif 1 == gMode:  # bypass retrieval to test the difference
        array = gDummy
    else:
        assert 0, 'unknown mode: %s' % gMode
    for i in range(1000):
        x = sum(array)
    return x

if __name__ == '__main__':
    port = 5000
    maxkey = 1000
    numpts = 100
    for mode in [1, 0]:
        for nprocs in [16, 1]:
            if 0 == mode: print('Using client/server and %d processes' % nprocs)
            if 1 == mode: print('Using local data and %d processes' % nprocs)
            keys = [numpy.random.randint(0, maxkey) for k in range(numpts)]
            pool = mp.Pool(nprocs, initializer=init, initargs=(port, mode))
            start = time.time()
            ret_data = pool.map(worker, keys, chunksize=1)
            print('  took %4.3f seconds' % (time.time() - start))
            pool.close()
When I run this on my machine I get...
Using local data and 16 processes
took 0.695 seconds
Using local data and 1 processes
took 5.849 seconds
Using client/server and 16 processes
took 0.811 seconds
Using client/server and 1 processes
took 5.956 seconds
Whether this works for you in your multiprocessing system depends on how often you have to grab the data. There's a small overhead associated with each transfer; you can see it if you turn down the number of iterations in the x = sum(array) loop. At some point you'll spend more time getting data than working on it.
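If your access pattern allows it, one way to reduce that overhead (just a sketch, not part of the code above) is to give DataProxy a batch method so a single round trip returns several entries:

# Sketch only: DataProxy from DataServer.py extended with a hypothetical batch method
class DataProxy(object):
    def getData(self, key, default=None):
        global gData
        return gData.get(key, default)

    def getDataBatch(self, keys):
        # One round trip returns many entries, amortizing the per-call pickling/socket cost
        global gData
        return {k: gData.get(k) for k in keys}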
Besides multiprocessing, I also like this pattern because I only have to load my big array data once in the server program and it stays loaded until I kill the server. That means I can run a bunch of separate scripts against the data and they execute quickly; no waiting for data to load.
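For example, a quick one-off script (a sketch that reuses the DataClient class above) can connect and pull a single entry without any multiprocessing:

#!/usr/bin/python
# Sketch: a standalone analysis script run against the already-running server
from DataClient import DataClient

client = DataClient(5000)            # same port the server was started on
array = client.proxy.getData(42)     # fetch one entry; no data-loading wait in this script
print('key 42: len=%d sum=%.3f' % (len(array), sum(array)))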
While the approach here is somewhat similar to using a database, it has the advantage of working on any type of Python object, not just simple DB tables of strings, ints, etc. I've found that using a DB is a bit faster for those simple types, but for me it tends to be more work programmatically, and my data doesn't always port over easily to a database.
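For the dictionary of pandas dataframes in the question, the same pattern applies; below is a rough sketch of the server-side pieces (the file names and the getColumn method are made up for illustration):

# Sketch only: adapting the server side to serve a dictionary of dataframes
import pandas as pd

gData = {}   # same role as in DataServer.py

def loadDataFrames():
    for name in ['trades', 'quotes']:                 # made-up file names
        gData[name] = pd.read_csv(name + '.csv')      # the full ~2GB stays in the server process

class DataFrameProxy(object):
    def getColumn(self, key, column):
        # Return only what the worker needs so each transfer stays small
        global gData
        return gData[key][column]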

-
I should note that I have this implemented in a simple library at... https://github.com/bjascob/PythonDataServe – bivouac0 Jan 31 '20 at 14:47