I am writing my first multiprocessing program using Python 3.8. I have a large dataframe that I want all the worker processes to use; each process needs only read-only access to it.
I have read that shared memory is one way of achieving this. However, I'm not sure exactly how to let each process access my large dataframe through shared memory.
So, in my head, the first step looks something like this:
from multiprocessing import shared_memory

# my dataframe; I believe I need to know its size in bytes before
# creating a shared memory block of the right size
df_bytes = df.memory_usage(index=True).sum()
shm = shared_memory.SharedMemory(create=True, size=df_bytes)
memory_name = shm.name
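(One thing I'm not sure about: for object/string columns I think memory_usage needs deep=True to report the real size; without it, it only counts the pointers, so I would under-allocate the block. Something like:)

# deep=True inspects the actual contents of object columns; without it
# the reported size is just the pointers, not the strings themselves
df_bytes = int(df.memory_usage(index=True, deep=True).sum())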
From my limited understanding, the shared memory block is assigned a name, which is what other processes use to attach to it later.
Then, in the function run by each worker process, I attach to the existing block by that name:
existing_shm = shared_memory.SharedMemory(name=memory_name)
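As far as I can tell, though, all this gives me back is a raw buffer of bytes, not a dataframe. A minimal sketch of what I mean (assuming memory_name holds the name from the first snippet):

from multiprocessing import shared_memory

# attach to the block created by the parent process
existing_shm = shared_memory.SharedMemory(name=memory_name)

# .buf is just a memoryview over raw bytes -- there is no dataframe here
print(type(existing_shm.buf))  # <class 'memoryview'>
print(len(existing_shm.buf))   # the size allocated at creation

existing_shm.close()  # detach without destroying the block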
But I am not sure how this lets me reference my large dataframe. I was trying to follow the example below, but it's for NumPy, and I don't want to have to convert my dataframe to a NumPy array. It also looks like they have to re-create the NumPy array in each process, which doesn't make sense to me.
>>> # In the first Python interactive shell
>>> import numpy as np
>>> a = np.array([1, 1, 2, 3, 5, 8]) # Start with an existing NumPy array
>>> from multiprocessing import shared_memory
>>> shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
>>> # Now create a NumPy array backed by shared memory
>>> b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
>>> b[:] = a[:] # Copy the original data into shared memory
>>> b
array([1, 1, 2, 3, 5, 8])
>>> type(b)
<class 'numpy.ndarray'>
>>> type(a)
<class 'numpy.ndarray'>
>>> shm.name # We did not specify a name so one was chosen for us
'psm_21467_46075'
>>> # In either the same shell or a new Python shell on the same machine
>>> import numpy as np
>>> from multiprocessing import shared_memory
>>> # Attach to the existing shared memory block
>>> existing_shm = shared_memory.SharedMemory(name='psm_21467_46075')
>>> # Note that a.shape is (6,) and a.dtype is np.int64 in this example
>>> c = np.ndarray((6,), dtype=np.int64, buffer=existing_shm.buf)
>>> c
array([1, 1, 2, 3, 5, 8])
>>> c[-1] = 888
>>> c
array([  1,   1,   2,   3,   5, 888])
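For what it's worth, this is how I imagine the same pattern would look for my dataframe. This is a rough sketch of my understanding, assuming all columns share a single numeric dtype and that each worker already knows the shape, dtype, and column names (in reality I would presumably have to pass those along as well):

import numpy as np
import pandas as pd
from multiprocessing import shared_memory

# parent process: copy the dataframe's values into shared memory once
arr = df.to_numpy()  # only lossless if every column has the same dtype
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared[:] = arr[:]   # one-time copy into the shared block

# worker process: attach by name and wrap the shared buffer. The ndarray
# and DataFrame *objects* are re-created per process, but the underlying
# bytes are not copied -- they still live in the shared memory block.
existing_shm = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=existing_shm.buf)
df_view = pd.DataFrame(view, columns=df.columns)  # wraps the buffer, no copy for a 2-D homogeneous array

If that is roughly right, then "re-creating the array" in each process is really just re-wrapping the same shared bytes, not copying them, but I'd appreciate confirmation.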
Does anyone have any pointers on how to put a large dataframe into shared memory?