
I’m trying to create a NumPy array that can be accessed by other processes on the same machine extremely fast. After a bunch of research and testing, I decided to try Python 3.8’s shared memory, with the hope that it would allow sharing large NumPy arrays between processes at sub-millisecond speed.

The implementation will be something like this:

  • On the first ipython shell (on the first process):
import time
import numpy as np
from multiprocessing.shared_memory import SharedMemory

arr = np.random.randint(0, 255, (5000, 5000, 4), dtype=np.uint8)
shm = SharedMemory(name='test', create=True, size=arr.nbytes)  # named block; 'test' must match the second process

shm_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)  # view backed by the shared block
shm_arr[:] = arr[:]  # copy the data into shared memory
  • On the second ipython shell (on the second process):
import numpy as np
import time
from multiprocessing.shared_memory import SharedMemory

shm = SharedMemory(name='test')  # attach to the block created in the first shell
shm_arr = np.ndarray((5000, 5000, 4), dtype=np.uint8, buffer=shm.buf)  # view into the same memory
# shm.close() should only be called once shm_arr is no longer in use;
# closing here fails because numpy still holds a view of the buffer
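For completeness, once both processes are done with the array, the cleanup would be roughly this (a minimal sketch using the same names as above):

# in the second process, when done reading:
del shm_arr   # drop the numpy view so the buffer is no longer exported
shm.close()   # detach this process from the block

# in the first process, when the block itself should be freed:
del shm_arr
shm.close()
shm.unlink()  # request removal of the shared memory block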

But there is a bottleneck here. To populate the shared NumPy array in the first process, I need to use:

shm_arr[:] = arr[:]

which means there is a memory-copy operation from arr to shm_arr. For a large array this can take a lot of time; the 5000 x 5000 x 4 array above takes about 55 ms for that assignment call alone. In my testing that makes the whole approach at best only about 20% faster than serializing the whole array with pickle. Reconstruction time in the second process, on the other hand, is around 5 ms, which is still not sub-millisecond.
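For reference, the copy cost can be measured with something like this (a minimal sketch; the numbers will vary by machine, and 'test_timing' is just a throwaway block name):

import time

import numpy as np
from multiprocessing.shared_memory import SharedMemory

arr = np.random.randint(0, 255, (5000, 5000, 4), dtype=np.uint8)
shm = SharedMemory(name='test_timing', create=True, size=arr.nbytes)
shm_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)

t0 = time.perf_counter()
shm_arr[:] = arr[:]  # the assignment being discussed
print(f"copy into shared memory: {(time.perf_counter() - t0) * 1e3:.1f} ms")

del shm_arr
shm.close()
shm.unlink()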

Question:

  • Is there anything I’m doing wrong here?
  • Is there any way to create the shared NumPy array in the first process without a memory copy, so it can be much faster?

Thanks for checking this out.

lamhoangtung
  • The memory copy `shm_arr[:] = arr[:]` is independent of `shm_arr` being backed by shared memory. It takes just as long to do this operation even if `shm_arr = np.empty_like(arr)`. The best solution here would be if you could specify the result buffer when calling `np.random.randint` and prevent any temporary memory allocation. – Kevin May 13 '21 at 10:01
  • Thanks @Kevin. Can you show me an example of how to specify the result buffer when calling `np.random.randint` to prevent any temporary memory allocation? – lamhoangtung May 13 '21 at 10:04
  • I am not sure if it is possible to specify the `out` keyword on `np.random.randint`. But is it necessary in your implementation to initialize the shared memory? – Kevin May 13 '21 at 10:08
  • Of course I need the shared memory so multiple processes can access the array at the same time. `np.random.randint` is just a test. In real life, my array would be constructed with something like `cv2.imdecode(np.frombuffer(image, np.uint8), -1)`, where `image` is bytes – lamhoangtung May 13 '21 at 10:13
  • By default the shared memory `shm_arr` is initialized with zeros, and I cannot see any solution for changing this initialization scheme. I don't know enough about your application, but maybe you only need to initialize the shared memory once. If (for the sake of argument) the shared memory was on another device (GPU), it would be very expensive to transfer memory like this back and forth, and in some cases it would cancel out the speedup from doing the actual computation. – Kevin May 13 '21 at 10:25
  • Even if I only initialize the shared memory once, every time I replace its value with a new array (assuming my array always has the same shape), how can I avoid that 55 ms assignment call caused by the memory copy? – lamhoangtung May 13 '21 at 10:41
  • I assume you are getting this new array from doing some computation on the shared memory? You can specify where the result buffer is and do the operations in-place. – Kevin May 13 '21 at 10:45
  • Nope. Picture that the first process is an array (image) reader: it can fetch new images extremely fast, it inits a shared-memory block for each array and puts the name of each shared array into a queue to another process. The other process gets the shared array name from the queue and accesses the image for computation. So actually the shared array should never be changed, it just needs to move from one process to another fast (a sketch of this layout is after the comments below). – lamhoangtung May 13 '21 at 10:52
  • Oh, I see. I have no idea sorry. This sounds a bit like a [dataloader](https://pytorch.org/docs/stable/data.html#multi-process-data-loading), but I have never done anything like that before. – Kevin May 13 '21 at 11:00
  • Yeah, I expect shared mem to be faster than dataloader since those are using serialization – lamhoangtung May 13 '21 at 11:35
  • @lamhoangtung perhaps take a look at a `cv2` [example I wrote for another question](https://stackoverflow.com/a/66522825/3220135). I was able to read a video file (much smaller dims than 5000x5000) at around 700 fps using basically zero copy. If you take a similar approach, I'd be willing to bet you've come up against the speed limit of Python itself – Aaron May 14 '21 at 21:22
  • Thanks @Aaron. Looking into your suggestion – lamhoangtung May 15 '21 at 17:08
  • @lamhoangtung don't expect miracles... you're talking about a lot of data per array (5k x 5k x 4 x 8 bytes = 0.74 GiB). Expecting sub-millisecond is pretty extreme even with absolutely minimal copy. Low latency with huge data usually starts edging into the realm of FPGA or ASIC hardware. Definitely not Python territory. – Aaron May 15 '21 at 17:29
  • For reference, if you want 1 kHz with that frame size, you're talking about as much memory bandwidth as an RTX 3080 (760 GB/s). (And that's theoretical performance on-chip with no communication to the CPU.) I don't really think your list of requirements is possible... (frame size x 1 kHz) – Aaron May 15 '21 at 17:44
  • **edit** my bad, I was looking at `uint8` as 8 bytes, not 8 bits... closer to doable, but still not easy. – Aaron May 15 '21 at 17:57
  • Thanks @Aaron. So `uint8` should be 8 bits, right? That works out to about 100 GB/s at 1 kHz for an absolutely minimal copy. A typical high-end Intel CPU has a max memory bandwidth of about 45.8 GB/s, so in theory I can expect around 500 Hz, right? – lamhoangtung May 16 '21 at 03:18
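Below is a rough sketch of the reader/consumer layout described in the comments above: one shared-memory block per image, with only the block name travelling through a queue. The encoded_frames input, the frame_{i} naming and the img.mean() call are placeholders for illustration, and the cleanup assumes a POSIX platform where a block persists until it is unlinked:

import numpy as np
import cv2  # assumed available, as in the cv2.imdecode snippet mentioned in the comments
from multiprocessing.shared_memory import SharedMemory


def reader(queue, encoded_frames):
    """Decode each frame into its own shared-memory block and publish the block name."""
    for i, image_bytes in enumerate(encoded_frames):
        frame = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), -1)
        shm = SharedMemory(name=f'frame_{i}', create=True, size=frame.nbytes)
        shm_arr = np.ndarray(frame.shape, dtype=frame.dtype, buffer=shm.buf)
        shm_arr[:] = frame                      # the copy discussed in the question
        queue.put((shm.name, frame.shape, str(frame.dtype)))
        del shm_arr
        shm.close()                             # detach; the consumer unlinks when done
    queue.put(None)                             # sentinel: no more frames


def consumer(queue):
    """Attach to each published block, run the computation, then release the block."""
    while (item := queue.get()) is not None:
        name, shape, dtype = item
        shm = SharedMemory(name=name)
        img = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        img.mean()                              # placeholder for the real computation
        del img
        shm.close()
        shm.unlink()                            # free the block once consumed

Wiring these up with multiprocessing.Process and multiprocessing.Queue is the usual pattern; only the block name crosses the queue, not the pixel data. Note that cv2.imdecode still allocates a temporary array before the copy into the block, which is exactly the copy the question asks about avoiding.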

0 Answers