
I have a process that makes an HTTP GET request to some API every second. This process then does some work with the JSON string returned from the request. At the same time, I want to pass this JSON string to another process to do something else.

A key point is that I want to keep appending each new JSON string to the end of an array so that all the JSON strings are saved in memory for the duration of the program.

I'm looking into using pipes or mmap to share the JSON strings, as these seem to be the fastest ways I know of. However, pipes seem to do more work than necessary, because I would have to copy each string into the pipe even though it is already in memory.

I thought that mmap would be able to avoid this problem if I could just map the array into both processes. I don't want the strings written to a real file on disk, so I would have to use an anonymous mapping, but it seems that anonymous mappings create a region that is zeroed, and then I would have to copy my array into the memory-mapped region to share the data. This also seems like unnecessary work, because I'm duplicating memory.

Is there a way that I can just give the 2nd process access to the array of strings with no duplication of the strings or copying? I can't use threads even though threads can share the memory easily because the process making the REST requests will be a C++ program while the 2nd process will be a Python program.

user3567004
  • When you say "large" what do you mean by that? Several kilobytes? Hundreds of kilobytes? Megabytes? – Some programmer dude May 28 '20 at 05:26
  • As for your problem, there are many ways of [inter-process communication](https://en.wikipedia.org/wiki/Inter-process_communication), some (like shared memory) which can work without duplicating data. You do need some kind of synchronization and interlocking in place for shared memory to work though. – Some programmer dude May 28 '20 at 05:27
  • @Someprogrammerdude The array will be a couple gigabytes by the time the program terminates. Also, is mmap not shared memory? I was under the impression that people used mmap to share memory. – user3567004 May 28 '20 at 05:28
  • You can use `mmap` to *map* a shared-memory segment into your process, but there needs to be some other work around it to create or load the actual shared memory segment. There are many tutorials available on how to do that. – Some programmer dude May 28 '20 at 05:34
  • @Someprogrammerdude I read the following post about sharing memory and the accepted answer seemed to imply that mmap was the modern way of sharing memory whereas the shared memory functions that start with "shm" in linux seem to be the old way of doing things. https://stackoverflow.com/questions/5656530/how-to-use-shared-memory-with-linux-in-c – user3567004 May 28 '20 at 05:36
  • As for the data and amount of it, do you really need to have *all* of the data in memory at once? What requirements do you have that force you to keep it all in memory? A simple JSON string should not be more than a few kilobytes at most, typically. How about a shared circular queue for the JSON messages? – Some programmer dude May 28 '20 at 05:36
  • In the accepted answer, you see the flag `MAP_ANONYMOUS`? That means it's an anonymous shared memory segment, which can only be used if you `fork` processes yourself from inside your program, and more importantly you need to keep track of the pointer returned by `mmap` (which must be called *before* the `fork`), as it's that pointer that needs to be used by all the forked child processes. If you have two independent processes, or load an independent program through an `exec` call, you need a *named* memory segment. – Some programmer dude May 28 '20 at 05:40

1 Answer


Take a look at Boost.Interprocess.

You create a memory-mapped file (or a named shared-memory segment), which is passed to a Boost.Interprocess allocator. The allocator hands out chunks of the mapped region, and these can be used as if they were returned by a std::allocator; the mapping takes care of making the shared memory usable like ordinary process-local memory.

A Boost.Interprocess container then uses the memory returned by this allocator. It has a std::-style container interface.

You also need to synchronize access to the data. This is done with an interprocess mutex, which prevents data races between the two processes.

Alternatively, the Message Passing Interface (MPI) is a technique for communicating between processes. It is usually used for large computations, such as car-crash simulations.

With MPI you start multiple processes, which usually compute the same thing on different input data.

MPI also offers the possibility to create an MPI window. With the help of this window, you can write directly into the memory of other MPI processes.

Thus, the answer to your question is yes, but it comes at some cost. The MPI scenario requires a launcher for your processes, and you can still run into deadlocks, data races, and so on.

schorsch312