
I'm trying to find a reasonable approach in Python for a real-time application that involves multiprocessing and large files.

A parent process spawns 2 or more children. The first child reads data and keeps it in memory, and the others process it in a pipeline fashion: the data should be organized into an object, sent to the following process, processed, sent on, processed, and so on.
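For concreteness, here is a rough sketch of the Queue-based version of what I have in mind (the reader and the processing step are just stand-ins for the real work):

import multiprocessing as mp

def reader(out_q, n_chunks=5):
    # Stand-in for the real reader: produce chunks and push them downstream.
    for i in range(n_chunks):
        out_q.put({'id': i, 'payload': b'x' * 1024})  # every put() pickles the whole object
    out_q.put(None)                                   # sentinel: end of stream

def worker(in_q, out_q):
    # One pipeline stage: receive, process, forward.
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        item['size'] = len(item['payload'])           # stand-in for the real processing
        out_q.put(item)

if __name__ == '__main__':
    q1, q2 = mp.Queue(), mp.Queue()
    stages = [mp.Process(target=reader, args=(q1,)),
              mp.Process(target=worker, args=(q1, q2))]
    for p in stages:
        p.start()
    while True:
        result = q2.get()
        if result is None:
            break
        print(result['id'], result['size'])
    for p in stages:
        p.join()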

Available mechanisms such as Pipe, Queue, and Managers seem inadequate due to overhead (serialization, etc.).

Is there an adequate approach for this?

NovoRei
    First, "seem not adequate" isn't a good argument without either some profiling numbers, or an analysis of why you expect serialization to take longer than the actual work. It's often faster and easier to just come up with some good sample data, build a proof of concept, and test it. If it turns out that you're spending only 2% of the time in pickling and sending stuff over the queue, you don't have a problem in the first place. – abarnert Oct 24 '14 at 19:03

5 Answers


I've used Celery and Redis for real-time multiprocessing in high-memory applications, but it really depends on what you're trying to accomplish.

The biggest benefits I've found in Celery over the built-in multiprocessing tools (Pipe/Queue) are listed below, with a minimal task sketch after the list:

  • Low overhead. You call a function directly, no need to serialize data.
  • Scaling. Need to ramp up worker processes? Just add more workers.
  • Transparency. Easy to inspect tasks/workers and find bottlenecks.
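Here's a minimal sketch of what one pipeline stage could look like as a Celery task. The broker URL, module name, and task body are placeholders; adjust them for your setup.

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def process_chunk(chunk):
    # Stand-in for one pipeline stage. If you need a multi-stage pipeline,
    # Celery's canvas (chains) lets you string tasks together.
    return len(chunk)

# From the producing process:
#   process_chunk.delay(some_data)   # returns immediately; a worker picks it up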

For really squeezing out performance, ZMQ is my go-to. It's a lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
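A bare-bones 0MQ PUSH/PULL pipeline stage with pyzmq might look like the sketch below. The addresses and the work function are placeholders, and it assumes the upstream stage binds the PUSH side it connects to.

import zmq

def stage(pull_addr, push_addr, work):
    # One pipeline stage: pull raw messages in, push results out to the next stage.
    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect(pull_addr)      # upstream stage binds and PUSHes here
    sender = ctx.socket(zmq.PUSH)
    sender.bind(push_addr)           # downstream stage connects and PULLs from here
    while True:
        msg = receiver.recv()        # raw bytes; you decide how (or whether) to serialize
        sender.send(work(msg))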

Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.

nathancahill
  • These technologies are still serializing data; you can't get the arguments to a function over the wire without passing them over the wire. They're just forcing you not to do it in a naive way that ends up pickling a whole bunch of complicated stuff, and generally being more efficient in the way they do pass your stuff. Well, "just" is a bit of a loaded word; both are a huge benefit in practice, but you know what I mean. – abarnert Oct 24 '14 at 19:19

First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept, come up with some sample data, and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real-life code (assuming the amount of work your proof of concept does is reasonably comparable to your real code). But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
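For instance, a throwaway script along these lines will tell you quickly whether the queue/pickling overhead is even in the same ballpark as your real work (the chunk size and count here are arbitrary):

import multiprocessing as mp
import time
import numpy as np

def consumer(q):
    # Drain the queue; unpickling happens on this side.
    while q.get() is not None:
        pass

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    chunk = np.zeros((1000, 1000))            # ~8 MB per message
    t0 = time.perf_counter()
    for _ in range(100):
        q.put(chunk)                          # pickled and copied through a pipe
    q.put(None)
    p.join()
    print('seconds to push 100 x ~8 MB through a Queue:',
          time.perf_counter() - t0)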

Also, even if you do identify a problem here, that doesn't mean you have to abandon message passing; it may just be a problem with what's built into multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.


But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.

Sharing state between processes describes the basics of how to do it. There are other alternatives, like using mmapped files or platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.

There are two big problems to deal with, but both can be dealt with.

First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use multiprocessing.sharedctypes (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes), but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.
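As a sketch of the NumPy case, a multiprocessing.Array can be wrapped as an ndarray on both sides, so the data itself is never pickled (the array size and the doubling step are just for illustration):

import multiprocessing as mp
import numpy as np

def worker(shared_arr, n):
    # Re-wrap the shared buffer as an ndarray: no copy, no pickling of the data itself.
    arr = np.frombuffer(shared_arr.get_obj(), dtype=np.float64, count=n)
    arr *= 2                                  # modifies the shared memory in place

if __name__ == '__main__':
    n = 1_000_000
    shared_arr = mp.Array('d', n)             # 1D array of C doubles, with an attached lock
    arr = np.frombuffer(shared_arr.get_obj(), dtype=np.float64, count=n)
    arr[:] = np.arange(n)
    p = mp.Process(target=worker, args=(shared_arr, n))
    p.start()
    p.join()
    print(arr[:4])                            # [0. 2. 4. 6.]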

Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
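Here's a minimal sketch of the locking side of that, using a shared counter (the counts are arbitrary):

import multiprocessing as mp

def increment(counter, n):
    for _ in range(n):
        with counter.get_lock():              # guard the read-modify-write against races
            counter.value += 1

if __name__ == '__main__':
    counter = mp.Value('i', 0)                # shared int with an attached lock
    procs = [mp.Process(target=increment, args=(counter, 10_000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)                      # 40000 every time, thanks to the lock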

abarnert
  • I want to share numpy random state of a parent process with a child process. I've tried using `Manager` but still no luck. Could you please take a look at my question [here](https://stackoverflow.com/questions/49372619/how-to-share-numpy-random-state-of-a-parent-process-with-child-processes) and see if you can offer a solution? I can still get different random numbers if I do `np.random.seed(None)` every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated. – Amir Mar 20 '18 at 02:34

For a multiprocessing pipeline, check out MPipe.

For shared memory (specifically NumPy arrays), check out numpy-sharedmem.

I've used these to do high-performance, real-time, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.
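From memory, the basic MPipe pattern looks roughly like the sketch below (a two-stage pipeline with stand-in stage functions); check the MPipe docs for the exact API.

from mpipe import OrderedStage, Pipeline

def read_chunk(i):
    return b'x' * 1024                        # stand-in for the real reader

def measure(chunk):
    return len(chunk)                         # stand-in for the real processing

# Each stage runs in its own pool of worker processes.
stage1 = OrderedStage(read_chunk, 2)
stage2 = OrderedStage(measure, 2)
stage1.link(stage2)
pipe = Pipeline(stage1)

for i in range(10):
    pipe.put(i)
pipe.put(None)                                # signal end of input

for result in pipe.results():
    print(result)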

Velimir Mlaker

One option is to use something like brain-plasma, which maintains a shared-memory object namespace that is independent of any Python process or thread. It's kind of like Redis, but it can be used with big objects and has a simple API; it's built on top of Apache Arrow.

$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657
russellthehippo

Python 3.8 now offers shared memory access between processes using multiprocessing.shared_memory. All you hand off between processes is a string that references the shared memory block. In the consuming process you get a memoryview object, which supports slicing without copying the data the way byte arrays do. If you are using NumPy, it can reference the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand, generic objects still need to be deserialized, since a raw byte array is what the consuming process receives.
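Here's a minimal sketch of that pattern with NumPy (the array size and contents are arbitrary):

from multiprocessing import Process, shared_memory
import numpy as np

def consumer(name, shape, dtype):
    # Attach to the existing block by name; only this short string crossed the process boundary.
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)   # zero-copy view
    print(arr[:4])
    shm.close()

if __name__ == '__main__':
    data = np.arange(1_000_000, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    arr = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    arr[:] = data                              # one copy into the shared block
    p = Process(target=consumer, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()
    shm.close()
    shm.unlink()                               # release the block when done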

David Parks