After reading some SO posts, I've come up with a way to use OpenCV in Python 3 with multiprocessing. I recommend doing this on Linux, because according to this post, processes forked on Linux share memory with their parent as long as the content is not changed (copy-on-write).
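If you want to be explicit that the children are forked rather than spawned (a minimal sketch; fork is already the default start method on Linux):

import multiprocessing as mp

# fork is the default on Linux; with the spawn method the children re-import
# the module and do not share the parent's memory, so img would be re-read
mp.set_start_method('fork')

Here's a minimal example: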
import cv2
import multiprocessing as mp
import numpy as np
import psutil

img = cv2.imread('test.tiff', cv2.IMREAD_ANYDEPTH)  # here I'm using an indexed 16-bit tiff as an example
num_processes = 4
kernel_size = 11
tile_size = img.shape[0] // num_processes  # assuming img.shape[0] is divisible by num_processes
output = mp.Queue()

def mp_filter(x, output):
    print(psutil.virtual_memory())  # monitor memory usage
    # note that you actually have to process a slightly larger block and leave out the border
    output.put((x, cv2.GaussianBlur(img[tile_size * x:tile_size * (x + 1), :],
                                    (kernel_size, kernel_size), kernel_size / 5)))
if __name__ == '__main__':
    processes = [mp.Process(target=mp_filter, args=(x, output)) for x in range(num_processes)]
    for p in processes:
        p.start()
    result = []
    for ii in range(num_processes):
        result.append(output.get(True))
    for p in processes:
        p.join()
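The tiles come off the queue in whatever order the workers finish, which is why each item carries its tile index. A minimal sketch of stitching the filtered image back together (assuming the same names as above):

result.sort(key=lambda item: item[0])                # order the tiles by index
filtered = np.vstack([tile for _, tile in result])   # reassemble the full image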
Instead of using a Queue, another way to collect the results from the processes is to create a shared array through the multiprocessing module (you have to import ctypes):
result = mp.Array(ctypes.c_uint16, img.shape[0] * img.shape[1], lock=False)
Then each process can write to a different portion of the array, assuming there is no overlap. Creating a large mp.Array is surprisingly slow, however, which actually defeats the purpose of speeding up the operation. So use it only when the added time is small compared with the total computation time. This array can be turned into a numpy array by:
result_np = np.frombuffer(result, dtype=ctypes.c_uint16)
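For completeness, here is a sketch of the shared-array variant, reusing img, num_processes, tile_size and kernel_size from above (the worker name mp_filter_shared is mine; this again relies on the children being forked so they inherit the shared array):

import ctypes

shared = mp.Array(ctypes.c_uint16, img.shape[0] * img.shape[1], lock=False)

def mp_filter_shared(x):
    # view the shared buffer as a 2D numpy array; no data is copied
    out = np.frombuffer(shared, dtype=np.uint16).reshape(img.shape)
    out[tile_size * x:tile_size * (x + 1), :] = cv2.GaussianBlur(
        img[tile_size * x:tile_size * (x + 1), :],
        (kernel_size, kernel_size), kernel_size / 5)

if __name__ == '__main__':
    processes = [mp.Process(target=mp_filter_shared, args=(x,)) for x in range(num_processes)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    result_np = np.frombuffer(shared, dtype=np.uint16).reshape(img.shape)

Since every worker writes only to its own slice of rows, no synchronization is needed, which is why lock=False is safe here.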