Q : " ... am I using multiprocessing
to the best of its ability ?"
A :
Well, that actually does not matter here at all.
Congratulations!
You happen to enjoy such a rare use-case, in which the so-called embarrassingly parallel process-orchestration spares you most of the otherwise present problems.
Incidentally, this is nothing new: the very same use-case reasoning was successfully applied by Peter Jackson's VFX team for the frame-by-frame video rendering, video post-processing and final laser deposition of each frame onto colour-film stock in his "Lord of the Rings" computing power-plant set up in New Zealand. His factory was full of Silicon Graphics workstations ( no Python was reported to have been there ), yet the workflow-orchestration principle was the same ...
Python multithreading
is irrelevant here, as it keeps all threads standing in a queue, waiting one after another for their turn to acquire the one-and-only central Python GIL-lock, so using it is rather an anti-pattern if you wish to gain processing speed here.
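A tiny, hypothetical self-test of that claim ( not the OP's code ): a pure-Python, CPU-bound function takes roughly the same wall-clock time whether its 4 calls run one after another or inside 4 threads, because the GIL lets only one thread execute Python bytecode at any moment:

```python
# a minimal, hypothetical demo: CPU-bound pure-Python work does not get faster with threads ( GIL )
import threading, time

def burn( n = 2_000_000 ):
    s = 0
    for i in range( n ):            # pure-Python, CPU-bound loop, holds the GIL while it runs
        s += i * i
    return s

t0 = time.perf_counter()
for _ in range( 4 ):
    burn()                          # 4 calls, one after another
serial = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [ threading.Thread( target = burn ) for _ in range( 4 ) ]
for t in threads: t.start()
for t in threads: t.join()          # 4 "concurrent" threads, still queueing for the one GIL
threaded = time.perf_counter() - t0

print( f"serial   {serial:.2f} [s]" )
print( f"threaded {threaded:.2f} [s]   ( about the same, or even worse )" )
```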
Python multiprocessing
is inappropriate here, even for as small a number as 4 or 8 worker-processes ( the less so for the ~1k promoted above ), as it
first
spends ( in this context anything but negligible ) [TIME]- and [SPACE]-domain costs on each spawning of a new, independent Python-interpreter process, copied at full scale, i.e. with all of its internal state & all of its data-structures ( expect RAM-/SWAP-thrashing whenever your host's physical memory gets over-saturated with that many copies of the same things: the virtual-memory management service of the O/S then starts, concurrently with running your "useful" work, to orchestrate memory SWAP-ins / SWAP-outs, as it thinks the just-O/S-scheduled process needs to fetch data that cannot fit or stay in-RAM, and that data is then no longer ~ N x 100 [ns] far from the CPU, but ~ Q x 10,000,000 [ns] far, on-HDD. Yes, you read that correctly: suddenly many orders of magnitude slower, just to re-read your "own" data, accidentally swapped away, while the CPU also becomes less available for your processing, as it has to perform all the introduced SWAP-I/O as well. Nasty, isn't it? Yet that is not all that hurts you ... )
next ( and this gets repeated for each of the 1,000 cases ... )
you will have to pay ( CPU-wise + MEM-I/O-wise + O/S-IPC-wise )
another awful penalty, here for moving data ( parameters ) from the "main" Python-interpreter process into the "spawned" Python-interpreter process, using DATA-SERialisation ( at CPU + MEM-I/O add-on costs ) + DATA-moving ( at O/S-IPC-service add-on costs; yes, DATA-size matters, again ) + DATA-DESerialisation ( again at CPU + MEM-I/O add-on costs ), all of that just to make the DATA ( parameters ) somehow appear "inside" the other Python-interpreter, whose GIL-lock will not compete with your central and other Python-interpreters ( which is fine, yet at such an awfully gigantic sum of add-on costs? Not so nice-looking once we understand the details, is it? )
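To see that second bill in numbers, here is a minimal, hypothetical sketch ( not the OP's pipeline ): it times the pickle-based SERialisation / DESerialisation of a single RGB frame, the very step the multiprocessing machinery has to repeat for every task's parameters and, again, for every returned result:

```python
# a minimal, hypothetical sketch just to make the SER/DES add-on cost visible
import pickle, time
import numpy as np

frame = np.random.randint( 0, 256, ( 1080, 1920, 3 ), dtype = np.uint8 )   # ~6 MB of pixel data

t0   = time.perf_counter()
blob = pickle.dumps( frame, protocol = pickle.HIGHEST_PROTOCOL )           # SER ( CPU + MEM-I/O )
t1   = time.perf_counter()
_    = pickle.loads( blob )                                                # DES ( CPU + MEM-I/O )
t2   = time.perf_counter()

print( f"SER {( t1 - t0 ) * 1e3:.1f} [ms]   DES {( t2 - t1 ) * 1e3:.1f} [ms]   size {len( blob ) / 1e6:.1f} [MB]" )
# multiply by ~1,000 tasks ( plus the O/S-IPC pipe transfer in between ) to see the hidden bill
```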
What can be done instead?
a)
split the list ( values ) of independent values, as was posted above, into, say, 4 parts ( a quad-core CPU with 2 hardware-threads per core ), and
b)
let the embarrassingly parallel ( independent ) problem get solved in a pure-[SERIAL] fashion by 4 Python processes, each one launched fully independently, each working on its respective quarter of the list( values ) ( a minimal launcher sketch follows below )
There will be zero add-on cost for doing so,
there will be zero add-on SER/DES penalty for 1000+ tasks' data distribution and results' recollection, and
there will be a reasonably distributed workload across the CPU-cores ( thermal throttling will appear for all of them, as the CPU-core temperatures may and will grow, so no magic but sufficient CPU-cooling can save us here anyway )
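A minimal sketch of such a launch, under stated assumptions ( "worker.py" is a hypothetical stand-in for the OP's existing per-item script and the frame names are made up ), using nothing but the standard subprocess module:

```python
# a minimal, hypothetical launcher sketch ( not the OP's code ):
# split the list( values ) into 4 quarters and start 4 fully independent
# Python interpreters, one per quarter -- no multiprocessing module, no SER/DES, no per-task IPC
import subprocess, sys

values  = [ f"frame_{i:04d}.png" for i in range( 1000 ) ]       # stand-in for the OP's ~1k independent items
N_PARTS = 4                                                      # quad-core CPU assumed, 1 process per core

quarters = [ values[ i::N_PARTS ] for i in range( N_PARTS ) ]    # 4 disjoint quarters of the list( values )

# "worker.py" is a hypothetical name for the OP's existing per-item script,
# here assumed to accept its share of work items as command-line arguments
procs = [ subprocess.Popen( [ sys.executable, "worker.py", *quarter ] )
          for quarter in quarters ]

for p in procs:              # wait for all 4 fully independent interpreters to finish
    p.wait()
```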
One may also test whether the PIL.Image
processing itself could get faster, if re-implemented with OpenCV and smart, vectorised numpy.ndarray() processing tricks, yet that is another Level-of-Detail of performance boosting, to be addressed once the gigantic overhead costs reminded above have been avoided.
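A hypothetical sketch of that idea, just to show the direction ( the brightness tweak stands in for whatever per-pixel work the OP's pipeline really does ): the same operation once as a pure-Python per-pixel loop over a PIL image and once as a single vectorised OpenCV call on a numpy.ndarray:

```python
# a minimal, hypothetical sketch of the vectorisation idea ( not the OP's pipeline )
import numpy as np
import cv2                                    # pip install opencv-python
from PIL import Image

img = Image.new( "RGB", ( 640, 480 ) )        # stand-in for one of the OP's frames

# pure-Python, pixel-by-pixel ( slow: hundreds of thousands of interpreter-level operations )
px = img.load()
for y in range( img.height ):
    for x in range( img.width ):
        r, g, b = px[ x, y ]
        px[ x, y ] = ( min( r + 10, 255 ), min( g + 10, 255 ), min( b + 10, 255 ) )

# vectorised ( fast: a single call, the loop runs in compiled code )
arr      = np.asarray( img, dtype = np.uint8 )           # PIL image -> numpy ndarray
brighter = cv2.add( arr, np.full_like( arr, 10 ) )       # saturating add over the whole frame at once
out      = Image.fromarray( brighter )
```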
Except for using a magic wand, there is no other magic possible on the Python-interpreter side here.