Q :
" ... trying to learn about multiprocessing in order to solve a problem at work. "
A :
the single most important piece of experience to learn
is how BIG the COSTS-of-( process )-INSTANTIATION(s) are,
all other add-on overhead costs
( still by no means negligible, and the more so as the scale of the problem grows )
are details in comparison to this immense & principal one.
Before the answer is read through and completely understood, here is a Live GUI-interactive simulator of how much we will have to pay to start using more than 1 stream of process-flow orchestration. The costs vary - lower for threads, larger for MPI-based distributed operations, highest for multiprocessing-processes, as used in the Python Interpreter, where N-many copies of the main Python Interpreter process first get copied ( RAM-allocated and O/S-scheduler spawned ). As of 2022-Q2 there are still reports of issues when less expensive backends try to avoid this cost, only to run into problems with deadlocking on wrongly-shared, ill-copied or forgotten-to-copy, already blocking MUTEX-es and similar internalities - so even the full-copy is not always safe in 2022. Not having met them in person does not mean these do not still exist, as documented by countless professionals ( a story about a pool of sharks is a good place to start from ).
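Before touching the simulator, here is a minimal, hedged sketch ( assuming a stock CPython 3.x and the spawn start-method; all names below are illustrative, not part of the simulator ) of how to put a real number on one process-instantiation on your own platform:

    import time
    import multiprocessing as mp

    def _noop():
        """Do nothing - we only want to price the spawn itself."""
        pass

    if __name__ == "__main__":
        mp.set_start_method( "spawn", force = True )     # the costlier, full-copy start-method
        t0 = time.perf_counter()
        p  = mp.Process( target = _noop )
        p.start()                                        # RAM-allocate + O/S-schedule a full copy
        p.join()                                         # wait for it to finish doing nothing
        t1 = time.perf_counter()
        print( f"one process-instantiation ~ {( t1 - t0 ) * 1e3:8.1f} [ms]" )

        t2 = time.perf_counter(); _noop(); t3 = time.perf_counter()
        print( f"one plain local call      ~ {( t3 - t2 ) * 1e6:8.1f} [us]" )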

Inventory of problems :
a ) pickling lambdas ( and many other SER/DES blockers )
is easy - it is enough to conda install dill and import dill as pickle, as dill has been able to pickle them for years - credit to @MikeMcKearns - and your code does not need to refactor its use of the plain pickle.dumps()-call interface. Likewise, pathos.multiprocessing defaults to using dill internally, so this long-known multiprocessing SER/DES weakness gets avoided ( a minimal sketch follows below ).
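A minimal sketch of the point above, assuming dill and pathos are installed; the lambda and the pool-size are illustrative only:

    # a lambda, which the stdlib pickle refuses to serialise, goes through dill just fine
    import dill as pickle                      # drop-in replacement, same dumps()/loads() interface

    f    = lambda x: x * x
    blob = pickle.dumps( f )                   # works, thanks to dill
    assert pickle.loads( blob )( 7 ) == 49

    # pathos uses dill internally, so lambdas may go straight into a process-pool
    from pathos.multiprocessing import ProcessingPool as Pool

    if __name__ == "__main__":
        pool = Pool( 2 )                       # 2 worker-processes, illustrative only
        print( pool.map( f, range( 8 ) ) )     # [0, 1, 4, 9, 16, 25, 36, 49]
        pool.close(); pool.join()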
b ) performance killers
- multiprocessing.Pool.map() is rather an End-to-End performance anti-pattern here - The Costs, once we stop neglecting them, show how many CPU-clocks & blocked physical-RAM-I/O transfers get paid for so many process-instantiations ( 60+ ), which finally "occupy" almost all physical CPU-cores, thus leaving almost zero space for the indeed high-performance numpy-native multicore-computing of the core problem ( for which the ultimate performance was expected to be boosted up, wasn't it? )
- just move the p-slider in the simulator to anything less than 100% ( 100% would mean having no [SERIAL]-part of the problem execution at all, which is nice in theory, yet never doable in practice - even the program launch is purely [SERIAL], by design )
- just move the Overhead-slider in the simulator to anything above a plain zero ( expressing the relative add-on cost of spawning one of the NCPUcores processes, as a percentage of the number of instructions in the [PARALLEL]-section of the work ) - mathematically "dense" work has many such "useful" instructions and may, supposing no other performance killers jump out of the box, afford to spend some reasonable amount of "add-on" costs to spawn some amount of concurrent or parallel operations ( the actual number depends only on the actual economy of costs, not on how many CPU-cores are present, the less on our "wishes" or scholastic or, even worse, copy/paste-"advice" ). On the contrary, mathematically "shallow" work almost always yields "speedups" << 1 ( immense slow-downs ), as there is almost no chance to justify the known add-on costs ( paid on process-instantiations and on the data SER/xfer/DES of moving params in and results back )
- next move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case when the actual process-spawning overhead ( a time lost ) costs are still not more than a mere <= 1 % of all the computing-related instructions that will next be performed for the "useful" part of the work, computed inside the so-spawned process-instance. So even such a 1:100 proportion factor ( doing 100x more "useful" work than the CPU-time lost on arranging that many copies and on making the O/S-scheduler orchestrate concurrent execution thereof inside the available system Virtual-Memory ) already shows all the warnings visible in the graph of the progression of Speedup-degradation - just play a bit with the Overhead-slider in the simulator, before touching the others ( a small numeric sketch of this effect follows right after this list )
- avoid the sin of "sharing" ( if performance is the goal ) - again, operating such orchestration among several, now independent, Python Interpreter processes takes additional add-on costs, never justified by any performance gained, as the fight for occupying shared resources ( CPU-cores, physical-RAM-I/O channels ) only devastates CPU-core-cache re-use hit-rates and multiplies O/S-scheduler operated process context-switches, all of which further downgrades the resulting End-to-End performance ( which is something we do not want, do we? )
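For the slider-bullets above, a minimal numeric sketch of an overhead-strict speedup model ( my own simplification, mirroring the p-slider and the Overhead-slider, not the simulator's exact formula ), showing how even a 1 % per-process add-on cost first caps and then degrades the achievable speedup:

    # a (1-p) [SERIAL]-fraction plus a p [PARALLEL]-fraction split over N processes,
    # each process-instantiation paying an add-on overhead o, expressed as a fraction
    # of the whole [PARALLEL]-section work ( a simplification, for illustration only )
    def speedup( p = 0.95, N = 60, o = 0.01 ):
        serial   = 1.0 - p                 # cannot be parallelised, by design
        parallel = p / N                   # ideal split of the [PARALLEL]-part
        overhead = N * o * p               # N instantiations, each costing o of the p-part
        return 1.0 / ( serial + parallel + overhead )

    for N in ( 1, 2, 4, 8, 16, 32, 60 ):
        print( f"N = {N:2d}  ->  speedup ~ {speedup( N = N ):5.2f} x" )
    # with o == 0.01 the curve tops out near N ~ 8 and then degrades - exactly the
    # Speedup-degradation the Overhead-slider makes visible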
c ) boosting performance
- respect facts about the actual costs of any kind of computing operation
- avoid "shallow"-computing steps,
- maximise the amount of work that gets mapped onto the so expensively instantiated set of distributed processes ( if a need for them remains at all ),
- avoid all overhead-adding operations ( like adding local temporary variables, where inline operations permit storing partial results in place )
and
- try to go into using the ultra-performant, cache-line friendly & optimised, native numpy-vectorised multicore & striding-tricks capabilities, not blocked by CPU-cores pre-overloaded by scheduling so many (~60) Python Interpreter process copies, each one trying to call numpy-code and thus leaving no free cores to actually place such high-performance, cache-reuse-friendly vectorised computing onto ( there we gain-or-lose most of the performance - not in slow-running serial iterators, nor in spawning 60+ process-based full copies of the "__main__" Python Interpreter before doing a single piece of the useful work on our great data, expensively RAM-allocated and physically copied 60+ times therein ) - a vectorisation sketch follows below
- refactoring of the real problem shall never go against the collected knowledge about performance, as repeating the things that do not work will not bring any advantage, will it?
- respect your physical platform constraints, as ignoring them will degrade your performance
- benchmark, profile, refactor
- benchmark, profile, refactor
- benchmark, profile, refactor
no other magic wand available here
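A minimal sketch of what "numpy-vectorised, striding-tricks" computing means in practice ( a moving-average toy, assuming numpy >= 1.20 for sliding_window_view; the sizes and the kernel are illustrative only ):

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    x = np.random.rand( 1_000_000 )            # some "great data", allocated once

    # a slow, pure-python serial iterator over the very same problem
    def moving_avg_loop( a, w = 32 ):
        return [ sum( a[i:i + w] ) / w for i in range( len( a ) - w + 1 ) ]

    # a numpy striding-tricks view + a vectorised reduction - the windowing is a view,
    # not a copy, and the reduction runs in optimised, cache-friendly native code
    def moving_avg_numpy( a, w = 32 ):
        return sliding_window_view( a, w ).mean( axis = 1 )

    # the vectorised form typically wins by orders of magnitude on a single core,
    # before any process-spawning is even considered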
Once already working on the bleeding edge of performance, set gc.disable() before you spawn the Python Interpreter into N-many replicated copies, so as not to wait for spontaneous garbage-collections when going for ultimate performance ( a minimal sketch follows ).
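A minimal sketch of that last tip ( the pool-size, the worker and the re-enabling of gc afterwards are all illustrative choices, not part of the original advice ):

    import gc
    import multiprocessing as mp

    def heavy_kernel( chunk ):                 # illustrative CPU-bound worker
        return sum( i * i for i in chunk )

    if __name__ == "__main__":
        gc.disable()                           # no spontaneous garbage-collections from now on
        try:
            with mp.Pool( processes = 4 ) as pool:
                chunks  = [ range( k, k + 10_000 ) for k in range( 0, 40_000, 10_000 ) ]
                results = pool.map( heavy_kernel, chunks )
            print( sum( results ) )
        finally:
            gc.enable()                        # restore the default behaviour once done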