Q : "How this computation could be elegantly parallelized via joblib
or other widely used library?"
If using `joblib`, the main Python interpreter will spawn other, GIL-lock-independent copies of itself ( yes, a huge memory-I/O cost to copy the whole Python interpreter state, including all data structures, on O/S Windows, a somewhat less horrible initial latency hit on Linux-type O/S ), yet the worst is yet to come - any "remote" modification of the spawned / distributed replicas of the original data has to somehow make it back to the main Python process ( yes, huge memory-I/O + cache-(de)coherency hardware workloads ( plus per-core L1-data cache efficiency almost for sure devastated ) ).

So this trick does not easily pay for its own add-on costs, unless the `function()` computation is indeed many times above the costs of process instantiation + process-to-process data interchange ( SER/DES on the way "there" ( one can imagine the `pickle.dumps()` memory allocation + pickling compression/decompression costs ) + SER/DES on the way "back" + the actual p2p-communication latencies ( costs ) to move the pickled data elements ).
One might like to read more on this here and here and here.
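Below is a minimal, hedged sketch of the cost check implied above: it times one `pickle.dumps()` / `pickle.loads()` round-trip of the payload against the payload's own compute time, before any `joblib.Parallel()` refactoring is attempted. All names ( `fun()`, `A`, `B`, the array sizes ) are illustrative placeholders, not taken from the original code.

```python
# Sketch: compare per-task SER/DES costs with per-task compute costs,
# so as to see whether process-based parallelism can ever pay off.
import pickle
from timeit import timeit

import numpy as np

A = np.random.rand( 1000, 1000 )        # illustrative payload shipped "there"
B = np.random.rand( 1000, 1000 )        # illustrative payload shipped "back"

def fun( a, b ):                        # illustrative per-task computation
    return a + b

ser_cost = timeit( lambda: pickle.loads( pickle.dumps( A ) ), number = 10 ) / 10
cpu_cost = timeit( lambda: fun( A, B ),                       number = 10 ) / 10

print( "SER/DES round-trip ~ {0:.6f} [s]".format( ser_cost ) )
print( "pure compute       ~ {0:.6f} [s]".format( cpu_cost ) )

# Only if cpu_cost is many times above ser_cost ( plus the process-instantiation
# costs ) does a joblib-based split start to make sense, e.g.:
#
#   from joblib import Parallel, delayed
#   results = Parallel( n_jobs = -1 )( delayed( fun )( A, B ) for _ in range( 8 ) )
```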
Is There Any Better Way Forwards?
We have all for sure heard about `numpy` and smart `numpy`-vectorised processing. Many thousands of man*years of top-level HPC experience were put into the `numpy` smart, data-I/O-aware, vectorised processing.
So in most cases, if you try to redesign a `function( scalarA, scalarB )` returning a single `scalarResult` to be stored into an externally 2D-looped `A[i,j]`, into an in-place modifying `function( vectorX_data, matrixA_results )`, and let the inner code thereof do both the `i,j`-looping over the actual `matrixA_results.shape[0]` and the actual computing, the results may get astonishingly faster, if the `numpy`-code can harness the smart CPU-vector instructions, which pay less than `0.5 [ns]` L1-data access latency, compared to as much as `300 ~ 380 [ns]` RAM access latency ( if the memory-I/O channel were free and permitting an unenqueued data transfer from the slow & far RAM-memory, not mentioning the somewhat latency-masked `10.000.000+ [ns]` access-costs of using a `numpy.memmap()`-file-based data proxy ).
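A minimal sketch of that redesign, assuming a simple per-cell formula ( the names `function_scalar()`, `function_inplace()`, `vectorX_data`, `matrixA_results` and the formula itself are hypothetical, used only to make the transformation concrete ):

```python
import numpy as np

def function_scalar( scalarA, scalarB ):
    # original per-cell work, called from an external, pure-python 2D loop
    return scalarA * scalarA + scalarB

def function_inplace( vectorX_data, matrixA_results ):
    # redesigned variant: all the i,j-work happens inside numpy,
    # with results stored in-place into the pre-allocated matrixA_results
    np.add( vectorX_data[:, np.newaxis] ** 2,      # broadcasts scalarA * scalarA per row
            vectorX_data[np.newaxis, :],           # broadcasts scalarB per column
            out = matrixA_results )                # no new result allocation

vectorX_data    = np.random.rand( 2048 )
matrixA_results = np.empty( ( 2048, 2048 ) )

# the slow, externally 2D-looped baseline would look like this:
# for i in range( matrixA_results.shape[0] ):
#     for j in range( matrixA_results.shape[1] ):
#         matrixA_results[i, j] = function_scalar( vectorX_data[i], vectorX_data[j] )

function_inplace( vectorX_data, matrixA_results )  # one vectorised call instead
```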
If one has never visited the domain of `numpy`-tricks with smart-vectorised processing, do not hesitate to read as many posts as possible from a true master of this domain, guru @Divakar - all respect to them!