
I routinely need to populate matrices A[i,j] by evaluating a function on pairs of vectors. Since the computation of every i,j-pair is independent of the others, I want to parallelize this:

import numpy as np

A = np.zeros((n, n))

# every (i, j) pair in the strict upper triangle is independent
for i in range(n):
    for j in range(i + 1, n):
        A[i, j] = function(X[i], X[j])

How could this computation be elegantly parallelized with joblib or another widely used library?

Sengiley

1 Answer


Q : "How this computation could be elegantly parallelized via joblib or other widely used library?"

If you use joblib, the main python interpreter will spawn other, GIL-lock-independent copies of itself. On O/S Windows this means a huge memory-I/O cost to copy all of the python interpreter state, including all data structures; on linux-type O/S the initial latency hit is somewhat less horrible. Yet the worse is only to come: any "remote" modification of the spawned/distributed replicas of the original data has to somehow make it back to the main python process, again at a huge memory-I/O cost, plus cache-(de)coherency hardware workloads (with per-core L1 data-cache efficiency almost certainly devastated).

So this trick does not easily pay for its own add-on costs, unless the function() computation is indeed many times more expensive than process instantiation plus process-to-process data interchange: SER/DES on the way "there" (imagine the pickle.dumps() memory allocation plus pickling compression/decompression costs), SER/DES on the way "back", and the actual p2p-communication latencies of moving the pickled data elements.
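If function() really is that heavy, a minimal sketch of the joblib route could look like the following. Note the hedges: upper_row() is a hypothetical helper introduced here, and the toy function(), X and n are stand-ins for the question's own objects; the point is one task per outer-loop index, so each process-dispatch amortises n-i inner calls.

import numpy as np
from joblib import Parallel, delayed

# toy stand-ins for the question's data -- replace with the real ones
n = 200
X = np.random.rand(n, 16)

def function(a, b):                  # placeholder pairwise kernel
    return np.dot(a, b)

def upper_row(i):                    # hypothetical helper: one row of i < j pairs
    return [function(X[i], X[j]) for j in range(i + 1, n)]

# one task per outer-loop index; n_jobs=-1 uses all available cores
rows = Parallel(n_jobs=-1)(delayed(upper_row)(i) for i in range(n))

A = np.zeros((n, n))
for i, vals in enumerate(rows):
    A[i, i + 1:] = vals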



Is There Any Better Way Forward?

We have all surely heard about numpy and smart numpy-vectorised processing. Many thousands of man*years of top-level HPC experience have been put into numpy's smart, data-I/O-efficient vectorised processing.

So in most cases, if you try to redesign function( scalarA, scalarB ), which returns a single scalarResult to be stored into an externally 2D-looped A[i,j], into an in-place modifying function( vectorX_data, matrixA_results ), and let its inner code do both the i,j-looping over the actual matrixA_results.shape[0] and the actual computing, the results may get astonishingly faster. This happens whenever the numpy code can harness the smart CPU vector instructions, which pay less than 0.5 [ns] L1 data-access latency, compared to as much as 300 ~ 380 [ns] RAM-access latency (and that only if the memory-I/O channel is free and permits unenqueued data transfer from the slow and far RAM, not even mentioning the somewhat latency-masked 10,000,000+ [ns] access costs of using a numpy.memmap()-file-based data proxy).
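As a minimal sketch of that redesign, assume for illustration that function( a, b ) is the Euclidean distance between two row-vectors; any kernel expressible in numpy primitives follows the same pattern. Broadcasting then replaces both Python-level loops in one shot:

import numpy as np

n, d = 1000, 16
X = np.random.rand(n, d)             # one row-vector per sample

# broadcasting builds all (i, j) difference vectors at once: shape (n, n, d)
diff = X[:, None, :] - X[None, :, :]
A = np.sqrt((diff ** 2).sum(axis=-1))

# keep only the strict upper triangle, matching the original i < j loop
A = np.triu(A, k=1)

The (n, n, d)-shaped intermediate trades memory for speed; for very large n one would process the outer axis in chunks rather than materialising all pairs at once.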

If you have never visited the domain of numpy tricks with smart-vectorised processing, do not hesitate to read as many posts as possible from a true master of this domain, guru @Divakar. All respect to them!

user3666197
  • okay, you say that joblib is overkill in my situation and suggest vectorized operations instead; moreover, you invite me to think! – Sengiley Nov 07 '20 at 22:29
  • but I want a quick example of how to do that. I also emphasize that the function operates on pairs of vectors of numbers (or even matrices) and returns a single number, thanks – Sengiley Nov 07 '20 at 22:31
  • Ok, feel free to re-read the vectorised code examples in https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython/62262503#62262503 (or more from the smart-vectorisation guru-of-the-gurus @Divakar). Once you get the concept of vectorised operations ([:]-vector, [:,?]-matrix or [:,?,?]-tensor oriented syntax tricks), your code will burst into numpy-available super-performance – user3666197 Nov 07 '20 at 22:59