
I have a coordinate array of shape X.shape = (100000, 1000) and a property array of shape P.shape = (100000, 1000). I wish to calculate the distance-dependent autocorrelation function of the property P, which I explain below.

First, recall that the "bare" autocorrelation of P would be Autocorrelation = [np.mean(P[tau:]*P[:-tau], axis=(0, 1)) for tau in range(1, P.shape[0])] (the tau = 0 term needs separate handling, since P[:-0] is an empty slice).
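For concreteness, a minimal runnable version of this (with the tau = 0 term computed separately, and toy shapes chosen only for illustration) could be:

import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 100))      # toy shapes for illustration

# tau = 0: plain mean of P**2; tau > 0: mean over the overlapping frames
Autocorrelation = [np.mean(P * P)]
Autocorrelation += [np.mean(P[tau:] * P[:-tau]) for tau in range(1, P.shape[0])]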

For the distance-dependent autocorrelation, I have to find all pairs of points whose tau-shifted coordinates are at a given distance d. That is, for each time t along axis=0 I look for all indices i and j along axis=1 for which np.abs(X[t, i] - X[t+tau, j]) == d, collect the corresponding products P[t, i]*P[t+tau, j], and then build the averages for each tau and d.
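As a toy illustration of the pair condition (shapes and values are mine, chosen only for readability):

import numpy as np

X = np.array([[0.0, 1.0],
              [0.0, 2.0]])                     # 2 time steps, 2 points
P = np.array([[1.0, 2.0],
              [3.0, 4.0]])

tau, t, d = 1, 0, 1
dist = np.abs(X[t][:, None] - X[t + tau])      # pairwise |X[t, i] - X[t+tau, j]|
i, j = np.where(dist == d)                     # here: pairs (1, 0) and (1, 1)
print(np.mean(P[t, i] * P[t + tau, j]))        # mean of 2*3 and 2*4 -> 7.0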

For a given tau, I wrote the function shown below:

import numpy as np
from joblib import Parallel, delayed

X = np.ones([100000, 1000])
P = np.ones([100000, 1000])

def myFunc(tau):

    AutocorrelationP = []

    for t in range(X.shape[0] - tau):

        # pairwise distances between frame t and frame t+tau, rounded to integer bins
        d = np.round(np.sqrt((X[t][:, None] - X[t + tau])**2))

        Pd = []

        for i in np.arange(1, 11):

            # all index pairs (along axis=1) whose distance falls in bin i
            Indices = np.where(d == i)

            Pd.append(np.mean(P[t, Indices[0]] * P[t + tau, Indices[1]]))

        AutocorrelationP.append(Pd)

    return np.mean(np.array(AutocorrelationP), axis=0)

I call Parallel from joblib to run this function on 10 CPU cores for the different values of tau. With 10 workers the script behaves as expected and all CPUs are loaded at 100% (shown in green in htop).
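The driver is essentially the following (the tau grid and n_jobs here are illustrative placeholders):

# run myFunc for several independent tau values in parallel
taus = range(1, 11)
results = Parallel(n_jobs=10)(delayed(myFunc)(tau) for tau in taus)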

The problem is that the parallelization overhead becomes large if I use all 64 CPUs on my node. I expected that calling myFunc 64 times in parallel from joblib would involve essentially no overhead, since the tasks are independent (except that they read from the same arrays). But this only holds for up to about 10 parallel jobs; beyond that, the runtime grows roughly linearly with the number of jobs.

I repeated this "runtime vs. number of jobs" measurement for different sizes of P and X to understand the problem, and I saw that this behavior only appears when X.shape[1] is large, while the overhead disappears when X.shape[1] is roughly 100. In the extreme case of X.shape[1] = 5000, all CPUs turn red in htop during execution. From these tests I conclude that memory is the main issue. X.shape[0] also has a noticeable, though less significant, effect.
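A rough back-of-the-envelope check (my own estimate, not from profiling) is consistent with a memory-bandwidth bottleneck: every worker materializes a fresh X.shape[1] x X.shape[1] float64 distance matrix, plus boolean masks and index arrays, at every time step:

n = 1000                              # X.shape[1]
dist_matrix_bytes = n * n * 8         # one float64 distance matrix
print(dist_matrix_bytes / 1e6, "MB")  # 8.0 MB per worker per time step
# with 64 workers this streams hundreds of MB through RAM on every iteration,
# so the cores stall on memory traffic rather than on compute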

Update

To avoid the memory problem, I tried to avoid building distance matrices via NumPy broadcasting and instead rely on explicit Python loops compiled with Numba. Here is my implementation:

import numpy as np
from numba import jit

@jit(nopython=True)
def myFunc(tau):

    PAutocorrelation = np.zeros(11)

    for t in range(X.shape[0] - tau):

        for i in range(X.shape[1]):

            for j in range(X.shape[1]):

                # accumulate the product into the bin given by the rounded pair distance
                PAutocorrelation[int(np.rint(np.abs(X[t, i] - X[t + tau, j])))] += P[t, i] * P[t + tau, j]

    return PAutocorrelation

This approach indeed shows less parallelization overhead when tested on smaller arrays. But with the array dimensions mentioned above, the kernel dies. So I suspect this Numba approach also suffers from a memory problem, although no additional large matrices are allocated here. Can you suggest why?
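For reference, a variant that passes X and P as arguments instead of reading them as globals (Numba freezes global arrays as compile-time constants) would look like the sketch below; myFuncArgs and the reduced toy shapes are my own, and joblib can then memory-map large array arguments instead of copying them per task:

import numpy as np
from numba import njit
from joblib import Parallel, delayed

X = np.ones((1000, 100))   # reduced toy shapes for illustration
P = np.ones((1000, 100))

@njit
def myFuncArgs(X, P, tau):   # hypothetical variant taking the arrays explicitly
    PAutocorrelation = np.zeros(11)
    for t in range(X.shape[0] - tau):
        for i in range(X.shape[1]):
            for j in range(X.shape[1]):
                PAutocorrelation[int(np.rint(np.abs(X[t, i] - X[t + tau, j])))] += P[t, i] * P[t + tau, j]
    return PAutocorrelation

results = Parallel(n_jobs=10)(delayed(myFuncArgs)(X, P, tau) for tau in range(1, 11))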

Finally, here is a slightly faster alternative based on fast_histogram, but it suffers from the same memory problem as the first approach:

from fast_histogram import histogram1d

def myFunc(tau):

    PAutocorrelation = []

    for t in range(X.shape[0] - tau):

        # histogram the pair products by pair distance in a single pass
        PAutocorrelation.append(
            histogram1d(np.abs(X[t][:, None] - X[t + tau]).flatten(),
                        bins=10, range=[0., 10.],
                        weights=(P[t][:, None] * P[t + tau]).flatten()))

    return PAutocorrelation
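One possible mitigation for this version (a sketch of my own; myFuncChunked and the chunk size are illustrative and not tested at full scale) would be to process axis=1 in blocks, so each broadcast temporary is (chunk, X.shape[1]) instead of the full square matrix:

import numpy as np
from fast_histogram import histogram1d

def myFuncChunked(tau, chunk=100):
    PAutocorrelation = np.zeros(10)
    for t in range(X.shape[0] - tau):
        for start in range(0, X.shape[1], chunk):
            xi = X[t, start:start + chunk][:, None]   # (chunk, 1) block of frame t
            pi = P[t, start:start + chunk][:, None]
            # temporaries are only (chunk, X.shape[1]) large
            PAutocorrelation += histogram1d(np.abs(xi - X[t + tau]).ravel(),
                                            bins=10, range=[0., 10.],
                                            weights=(pi * P[t + tau]).ravel())
    # average of the per-frame histograms over all frames
    return PAutocorrelation / (X.shape[0] - tau)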
  • In your previous question, I provided an [answer](https://stackoverflow.com/a/76972772/12939557) which also applies here to strongly improve the performance of your script. Did you try the suggestion (about using Numba and loops)? Your current script creates a (huge?) distance matrix which is only used to search for 3 values. It is far more efficient to search for the values on the fly and not create the whole distance matrix, especially since `Distances` is traversed 3 times. – Jérôme Richard Aug 29 '23 at 19:15
  • By the way, while GPUs can make the distance computation faster, the computation will likely be memory-bound like on CPU. This means the speed-up will not be huge (and the hardware will be used inefficiently). This is why the script needs to be optimised first. Also please note that PC GPUs are slow for computing double-precision numbers, and a fast GPU implementation may be significantly more complex than your current code. – Jérôme Richard Aug 29 '23 at 19:18
  • Finally, please share a minimal reproducible example. The current one is not written in valid Python (eg. `for i in 0, 5, 9:`). – Jérôme Richard Aug 29 '23 at 19:20
  • Thanks for your comments. This question is different from the previous ones: here I do not see unloaded or red CPUs in htop, which is why I think it is different. Replacing numpy mean and numpy addition with loops is really difficult here, but I will consider it in order to use numba. I will also think about ways to avoid allocating the distance matrix. But please note that the code is written in valid Python, including the loop you mention. – YoussefMabrouk Aug 29 '23 at 19:43
  • @JérômeRichard that's actually syntactically valid, it is a tuple literal. – juanpa.arrivillaga Aug 31 '23 at 19:09
  • @juanpa.arrivillaga Indeed, my bad for that. – Jérôme Richard Sep 02 '23 at 08:36
  • I guess the question has been superseded by the [newer one](https://stackoverflow.com/questions/77014530/overhead-in-parallel-tasks-in-c). Note that the C++ code can be improved a bit if you are searching for some specific values like in this code (<= 3 values should give a small speed-up and 1 value should give a significant one). Also note that the Numba code would look like the C++ code. There is no chance for Numpy to be even close to the C++ code in terms of speed. Numba may be closer but still significantly slower than the C++ code (due to the overhead of the wraparound, for example). – Jérôme Richard Sep 02 '23 at 08:46

0 Answers