I have a coordinate array of shape X.shape = (100000, 1000) and a property array of shape P.shape = (100000, 1000). I wish to calculate the distance-dependent autocorrelation function of property P, which I explain below.

First, a reminder that the "bare" autocorrelation of P would be

Autocorrelation = [np.mean(P[tau:]*P[:-tau], axis=(0, 1)) for tau in range(1, P.shape[0])]
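For reference, here is a self-contained toy version of this one-liner (a sketch; P_small and its shape are only stand-ins for P):

import numpy as np

rng = np.random.default_rng(0)
P_small = rng.normal(size=(500, 50))   # small stand-in for P, shape (time, particle)

# mean of P(t, i) * P(t + tau, i) over all time origins and particles, one value per tau
bare = [np.mean(P_small[tau:] * P_small[:-tau], axis=(0, 1))
        for tau in range(1, P_small.shape[0])]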
For the distance-dependent autocorrelation, I have to find all points whose tau-shifted coordinates are at a given distance d. That is, I look for all indices i and j along axis=1 for which np.abs(X[time, i]-X[time+tau, j]) == d, I save the times along axis=0 each time I find such a pair, and then I build the averages for each tau and d.
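To make this concrete, here is a minimal sketch of the quantity for a single pair of tau and d on small toy arrays (the array shapes, the rounding to integer distances, and the names Xs, Ps, tau, d are only for illustration):

import numpy as np

rng = np.random.default_rng(0)
Xs = rng.uniform(0.0, 10.0, size=(200, 50))  # toy coordinates, shape (time, particle)
Ps = rng.normal(size=(200, 50))              # toy property,    shape (time, particle)

tau, d = 5, 3.0
frame_means = []
for t in range(Xs.shape[0] - tau):
    # all pairs (i, j) whose tau-shifted separation rounds to d
    sep = np.round(np.abs(Xs[t][:, None] - Xs[t + tau]))
    i, j = np.where(sep == d)
    frame_means.append(np.mean(Ps[t, i] * Ps[t + tau, j]))
# average over the time origins to get the value for this (tau, d)
autocorr_tau_d = np.mean(frame_means)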
For a given tau I wrote the function shown below:
import numpy as np
from joblib import Parallel, delayed

X = np.ones([100000, 1000])
P = np.ones([100000, 1000])

def myFunc(tau):
    AutocorrelationP = []
    for t in range(X.shape[0] - tau):
        # rounded pairwise separations between the frame at t and the frame at t+tau
        d = np.round(np.sqrt((X[t][:, None] - X[t + tau])**2))
        Pd = []
        for i in np.arange(1, 11):
            # all pairs whose separation rounds to the distance bin i
            Indices = np.where(d == i)
            Pd.append(np.mean(P[t, Indices[0]] * P[t + tau, Indices[1]]))
        AutocorrelationP.append(Pd)
    return np.mean(np.array(AutocorrelationP), axis=0)
I call Parallel from joblib to run this function on 10 CPU cores for different values of tau. At the moment the script seems to run fine, and all CPUs are loaded to 100% (green) in htop.
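The driver looks roughly like this (a sketch; the set of tau values below is only illustrative):

from joblib import Parallel, delayed

taus = range(1, 11)   # illustrative set of lag values, one independent task per tau
results = Parallel(n_jobs=10)(delayed(myFunc)(tau) for tau in taus)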
The problem is that the parallelization overhead gets large if I use all 64 CPUs on my node. I expected that calling myFunc 64 times in parallel from joblib would involve essentially no overhead, since the tasks are independent (except that they read from the same variables). But this only seems to hold for up to about 10 jobs in parallel; beyond that I see a roughly linear increase of the runtime with the number of jobs.
I did this "runtime vs job number" calculation for different sizes of P
and X
to understand the problem, and I saw that this behavior is only seen if X.shape[1]
is large, while the overhead disappears if X.shape[1]
is roughly 100. In the extreme case of X.shape[1]=5000
all CPUs turn to red during execution. So from these tests I understand that memory is the main issue. X.shape[0]
has also a less significant but still noticeable effect.
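For reference, a rough estimate of the temporaries behind this observation (a sketch, assuming float64 arrays; the exact number of temporaries NumPy materializes may differ):

n_particles = 1000                       # X.shape[1]
bytes_per_temp = n_particles**2 * 8      # one float64 (n, n) temporary, e.g. X[t][:, None] - X[t+tau]

# Each time step builds several such (n, n) temporaries (difference, square, rounded copy,
# plus a boolean mask from d == i), and with the default joblib backend every worker process
# holds its own set, so the peak footprint grows roughly like X.shape[1]**2 * n_jobs.
print(bytes_per_temp / 1e6, "MB per (1000, 1000) float64 temporary")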
Update
To avoid the memory problem, I tried to avoid building distance matrices with NumPy broadcasting and instead rely on explicit Python loops decorated with numba. Here is my implementation:
from numba import jit

@jit(nopython=True)
def myFunc(tau):
    PAutocorrelation = np.zeros(11)
    for t in range(X.shape[0] - tau):
        for i in range(X.shape[1]):
            for j in range(X.shape[1]):
                # accumulate the product into the bin given by the rounded separation
                PAutocorrelation[int(np.rint(np.abs(X[t, i] - X[t + tau, j])))] += P[t, i] * P[t + tau, j]
    return PAutocorrelation
This approach actually seems better with regard to parallelization overhead when tested on smaller arrays. But with the array dimensions mentioned above, the kernel dies. So I think this numba approach also suffers from a memory problem, although no additional large matrices are computed here. Can you suggest why?
Finally, here is a slightly improved, faster alternative based on numpy, but it suffers from the same problem as the first approach:
from fast_histogram import histogram1d

def myFunc(tau):
    PAutocorrelation = []
    for t in range(100):
        # bin all pairwise separations into 10 distance bins,
        # weighting each pair by the product of the properties
        PAutocorrelation.append(
            histogram1d(np.abs(X[t][:, None] - X[t + tau]).flatten(),
                        bins=10, range=[0., 10.],
                        weights=(P[t][:, None] * P[t + tau]).flatten()))
    return PAutocorrelation
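For completeness, the list returned here would then be reduced to one curve per tau, in the same spirit as the np.mean(..., axis=0) in the first function (a sketch, assuming the myFunc above and an illustrative tau):

import numpy as np

tau = 5                                        # illustrative lag value
per_frame = np.array(myFunc(tau))              # shape (number of time origins, 10)
PAutocorrelation_tau = per_frame.mean(axis=0)  # one value per distance bin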