We can optimize for both memory and performance by working directly with the underlying array data.
Approach #1
First off, let's get an array-based solution to compute the correlation coefficient between corresponding elements of two 1D arrays. This is basically inspired by this post and would look something like this -
import numpy as np

def corrcoeff_1d(A, B):
    # Get means of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(-1, keepdims=1)
    B_mB = B - B.mean(-1, keepdims=1)

    # Sum of squares
    ssA = np.einsum('i,i->', A_mA, A_mA)
    ssB = np.einsum('i,i->', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('i,i->', A_mA, B_mB) / np.sqrt(ssA * ssB)
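As a quick sanity check (illustrative, not from the original post), its output should match np.corrcoef on random data -

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
b = rng.standard_normal(100)
print(np.allclose(corrcoeff_1d(a, b), np.corrcoef(a, b)[0, 1]))  # True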
Now, to use it, run the same loop, but on the array data -
lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)
for i in range(N):
    out[i] = corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])
We can optimize further on performance by pre-computing the rolling mean values used for A_mA in corrcoeff_1d with convolution (a sketch follows), but first let's get the memory error out of the way.
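Here's a minimal sketch of that convolution idea, assuming ar and lcl as defined above; np.convolve with a uniform kernel yields the rolling means of all full-length windows in one shot -

# Rolling means over windows of length lcl (sketch; assumes ar, lcl exist)
kernel = np.ones(lcl) / lcl
roll_mean0 = np.convolve(ar[:, 0], kernel, mode='valid')  # shape (N-lcl+1,)
roll_mean1 = np.convolve(ar[:, 1], kernel, mode='valid')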
Approach #2
Here's an almost vectorized approach, as we would vectorize most of the iterations, except for the leftover slices at the end that won't have proper window lengths. The loop count would be reduced from 97165 down to lcl-1, i.e. a mere 99.
lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)
col0_win = strided_app(ar[:, 0], lcl, S=1)
col1_win = strided_app(ar[:, 1], lcl, S=1)
vectorized_out = corr2_coeff_rowwise(col0_win, col1_win)
M = len(vectorized_out)
out[:M] = vectorized_out
for i in range(M, N):
    out[i] = corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])
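To verify, this hybrid result should agree with the fully loopy Approach #1 (equal_nan tolerates the degenerate, too-short windows at the very end) -

# Optional consistency check (illustrative)
out_loopy = np.array([corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])
                      for i in range(N)])
print(np.allclose(out, out_loopy, equal_nan=True))  # expected True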
Helper funcs -
# https://stackoverflow.com/a/40085052/ @ Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))
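A tiny illustrative run shows what strided_app produces - each row is a window view into the input, with no copying -

print(strided_app(np.arange(6), L=3, S=1))
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]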
# https://stackoverflow.com/a/41703623/ @ Divakar
def corr2_coeff_rowwise(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(-1, keepdims=1)
    B_mB = B - B.mean(-1, keepdims=1)

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
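Again, as an illustrative check, the row-wise results should match per-row np.corrcoef calls -

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 10))
B = rng.standard_normal((4, 10))
ref = [np.corrcoef(x, y)[0, 1] for x, y in zip(A, B)]
print(np.allclose(corr2_coeff_rowwise(A, B), ref))  # True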
Correlation for NaN-filled data
Listed next are NumPy equivalents of Pandas' NaN-aware correlation computation: one for the scalar correlation between two 1D arrays and one for row-wise correlation values between two 2D arrays.
1) Scalar correlation value between two 1D arrays -
def nancorrcoeff_1d(A, B):
    # Get combined mask of positions valid (non-NaN) in both arrays
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum()

    # Get means over the valid entries & subtract from input arrays themselves
    A_mA = A - np.nansum(A * comb_mask, -1, keepdims=1) / count
    B_mB = B - np.nansum(B * comb_mask, -1, keepdims=1) / count

    # Replace NaNs with zeros, so that later summations could be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares
    ssA = np.inner(A_mA, A_mA)
    ssB = np.inner(B_mB, B_mB)

    # Finally get corr coeff
    return np.inner(A_mA, B_mB) / np.sqrt(ssA * ssB)
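Since this mimics Pandas' NaN-aware behaviour (pairs where either value is NaN are dropped), here's an illustrative comparison against Series.corr -

import pandas as pd

a = np.array([1., np.nan, 3., 4., 5.])
b = np.array([2., 3., np.nan, 5., 7.])
print(np.isclose(nancorrcoeff_1d(a, b), pd.Series(a).corr(pd.Series(b))))  # True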
2) Row-wise correlation between two 2D arrays of shape (m,n), giving us a 1D array of shape (m,) -
def nancorrcoeff_rowwise(A, B):
    # Input : Two 2D arrays of same shapes (m,n). Output : One 1D array (m,)
    # Get combined mask of positions valid (non-NaN) in both arrays
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum(axis=-1, keepdims=1)

    # Rowwise means over the valid entries & subtract from input arrays themselves
    A_mA = A - np.nansum(A * comb_mask, -1, keepdims=1) / count
    B_mB = B - np.nansum(B * comb_mask, -1, keepdims=1) / count

    # Replace NaNs with zeros, so that later summations could be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)

    # Finally get corr coeff
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
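A small illustrative usage, pairing rows that contain NaNs -

A = np.array([[1., np.nan, 3., 4., 5.],
              [2., 1., np.nan, 3., 4.]])
B = np.array([[2., 3., np.nan, 5., 7.],
              [1., np.nan, 2., 4., 3.]])
print(nancorrcoeff_rowwise(A, B))  # 1D array with 2 correlation values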