
I have a large pandas DataFrame (97165 rows and 2 columns) and I'd like to calculate and save the correlation between those columns for each window of 100 rows. I want something like this:

1st correlation --> rows 0 to 99 --> corr = 0.265

2nd correlation --> rows 1 to 100 --> corr = 0.279

3rd correlation --> rows 2 to 101 --> corr = 0.287

Each value has to be stored and afterwards shown in a plot, so I have to save all these values in a list or something like that.

I have been reading the pandas documentation on rolling windows, but I was not able to achieve anything. I tried to write a simple loop to get some results, but I ran into memory problems. The code of what I tried is:

lcl = 100
a = []
for i in range(len(tabla)):
    x = tabla.iloc[i:lcl, [0]]
    y = tabla.iloc[i:lcl, [1]]
    z = x['2015_Avion'].corr(y['2015_Hotel'])
    a.append(z)
    lcl += 1

Any suggestions?

WinterZ

2 Answers

We can optimize for memory and performance by working directly with the underlying NumPy array data.

Approach #1

First off, let's write an array-based solution to get the correlation coefficient between two 1D arrays. This is basically inspired by this post and looks something like this -

import numpy as np

def corrcoeff_1d(A,B):
    # Mean of each input array & subtract from the input arrays themselves
    A_mA = A - A.mean(-1,keepdims=1)
    B_mB = B - B.mean(-1,keepdims=1)

    # Sums of squares
    ssA = np.einsum('i,i->',A_mA, A_mA)
    ssB = np.einsum('i,i->',B_mB, B_mB)

    # Finally get the corr coeff
    return np.einsum('i,i->',A_mA,B_mB)/np.sqrt(ssA*ssB)
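
As a quick sanity check (on hypothetical throwaway data, not from the question), the result should agree with np.corrcoef on the same pair of arrays:

# Hypothetical sanity check against np.corrcoef
a = np.random.rand(100)
b = np.random.rand(100)
assert np.isclose(corrcoeff_1d(a, b), np.corrcoef(a, b)[0, 1])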

Now, to use it, run the same kind of loop, but on the array data -

lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)
for i in range(N):
    out[i] = corrcoeff_1d(ar[i:i+lcl,0], ar[i:i+lcl,1])

We could optimize performance further by pre-computing with convolution the rolling mean values used for A_mA in corrcoeff_1d, but first let's get the memory error out of the way.
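
For reference, a minimal sketch of that convolution idea (my own illustration, reusing ar and lcl from the loop above; kernel, mean0 and mean1 are names I'm introducing here): np.convolve with a uniform kernel gives all the full-window rolling means in one shot.

# Sketch only: rolling means of both columns via convolution (uniform kernel).
# mean0[i] is the mean of ar[i:i+lcl, 0]; likewise mean1 for column 1.
kernel = np.ones(lcl) / lcl
mean0 = np.convolve(ar[:, 0], kernel, mode='valid')
mean1 = np.convolve(ar[:, 1], kernel, mode='valid')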

Approach #2

Here's an almost vectorized approach: we vectorize most of the iterations, except for the leftover slices at the end that don't have full window lengths. The loop count is reduced from 97165 to lcl-1, i.e. a mere 99.

lcl = 100
ar = tabla.values
N = len(ar)
out = np.zeros(N)

col0_win = strided_app(ar[:,0],lcl,S=1)
col1_win = strided_app(ar[:,1],lcl,S=1)
vectorized_out = corr2_coeff_rowwise(col0_win, col1_win)
M = len(vectorized_out)
out[:M] = vectorized_out

for i in range(M,N):
    out[i] = corrcoeff_1d(ar[i:i+lcl,0], ar[i:i+lcl,1])

Helper funcs -

# https://stackoverflow.com/a/40085052/ @Divakar
def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))
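
To see what strided_app produces, here's a tiny illustrative example (toy input, not the question's data):

# Windows of length 3 with stride 1 over a small range
strided_app(np.arange(6), L=3, S=1)
# array([[0, 1, 2],
#        [1, 2, 3],
#        [2, 3, 4],
#        [3, 4, 5]])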

# https://stackoverflow.com/a/41703623/ @Divakar
def corr2_coeff_rowwise(A,B):
    # Rowwise mean of input arrays & subtract from the input arrays themselves
    A_mA = A - A.mean(-1,keepdims=1)
    B_mB = B - B.mean(-1,keepdims=1)

    # Sums of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA, A_mA)
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)

    # Finally get the corr coeff
    return np.einsum('ij,ij->i',A_mA,B_mB)/np.sqrt(ssA*ssB)

Correlation for NaN-filled data

Next are NumPy counterparts of the pandas-based correlation computation, for the correlation between two 1D arrays and for row-wise correlation values, this time handling NaNs the way pandas does (by ignoring incomplete pairs).

1) Scalar correlation value between two 1D arrays -

def nancorrcoeff_1d(A,B):
    # Get combined mask of positions where both A and B are non-NaN
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum()

    # Mean over the valid entries & subtract from the input arrays themselves
    A_mA = A - np.nansum(A * comb_mask,-1,keepdims=1)/count
    B_mB = B - np.nansum(B * comb_mask,-1,keepdims=1)/count

    # Replace NaNs with zeros, so that the later summations can be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    ssA = np.inner(A_mA,A_mA)
    ssB = np.inner(B_mB,B_mB)

    # Finally get the corr coeff
    return np.inner(A_mA,B_mB)/np.sqrt(ssA*ssB)

2) Row-wise correlation between two 2D arrays of shape (m,n), giving a 1D array of shape (m,) -

def nancorrcoeff_rowwise(A,B):
    # Input : Two 2D arrays of the same shape (m,n). Output : One 1D array (m,)
    # Get combined mask of positions where both A and B are non-NaN
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum(axis=-1,keepdims=1)

    # Rowwise mean over the valid entries & subtract from the input arrays themselves
    A_mA = A - np.nansum(A * comb_mask,-1,keepdims=1)/count
    B_mB = B - np.nansum(B * comb_mask,-1,keepdims=1)/count

    # Replace NaNs with zeros, so that the later summations can be computed
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    # Sums of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA, A_mA)
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)

    # Finally get the corr coeff
    return np.einsum('ij,ij->i',A_mA,B_mB)/np.sqrt(ssA*ssB)
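
As a usage sketch (hypothetical data with one NaN planted; x, y and valid are names introduced here), the 1D NaN-aware version can be cross-checked against np.corrcoef restricted to the complete pairs:

# Hypothetical check against np.corrcoef over complete pairs only
x = np.random.rand(100); x[5] = np.nan
y = np.random.rand(100)
valid = ~(np.isnan(x) | np.isnan(y))
assert np.isclose(nancorrcoeff_1d(x, y), np.corrcoef(x[valid], y[valid])[0, 1])
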
Divakar
  • Thanks for your help, it's great, but I got the same issue as with the other approach: the first 77 values are good, but after that all the values are 0 or NaN. I don't know why, but it's happening with all the solutions – WinterZ Oct 16 '17 at 08:07
  • @WinterZ As I said in my comment earlier - [`Produce a sample with minimal representative data?`](https://stackoverflow.com/questions/46757318/python-generate-rolling-window-to-calculate-correlation/46757690?noredirect=1#comment80460007_46757318). – Divakar Oct 16 '17 at 08:12
  • There is a sample in bz2 format containing 10k rows: https://ufile.io/xpneh. In all of my approaches, as happens in yours, I get NaN values from correlation row 77 onwards... – WinterZ Oct 16 '17 at 08:43
  • @WinterZ So, [it seems you were able to fix it](https://stackoverflow.com/questions/46757318/python-generate-rolling-window-to-calculate-correlation/46757690#comment80478804_46757729)? Do both of the solutions work now? – Divakar Oct 16 '17 at 17:45
  • The issue was that my data was sorted, so I had a lot of recurrent values. Your solution is pretty good, but in my case I got NaN values because of this line of code: A_mA = A - A.mean(-1,keepdims=1). With, for example, the whole vector filled with 3, A_mA becomes all zeros, and the correlation formula does not work when those values are 0. Thank you a lot! – WinterZ Oct 17 '17 at 07:54
  • @WinterZ Check out the NaN solutions at the end. Use those. – Divakar Oct 19 '17 at 15:30

You mentioned trying rolling. What exactly went wrong with that? This works for me:

my_res = tabla['2015_Avion'].rolling(100).corr(tabla['2015_Hotel'])

my_res will have NaN values until its 100th value, so my_res[99] should be the correlation between rows 0 through 99 of both columns, as would be returned by the pandas corr function applied to just that subset. my_res[100] is the correlation over rows 1 through 100.
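
Since the stored values were ultimately meant to be plotted, the result can go straight into a plot (a minimal sketch, assuming matplotlib is available; the NaN head is simply skipped by the plot):

import matplotlib.pyplot as plt

my_res.plot()
plt.ylabel('rolling corr (window=100)')
plt.show()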

Antoine Zambelli
  • Most of the output values are NaN or 0, and in my table there are only integer values from 0 to 10... so I can't figure out why I get NaN/0 values. – WinterZ Oct 15 '17 at 17:22
  • Are you reading this from a csv of some kind? Is it possible some of the values have escape characters or something non-ascii that is causing issues? I would expect it to throw an error, but that depends on how corr() works. I can't recreate the problem as of now using some random ints. – Antoine Zambelli Oct 15 '17 at 17:36
  • I'm using read_pickle to read the file; the format is bz2. I was looking through the dataset but did not see anything strange. – WinterZ Oct 15 '17 at 17:40
  • I have no experience with read_pickle or bz2 in python. Try print(repr()) on some values to check if anything strange appears (print() might not catch escape characters and the like) - say on all the values in a window that gives NaN correlation. – Antoine Zambelli Oct 15 '17 at 17:51
  • If both my and Divakar 's solutions give 0/NaNs, I'd strongly recommend looking at the data itself - the problem is almost certainly there. Either checking it yourself or to mirror Divakar 's suggestion to post a sample. – Antoine Zambelli Oct 16 '17 at 08:49
  • I have been checking my data and the problem with the NaN values comes from the sort_values function of pandas... I checked my solution and there are no problems running it. Thanks! – WinterZ Oct 16 '17 at 09:16
  • Glad to hear it's fixed! – Antoine Zambelli Oct 16 '17 at 10:22