I have two matrices in which the variables are the columns, and both matrices have the same number of samples (rows).

One matrix is 800 by 200, and the other is 800 by 100000. I want to compute the correlation matrix between the columns of these two matrices, so I tried this:

import numpy as np

def matcor(x, y):
    xc = x.shape[1]
    return np.corrcoef(x, y, rowvar=False)[xc:, :xc]

xy_cor = matcor(X, Y)

However, this uses a large amount of memory: I get a memory error at around 64 GB used, and the computation might need even more than that. Is there a memory-efficient way to compute this?

UberStuper
  • What are you trying to achieve? Your problem has 100200 variables, so the correlation matrix will be 100200 x 100200. Are you only interested in the correlation between the first and the second set of variables (which would make the result 200 x 100000)? – Roland W Jan 08 '17 at 22:00
  • From my understanding, the full matrix returned has the block form `[[xx, xy], [yx, yy]]`, so I only want `yx` or `xy` (which would be `yx.T`). – UberStuper Jan 08 '17 at 22:02

1 Answer

Unfortunately, the cov and corrcoef functions don't allow a direct calculation of only the xy correlation. Your current approach computes the full matrix and extracts the slice afterwards, but with 100200 variables the full matrix is 100200 by 100200, which is roughly 80 GB in float64 before any temporaries. That is too big to be tackled in full, so compute the xy part by hand instead:

samples = x.shape[0]
centered_x = x - np.sum(x, axis=0, keepdims=True) / samples
centered_y = y - np.sum(y, axis=0, keepdims=True) / samples
cov_xy = 1./(samples - 1) * np.dot(centered_x.T, centered_y)  # shape (n_x, n_y)
# per-column sample variances, used only to normalize cov_xy
var_x = 1./(samples - 1) * np.sum(centered_x**2, axis=0)
var_y = 1./(samples - 1) * np.sum(centered_y**2, axis=0)
corrcoef_xy = cov_xy / np.sqrt(var_x[:, None] * var_y[None,:])

The variances are needed only to normalize the covariance matrix into correlations. If you just wanted the covariance, the first four lines (through cov_xy) would suffice.
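
If the temporaries are still a concern, the same calculation can be done over blocks of y columns. The following is only a sketch under a few assumptions: the function name cross_corr and the chunk parameter are made up for illustration, and it relies on the fact that the 1/(samples - 1) factors cancel in the correlation, so each centered column is simply scaled to unit length before the dot product. Constant columns will produce NaN, as they do with corrcoef.

```python
import numpy as np

def cross_corr(x, y, chunk=10_000):
    """Correlate columns of x (n, p) with columns of y (n, q); returns (p, q).

    Hypothetical helper: `chunk` is the number of y columns handled at once.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    # Center the x columns and scale each to unit Euclidean norm (done once).
    xc = x - x.mean(axis=0, keepdims=True)
    xc /= np.linalg.norm(xc, axis=0, keepdims=True)
    out = np.empty((x.shape[1], y.shape[1]))
    for start in range(0, y.shape[1], chunk):
        stop = start + chunk
        # Standardize only the current block of y columns.
        yc = y[:, start:stop] - y[:, start:stop].mean(axis=0, keepdims=True)
        yc /= np.linalg.norm(yc, axis=0, keepdims=True)
        # Dot products of unit-norm centered columns are the correlations.
        out[:, start:stop] = xc.T @ yc
    return out
```

With the sizes from the question, the float64 result is 200 by 100000 (about 160 MB), and each block temporary is 800 by chunk. Note that this returns the xy block; the matcor function in the question returns yx, which is the transpose.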

Roland W