
I have two numpy arrays of identical shape M x T (let's call them A and B). I'd like to compute the Pearson correlation coefficient across T between each pair of index-matched rows in A and B (so, A[i,:] and B[i,:], then A[j,:] and B[j,:]; but never A[i,:] and B[j,:], for example).

I'm expecting my output to be either a one-dimensional array with shape (M,) or a two-dimensional array with shape (M,1).

The arrays are quite large (on the order of 1-2 million rows), so I'm looking for a vectorized solution that will let me avoid a for-loop. Apologies if this has already been answered, but many of the code snippets in previous answers (e.g., this one) seem designed to give the full M x M correlation matrix, i.e., correlation coefficients between all possible pairs of rows rather than just index-matched rows. What I am looking for is just the diagonal of this matrix. It feels wasteful to calculate the whole thing if all I need is the diagonal, and in fact it throws memory errors when I try to do that anyway.

What's the fastest way to implement this? Thanks very much in advance.

Emily Finn

1 Answer


I think I'd just use a list-comprehension and a module for calculating the coefficient:

# Import from scipy.stats; the scipy.stats.stats namespace is deprecated
from scipy.stats import pearsonr
import numpy as np

M = 10
T = 4
A = np.random.rand(M, T)
B = np.random.rand(M, T)
# One coefficient per index-matched row pair (A[i, :] with B[i, :])
diag_pear_coef = [pearsonr(A[i, :], B[i, :])[0] for i in range(M)]

Does that work for you? Note that pearsonr returns the p-value along with the correlation coefficient, hence the [0] indexing.
Good luck!
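
Since the question asks for a vectorized solution, and the list comprehension still loops in Python, here is a minimal sketch of one way to do the same thing with array operations only. It centers each row, then takes row-wise dot products with np.einsum so that the M x M matrix is never formed. The function name rowwise_pearson is hypothetical and chosen for illustration. It assumes no row is constant; a constant row has zero variance, so it yields nan.

```python
import numpy as np

def rowwise_pearson(A, B):
    """Pearson r between A[i, :] and B[i, :] for every row i; returns shape (M,).

    Hypothetical helper for illustration; assumes A and B have the same 2-D shape.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Center each row by subtracting its own mean across T.
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)
    # Row-wise sums of products: the numerator and the two squared norms.
    num = np.einsum("ij,ij->i", A_c, B_c)
    den = np.sqrt(np.einsum("ij,ij->i", A_c, A_c) * np.einsum("ij,ij->i", B_c, B_c))
    # A constant row makes den zero, which produces nan for that row.
    return num / den

rng = np.random.default_rng(0)
A = rng.random((10, 4))
B = rng.random((10, 4))
r = rowwise_pearson(A, B)  # one coefficient per index-matched row pair
```

The centered copies A_c and B_c each take as much memory as the inputs, so for millions of rows it may be worth applying the function to blocks of rows and concatenating the results.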

ShlomiF