I would like to compute the correlation between each column vector of matrix A with each column vector of matrix B.

Consider:

import numpy as np

vectorsize = 777
A = np.random.rand(vectorsize, 64)
B = np.random.rand(vectorsize, 36)
corr = np.corrcoef(A, B, rowvar=False)

The output of np.corrcoef in this case will be a 100x100 matrix. What does this mean?

Intuitively I'd expect to get a 64x36 matrix.

Daniel
  • The duplicate question does show me how to do what I want, but it *does not bring me any closer to understanding the output of np.corrcoef*, which was my original question. – Daniel Oct 20 '17 at 09:33
  • Yup, let's wait for NumPy gurus to write up the explanation. – Divakar Oct 20 '17 at 09:35

2 Answers

6

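When corrcoef is given two arrays x and y, it stacks them into a single array of variables (vertically if rowvar is True, horizontally if rowvar is False). With rowvar=False both arrays are transposed first, so the vertical stack in the code below is equivalent to placing the original columns side by side. The relevant lines are in the source of np.cov, which corrcoef calls: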

if y is not None:
    y = array(y, copy=False, ndmin=2, dtype=dtype)
    if not rowvar and y.shape[0] != 1:
        y = y.T
    X = np.vstack((X, y))

In statistical terms, it treats A as having 64 variables (in columns, since rowvar is False) and B as having 36. Stacking them gives 100 variables, hence the 100 by 100 correlation matrix.

A correlation matrix is always symmetric (and positive semidefinite). If you only want the correlations between the x variables and the y variables, they are in an off-diagonal block of size 64 by 36, which you can extract with slicing. Here's the structure of the output:

 corr(x, x), size 64 by 64  |  corr(x, y), size 64 by 36
 ---------------------------+---------------------------
 corr(y, x), size 36 by 64  |  corr(y, y), size 36 by 36
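
To make the slicing concrete, here is a minimal sketch that pulls out the upper-right corr(x, y) block. The names n_a, cross and check, and the seeded generator, are only illustrative and not part of the question's code:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
vectorsize = 777
A = rng.random((vectorsize, 64))
B = rng.random((vectorsize, 36))

corr = np.corrcoef(A, B, rowvar=False)  # shape (100, 100)

n_a = A.shape[1]          # number of variables (columns) in A
cross = corr[:n_a, n_a:]  # upper-right block corr(x, y), shape (64, 36)

# cross[i, j] is the correlation between column i of A and column j of B;
# spot-check one entry against a direct two-vector call
check = np.corrcoef(A[:, 3], B[:, 5])[0, 1]
print(cross.shape, np.isclose(cross[3, 5], check))
```

The lower-left block corr[n_a:, :n_a] is the transpose of cross, since the full matrix is symmetric.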
4

Because rowvar=False, each column is treated as a variable, so corrcoef computes the Pearson correlation coefficient between every pair of columns taken from A and B together: A with A, A with B, B with A, and B with B. It is the same as concatenating the two matrices horizontally and computing the correlations between the columns of the result, as below:

# C has the same number of rows as A and B, and C.shape[1] == A.shape[1] + B.shape[1]
C = np.hstack([A, B])

corr_C = np.corrcoef(C, rowvar=False)

np.allclose(corr_C, corr)   # Returns True
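
If only the 64 by 36 cross block is needed, it can also be computed without building the full 100 by 100 matrix. The following is a rough sketch, not how np.corrcoef works internally: cross_corr is a hypothetical helper that centers each column, scales it to unit norm, and takes dot products. It assumes no column is constant, since a constant column would cause a division by zero.

```python
import numpy as np

def cross_corr(A, B):
    """Pearson correlation of each column of A with each column of B.

    Hypothetical helper for illustration; assumes no column is constant.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Center every column on its mean
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    # Scale every centered column to unit Euclidean norm
    A_n = A_c / np.linalg.norm(A_c, axis=0)
    B_n = B_c / np.linalg.norm(B_c, axis=0)
    # Dot products of unit-norm centered columns are Pearson coefficients
    return A_n.T @ B_n  # shape (A.shape[1], B.shape[1])

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
A = rng.random((777, 64))
B = rng.random((777, 36))

direct = cross_corr(A, B)
full = np.corrcoef(A, B, rowvar=False)
print(direct.shape, np.allclose(direct, full[:64, 64:]))
```

The result should agree with the off-diagonal block of the full matrix up to floating-point error.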
xboard