Correlation between two matrices of different sizes in Python

Question

I have two matrices p (500x10000) and h (500x256) and I need to calculate the correlation in Python.

In Matlab, I used the corr() function without any problem : myCorrelation = corr( p, h );

In numpy, I tried np.corrcoef( p, h ) :

  File "/usr/local/lib/python2.7/site-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

and I also tried np.correlate( p, h ) :

  File "/usr/local/lib/python2.7/site-packages/numpy/core/numeric.py", line 975, in correlate
    return multiarray.correlate2(a, v, mode)
ValueError: object too deep for desired array

Input :

pw.shape = (500, 10000)
hW.shape = (500, 256)

First, I have tried this :

myCorrelationMatrix, _ = scipy.stats.pearsonr( pw, hW )

Result :

    myCorrelationMatrix, _ = scipy.stats.pearsonr( pw, hW )
  File "/usr/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 3019, in pearsonr
    r_num = np.add.reduce(xm * ym)
ValueError: operands could not be broadcast together with shapes (500,10000) (500,256)

and tried this :

myCorrelationMatrix = corr2_coeff( pw, hW )

where corr2_coeff according to 1 is :

def corr2_coeff(A,B) :
    # Rowwise mean of input arrays & subtract from input arrays themeselves
    A_mA = A - A.mean(1)[:,None]
    B_mB = B - B.mean(1)[:,None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1);
    ssB = (B_mB**2).sum(1);

    # Finally get corr coeff
    return np.dot(A_mA,B_mB.T)/np.sqrt(np.dot(ssA[:,None],ssB[None]))

the result is this :

    myCorrelationMatrix, _ = corr2_coeff( powerTraces, hW )
  File "./myScript.py", line 175, in corr2_coeff
    return np.dot(A_mA,B_mB.T)/np.sqrt(np.dot(ssA[:,None],ssB[None]))
ValueError: shapes (500,10000) and (256,500) not aligned: 10000 (dim 1) != 256 (dim 0)

and finally tried this :

myCorrelationMatrix = corr_coeff( pw, hW )

where corr_coeff according to 2 is :

def corr_coeff(A,B) :
    # Get number of rows in either A or B
    N = B.shape[0]

    # Store columnw-wise in A and B, as they would be used at few places
    sA = A.sum(0)
    sB = B.sum(0)

    # Basically there are four parts in the formula. We would compute them one-by-one
    p1 = N*np.einsum('ij,ik->kj',A,B)
    p2 = sA*sB[:,None]
    p3 = N*((B**2).sum(0)) - (sB**2)
    p4 = N*((A**2).sum(0)) - (sA**2)

    # Finally compute Pearson Correlation Coefficient as 2D array
    pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))

    # Get the element corresponding to absolute argmax along the columns
#   out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]

    return pcorr

and the result is :

RuntimeWarning: invalid value encountered in sqrt
  pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
RuntimeWarning: invalid value encountered in divide
  pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))

Update

This is not a duplicate, I've tried both methods you gave on Computing the correlation coefficient between two multi-dimensional arrays and Efficient pairwise correlation for two matrices of features and none of them worked.

See if these help out - [1](https://stackoverflow.com/questions/30143417/computing-the-correlation-coefficient-between-two-multi-dimensional-arrays), [2](https://stackoverflow.com/questions/33650188/efficient-pairwise-correlation-for-two-matrices-of-features). — Divakar, Jun 23 '17 at 16:46
@Divakar : I'm not sure it's a duplicate because my two input 2D arrays don't have the same shape like in the example given in [1](https://stackoverflow.com/a/30143754/5649639) — SebMa, Jun 27 '17 at 16:33
@Divakar I have updated my question, can you please have a look ? — SebMa, Jun 27 '17 at 17:44
@Divakar : I also feed in the hW.T, but I got the same error messages :( — SebMa, Jun 28 '17 at 10:00
Did you try : `corr_coeff(pw, hW)` and `corr2_coeff(pw.T, hW.T)`? — Divakar, Jun 28 '17 at 10:06
@Divakar I didn't see your last comment because I didn't click "See more comments", but I tried transposing both and now I get : `RuntimeWarning: invalid value encountered in divide` — SebMa, Jun 28 '17 at 12:46
It's a warning, so can you ignore it? If you do, what are the unexpected results, if any? — Divakar, Jun 28 '17 at 12:47
@Divakar I haven't dealed with matrix since 1995, I had completely forgotten that A.B is different that B.A, which moreover does not exist if the matrix dimensions are not symmetric. Thank you for your help so far. I'm having some trouble finding the index of the max of all the maxes of resulting correlation. I'll keep you updated. — SebMa, Jun 28 '17 at 13:01
@Divakar using `corr = corr2_coeff(pw.T, hW.T)`, the resulting matrix contains `nan`s, and the `corr.max` returns `nan` — SebMa, Jun 28 '17 at 13:30
So, it seems for those NaN cases, it means we have the denominator part of `p4` or `p3` as 0s, resulting in the division by 0. By the definition of correlation, I think that's undefined, so outputting NaNs for those is one way of indicating that. To put as a sample case, consider `a = np.random.rand(3,4); b = np.random.rand(3,4); a[0] = 0; b[:,0] = 0`. So, we do `corr2_coeff(a.T, b.T)`and get the first col as all NaNs. Now, can you tell us what must be in the first column if not NaNs? I would advise posting such a sample case and telling us the desired output, if it doesn't make sense to you. — Divakar, Jun 28 '17 at 14:40
@Divakar I do have `NaNs` in `corr[0]` but NOT in the INPUT data BTW: I not using `corr_coeff` function with p1, p2, p3, p4 variables but `corr2_coeff` function in [1](https://stackoverflow.com/a/30143754/5649639) — SebMa, Jun 28 '17 at 14:44
As I said, you can still have NaNs in output without a NaN in input. Please read my previous comment again. — Divakar, Jun 28 '17 at 14:45
@Divakar : Your question `what must be in the first column if not NaNs?` has mislead me. I thought you were saying that I said I didn't have any `NaNs` in the output. Anyway, how can I get the index of the max of this 2D array ? — SebMa, Jun 28 '17 at 14:50
If the question has been answered, please rollback the title and the edits in the question that claim to not solve the question. — Divakar, Jun 29 '17 at 19:19
@Divakar I would like to try you try your other function [Efficient pairwise correlation for two matrices of features](https://stackoverflow.com/a/33651442/5649639) to see if I can gain some execution time. So I will ask for your help tomorrow because now it's 9h30 pm, and I really need to leave the office ! — SebMa, Jun 29 '17 at 19:27
I have edited this question a bit - please do not add meta commentary in the title. A good tip is to always leave a question in a tidy state conducive to long-term readers. — halfer, Jun 30 '17 at 21:05

score 1 · Accepted Answer · edited Jun 30 '17 at 21:06

1

In matrix product equal dimensions must be "inside" the product: A[m x n]*B[n x k]. Since correlation is sum of element-wise products, it is similar to matrix product with prior normalization. You can try transpose first or second matrix.

edited Jun 30 '17 at 21:06

halfer

19,824
17
99
186

answered Jun 23 '17 at 17:10

Sairus

384
1
15

I did transpose the second matrix but it still does not work :( – SebMa Jun 26 '17 at 08:21
1

Try transpose the second or use matrix product or look numpy help. – Sairus Jun 26 '17 at 08:27

score 0 · Answer 2 · answered Jun 23 '17 at 16:47

0

You can concatenate the two dataframes into one array size (500,10256) and then run np.corrcoef() on the merged array and subset to look at the correlations of the variables of interest.

It's not very efficient, but it will work.

answered Jun 23 '17 at 16:47

Tkanno

656
6
14

Correlation between two matrices of different sizes in Python

Update

2 Answers2