1

I have two matrices p (500x10000) and h (500x256) and I need to calculate the correlation in Python.

In Matlab, I used the corr() function without any problem : myCorrelation = corr( p, h );

In numpy, I tried np.corrcoef( p, h ) :

  File "/usr/local/lib/python2.7/site-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

and I also tried np.correlate( p, h ) :

  File "/usr/local/lib/python2.7/site-packages/numpy/core/numeric.py", line 975, in correlate
    return multiarray.correlate2(a, v, mode)
ValueError: object too deep for desired array

Input :

pw.shape = (500, 10000)
hW.shape = (500, 256)

First, I have tried this :

myCorrelationMatrix, _ = scipy.stats.pearsonr( pw, hW )

Result :

    myCorrelationMatrix, _ = scipy.stats.pearsonr( pw, hW )
  File "/usr/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 3019, in pearsonr
    r_num = np.add.reduce(xm * ym)
ValueError: operands could not be broadcast together with shapes (500,10000) (500,256)

and tried this :

myCorrelationMatrix = corr2_coeff( pw, hW )

where corr2_coeff according to 1 is :

def corr2_coeff(A,B) :
    # Rowwise mean of input arrays & subtract from input arrays themeselves
    A_mA = A - A.mean(1)[:,None]
    B_mB = B - B.mean(1)[:,None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1);
    ssB = (B_mB**2).sum(1);

    # Finally get corr coeff
    return np.dot(A_mA,B_mB.T)/np.sqrt(np.dot(ssA[:,None],ssB[None]))

the result is this :

    myCorrelationMatrix, _ = corr2_coeff( powerTraces, hW )
  File "./myScript.py", line 175, in corr2_coeff
    return np.dot(A_mA,B_mB.T)/np.sqrt(np.dot(ssA[:,None],ssB[None]))
ValueError: shapes (500,10000) and (256,500) not aligned: 10000 (dim 1) != 256 (dim 0)

and finally tried this :

myCorrelationMatrix = corr_coeff( pw, hW )

where corr_coeff according to 2 is :

def corr_coeff(A,B) :
    # Get number of rows in either A or B
    N = B.shape[0]

    # Store columnw-wise in A and B, as they would be used at few places
    sA = A.sum(0)
    sB = B.sum(0)

    # Basically there are four parts in the formula. We would compute them one-by-one
    p1 = N*np.einsum('ij,ik->kj',A,B)
    p2 = sA*sB[:,None]
    p3 = N*((B**2).sum(0)) - (sB**2)
    p4 = N*((A**2).sum(0)) - (sA**2)

    # Finally compute Pearson Correlation Coefficient as 2D array
    pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))

    # Get the element corresponding to absolute argmax along the columns
#   out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]

    return pcorr

and the result is :

RuntimeWarning: invalid value encountered in sqrt
  pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
RuntimeWarning: invalid value encountered in divide
  pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))

Update

This is not a duplicate, I've tried both methods you gave on Computing the correlation coefficient between two multi-dimensional arrays and Efficient pairwise correlation for two matrices of features and none of them worked.

halfer
  • 19,824
  • 17
  • 99
  • 186
SebMa
  • 4,037
  • 29
  • 39
  • See if these help out - [1](https://stackoverflow.com/questions/30143417/computing-the-correlation-coefficient-between-two-multi-dimensional-arrays), [2](https://stackoverflow.com/questions/33650188/efficient-pairwise-correlation-for-two-matrices-of-features). – Divakar Jun 23 '17 at 16:46
  • @Divakar : I'm not sure it's a duplicate because my two input 2D arrays don't have the same shape like in the example given in [1](https://stackoverflow.com/a/30143754/5649639) – SebMa Jun 27 '17 at 16:33
  • @Divakar I have updated my question, can you please have a look ? – SebMa Jun 27 '17 at 17:44
  • Try feeding in the transposed versions? – Divakar Jun 27 '17 at 18:07
  • @Divakar : I also feed in the hW.T, but I got the same error messages :( – SebMa Jun 28 '17 at 10:00
  • Did you try : `corr_coeff(pw, hW)` and `corr2_coeff(pw.T, hW.T)`? – Divakar Jun 28 '17 at 10:06
  • @Divakar I didn't see your last comment because I didn't click "See more comments", but I tried transposing both and now I get : `RuntimeWarning: invalid value encountered in divide` – SebMa Jun 28 '17 at 12:46
  • It's a warning, so can you ignore it? If you do, what are the unexpected results, if any? – Divakar Jun 28 '17 at 12:47
  • @Divakar I haven't dealed with matrix since 1995, I had completely forgotten that A.B is different that B.A, which moreover does not exist if the matrix dimensions are not symmetric. Thank you for your help so far. I'm having some trouble finding the index of the max of all the maxes of resulting correlation. I'll keep you updated. – SebMa Jun 28 '17 at 13:01
  • @Divakar using `corr = corr2_coeff(pw.T, hW.T)`, the resulting matrix contains `nan`s, and the `corr.max` returns `nan` – SebMa Jun 28 '17 at 13:30
  • So, do you have `NaNs` in the input data? – Divakar Jun 28 '17 at 13:44
  • @Divakar : I don't have any `NaN`s in the input data – SebMa Jun 28 '17 at 14:22
  • So, it seems for those NaN cases, it means we have the denominator part of `p4` or `p3` as 0s, resulting in the division by 0. By the definition of correlation, I think that's undefined, so outputting NaNs for those is one way of indicating that. To put as a sample case, consider `a = np.random.rand(3,4); b = np.random.rand(3,4); a[0] = 0; b[:,0] = 0`. So, we do `corr2_coeff(a.T, b.T)`and get the first col as all NaNs. Now, can you tell us what must be in the first column if not NaNs? I would advise posting such a sample case and telling us the desired output, if it doesn't make sense to you. – Divakar Jun 28 '17 at 14:40
  • @Divakar I do have `NaNs` in `corr[0]` but NOT in the INPUT data BTW: I not using `corr_coeff` function with p1, p2, p3, p4 variables but `corr2_coeff` function in [1](https://stackoverflow.com/a/30143754/5649639) – SebMa Jun 28 '17 at 14:44
  • As I said, you can still have NaNs in output without a NaN in input. Please read my previous comment again. – Divakar Jun 28 '17 at 14:45
  • @Divakar : Your question `what must be in the first column if not NaNs?` has mislead me. I thought you were saying that I said I didn't have any `NaNs` in the output. Anyway, how can I get the index of the max of this 2D array ? – SebMa Jun 28 '17 at 14:50
  • That's a separate question. So, please post a new question. – Divakar Jun 28 '17 at 14:51
  • If the question has been answered, please rollback the title and the edits in the question that claim to not solve the question. – Divakar Jun 29 '17 at 19:19
  • @Divakar I would like to try you try your other function [Efficient pairwise correlation for two matrices of features](https://stackoverflow.com/a/33651442/5649639) to see if I can gain some execution time. So I will ask for your help tomorrow because now it's 9h30 pm, and I really need to leave the office ! – SebMa Jun 29 '17 at 19:27
  • I have edited this question a bit - please do not add meta commentary in the title. A good tip is to always leave a question in a tidy state conducive to long-term readers. – halfer Jun 30 '17 at 21:05

2 Answers2

1

In matrix product equal dimensions must be "inside" the product: A[m x n]*B[n x k]. Since correlation is sum of element-wise products, it is similar to matrix product with prior normalization. You can try transpose first or second matrix.

halfer
  • 19,824
  • 17
  • 99
  • 186
Sairus
  • 384
  • 1
  • 15
0

You can concatenate the two dataframes into one array size (500,10256) and then run np.corrcoef() on the merged array and subset to look at the correlations of the variables of interest.

It's not very efficient, but it will work.

Tkanno
  • 656
  • 6
  • 14