
I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).

What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.

I'm expecting my output to be an array with the shape N X M.

N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.

Here are some things to note:

  • The numpy function correlate requires input arrays to be one-dimensional.
  • The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
  • The scipy.stats function pearsonr requires input arrays to be one-dimensional.
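(One workaround worth sketching, not mentioned above: `np.corrcoef` handles a single stacked 2D array, so you can `vstack` the two inputs and slice out the N X M cross block. It works, but it wastefully computes the full (N+M) X (N+M) matrix. The function name `corrcoef_block` is my own.)

```python
import numpy as np

def corrcoef_block(x, y):
    """Pairwise Pearson r between rows of x (N x T) and y (M x T),
    via np.corrcoef on the stacked array. Simple, but it computes
    the full (N + M) x (N + M) matrix just to keep one corner."""
    n = x.shape[0]
    return np.corrcoef(np.vstack([x, y]))[:n, n:]

x = np.random.rand(4, 10)
y = np.random.rand(6, 10)
print(corrcoef_block(x, y).shape)  # (4, 6)
```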
abcd
  • So are you looking for `'same'`, `'full'` or the default one with `np.correlate`? Did you write the loopy version of the solution? – Divakar May 09 '15 at 18:39
  • i'm looking for `'valid'`. – abcd May 09 '15 at 18:41
  • yeah, the loopy version is trivial: `for n in range(N):` . . . `for m in range(M):` . . . `correlate(arr_one[n, :], arr_two[m, :])` . . . – abcd May 09 '15 at 18:42

3 Answers


Correlation (default 'valid' case) between two 2D arrays:

You can simply use matrix-multiplication np.dot like so -

out = np.dot(arr_one,arr_two.T)

With the default `'valid'` mode, the correlation between each pairwise row combination (row1, row2) of the two input arrays is exactly the value at position (row1, row2) of this matrix product.
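A quick check of that claim, on small random inputs of my own choosing: when the two rows have equal length, `'valid'`-mode `np.correlate` returns a single value equal to their dot product, so the whole pairwise table is one matrix multiplication -

```python
import numpy as np

arr_one = np.random.rand(3, 5)
arr_two = np.random.rand(4, 5)

out = np.dot(arr_one, arr_two.T)  # shape (3, 4)

for n in range(arr_one.shape[0]):
    for m in range(arr_two.shape[0]):
        # 'valid' mode with equal-length inputs returns one value: sum(a * v)
        expected = np.correlate(arr_one[n], arr_two[m], 'valid')[0]
        assert np.isclose(out[n, m], expected)
```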


Row-wise Correlation Coefficient calculation for two 2D arrays:

import numpy as np

def corr2_coeff(A, B):
    # Row-wise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)

    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

This is based on this solution to "How to apply corr2 functions in Multidimentional arrays in MATLAB".
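(Not part of the original answer: a standalone sanity check of `corr2_coeff` against the plain Pearson formula, with the function repeated so the snippet runs on its own.)

```python
import numpy as np

def corr2_coeff(A, B):
    # Same function as above, repeated so this check runs standalone
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

rng = np.random.default_rng(0)
A = rng.random((6, 20))
B = rng.random((4, 20))
out = corr2_coeff(A, B)

# Element-by-element comparison with the textbook Pearson formula
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        a = A[i] - A[i].mean()
        b = B[j] - B[j].mean()
        r = (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())
        assert np.isclose(out[i, j], r)
```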

Benchmarking

This section compares the runtime performance of the proposed approach against the `generate_correlation_map` and loopy `pearsonr`-based approaches listed in the other answer (timed via the function `test_generate_correlation_map()`, with the value-correctness verification code at its end stripped out). Please note that the timings for the proposed approach also include an initial check that the two input arrays have an equal number of columns, as is also done in that other answer. The runtimes are listed next.

Case #1:

In [106]: A = np.random.rand(1000, 100)

In [107]: B = np.random.rand(1000, 100)

In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop

In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop

Case #2:

In [110]: A = np.random.rand(5000, 100)

In [111]: B = np.random.rand(5000, 100)

In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop

In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop

Case #3:

In [114]: A = np.random.rand(10000, 10)

In [115]: B = np.random.rand(10000, 10)

In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop

In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop

The loopy `pearsonr`-based approach was too slow to benchmark at the sizes above, but here are the runtimes for one smaller data size -

In [118]: A = np.random.rand(1000, 100)

In [119]: B = np.random.rand(1000, 100)

In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop

In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop

In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
Divakar
  • nice. i did not realize `newaxis` was an alias for `None`. i think you're missing a `, :` from the slice into `sb1` on the second-to-last line. it'd be interesting to time our answers as compared to the double-for-loop method. – abcd May 09 '15 at 23:35
  • @dbliss That [None] was intentional, to make that a row vector and the other one was made a column vector with [:,None]. All that was required to make broadcasting come into play. Added runtime tests, check those out. – Divakar May 10 '15 at 08:55
  • nice effort, but i'm not sure how informative the timing results you report are. for example, `test_generate_correlation_map` includes both the loop method and my function `generate_correlation_map`. and, though this probably wouldn't make much of a difference, `generate_correlation_map` checks that the two inputs have the same size second dimension, whereas yours does not. that said, i think it's safe to conclude your function is faster than mine. but it may be true that a hybrid function is the best -- line-by-line timing information would speak to that. – abcd May 12 '15 at 06:20
  • @dbliss Do you mind if I include the error checking portion of your code and update the code proposed in this solution and runtimes? I didn't bother to include as the question said at the start that those two arrays have the same number of columns as T. – Divakar May 12 '15 at 06:24
  • @dbliss Updated the runtimes for the proposed approach that includes the same error-checking that you used in your approach. The error checking didn't change the runtimes from the previous runs by a big margin, which was expected. Also, I am not sure what you would be looking for in line-by-line timing information. Is there something specific in mind you have with it? – Divakar May 12 '15 at 06:43
  • it looks like you've updated `test_generate_correlation_map`'s time, and it's increased. i'm not sure what happened there. perhaps you had already cut out the call to `generate_correlation_map` in your first post? re: the line-by-line timing: you calculate your numerator and your denominator differently from how i do. it could be that your numerator calculation is faster than mine, but your denominator calculation is slower, or the other way around. – abcd May 12 '15 at 15:24
  • @dbliss I had updated all the runtimes, so all these runtimes are for a fresh run and with my code having the error-checking included (not shown in the solution though). So, if you see the revision history, everything of those runtimes must have changed a bit because of these being run at a different time. Regarding the numerator calculation, I would say mine must be faster as my method is basically a raw version of covariance calculation. The denominator looks the same as yours, so I am not hoping to see much of timing difference for the denominator part. – Divakar May 12 '15 at 15:40
  • @dbliss Hope this answers all your doubts. Just curious - Any particular reason for the downvote? – Divakar May 12 '15 at 15:40
  • hmm, i'm still confused about the timing for `test_generate_correlation_map`: did you test the function as written in my answer? if so, this timing information is not very informative, for the reason i mentioned. if not, this is not clear. sure, thanks for the question re: the downvote. it merely reflects my confusion about the timing information you report. i'm looking forward to changing it back to an upvote! :) – abcd May 12 '15 at 15:45
  • @dbliss The `test_generate_correlation_map` function I have used stops after getting out of the two nested loops, so it doesn't include any call to `generate_correlation_map`. I do understand that you had used `actual = generate_correlation_map(x, y)` and then `np.testing.assert_array_almost_equal(actual, desired)` just to verify for value correctness, so be sure that those are not part of the runtime tests for `test_generate_correlation_map`. Does this answer your question(s)? – Divakar May 12 '15 at 16:02
  • exactly, yes, well done! you do not state in your answer that you have edited the code of `test_generate_correlation_map`. hence my confusion about whether you had performed the runtime tests correctly. – abcd May 12 '15 at 16:10
  • @dbliss Updated the solution to cover all those clarifications. – Divakar May 12 '15 at 16:35
  • why not simply use Scipy's [correlate2d](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate2d.html)? – Taras Kucherenko Jun 07 '21 at 10:34

@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.

In order to calculate the correlation coefficient, a bit more is required:

import numpy as np

def generate_correlation_map(x, y):
    """Correlate each n with each m.

    Parameters
    ----------
    x : np.array
      Shape N X T.

    y : np.array
      Shape M X T.

    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.

    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must have the same number of timepoints.')
    # With ddof = n - 1, the divisor in np.std is n - ddof = 1, so s_x and
    # s_y are root sums of squared deviations, not standard deviations
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                  mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])

Here's a test of this function, which passes:

from scipy.stats import pearsonr

def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)
abcd

For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.

def pearsonr_2D(x, y):
    """Compute the Pearson correlation coefficient between
       a 1D array x and each row of a 2D array y."""

    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
    lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
    
    rho = upper / lower
    
    return rho

Example run:

>>> x
Out[1]: array([1, 2, 3])

>>> y
Out[2]: array([[ 1,  2,  3],
               [ 6,  7, 12],
               [ 9,  3,  1]])

>>> pearsonr_2D(x, y)
Out[3]: array([ 1.        ,  0.93325653, -0.96076892])
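(A quick check of that example run, mine rather than the answer's, with the function repeated so the snippet is self-contained: the first row of `y` equals `x`, so its coefficient must be exactly 1, and every coefficient must lie in [-1, 1].)

```python
import numpy as np

def pearsonr_2D(x, y):
    # Repeated from above so this check runs standalone
    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:, None]), axis=1)
    lower = np.sqrt(np.sum((x - np.mean(x)) ** 2)
                    * np.sum((y - np.mean(y, axis=1)[:, None]) ** 2, axis=1))
    return upper / lower

x = np.array([1, 2, 3])
y = np.array([[1, 2, 3], [6, 7, 12], [9, 3, 1]])
rho = pearsonr_2D(x, y)
assert np.isclose(rho[0], 1.0)      # first row of y is x itself
assert np.all(np.abs(rho) <= 1.0)   # Pearson r is bounded by [-1, 1]
```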
pr94