
If I have two small lists and I want to find the correlation between each list inside list1 and each list inside list2, I can do this:

from scipy.stats import pearsonr

list1 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
list2 = [[10,20,30],[40,50,60],[77,78,79],[80,78,56]]

corrVal = []
for i in list1:
    for j in list2:
        # pearsonr returns (r, p-value); keep only the coefficient
        corrVal.append(pearsonr(i,j)[0])

print(corrVal)

OUTPUT: [1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588, 1.0, 1.0, 1.0, -0.90112711377916588]

Which works out fine...just about. (EDIT: Just noticed my correlation outputs above seem to give the correct answer, but they repeat 4 times. Not exactly sure why it's doing that.)

However, for larger datasets with 1000s of values in the lists, my code freezes indefinitely and outputs no errors, forcing me to force-quit my IDE every time. Any ideas where I'm slipping up here? I'm not sure whether there is an inherent limit to how much the pearsonr function can handle or whether my coding is causing the problem.

bwrr
  • Check out correlation using `pandas`: http://pandas.pydata.org/pandas-docs/stable/computation.html#correlation – Nehal J Wani Aug 14 '16 at 14:25
  • That freezing might just be because it's taking too long. So, to vectorize things, take a look here : [`Computing the correlation coefficient between two multi-dimensional arrays`](http://stackoverflow.com/questions/30143417/computing-the-correlation-coefficient-between-two-multi-dimensional-arrays). – Divakar Aug 14 '16 at 20:28
  • *" Not exactly sure why its doing that"*. That's because `pearsonr([1, 2, 3], [10, 20, 30])` equals `pearsonr([4, 5, 6], [10, 20, 30])`. Both those values are 1. If your data was more varied than sequences of the form `[m, m+k, m+2*k]`, you would see different correlation coefficients. – Warren Weckesser Aug 14 '16 at 20:31
  • ...or did you expect just four values, namely `pearsonr([1, 2, 3], [10, 20, 30])`, `pearsonr([4, 5, 6], [40, 50, 60])`, `pearsonr([7, 8, 9], [77, 78, 79])` and `pearsonr([10, 11, 12], [80, 78, 56])`? I.e. `[pearsonr(v1, v2)[0] for v1, v2 in zip(list1, list2)]` – Warren Weckesser Aug 14 '16 at 21:28

2 Answers


The scipy module scipy.spatial.distance includes a distance function known as Pearson's distance, which is simply 1 minus the correlation coefficient. By using the argument metric='correlation' in scipy.spatial.distance.cdist, you can efficiently compute Pearson's correlation coefficient for each pair of vectors in two inputs.

Here's an example. I'll modify your data so the coefficients are more varied:

In [96]: list1 = [[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]]

In [97]: list2 = [[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]]

So that we know what to expect, here are the correlation coefficients computed using scipy.stats.pearsonr:

In [98]: [pearsonr(x, y)[0] for x in list1 for y in list2]
Out[98]: 
[0.99339926779878296,
 0.98945694873927104,
 0.56362148019067804,
 -0.94491118252306794,
 1.0,
 0.99953863896044937,
 0.65465367070797709,
 -0.90112711377916588,
 0.94491118252306805,
 0.93453339271427294,
 0.37115374447904509,
 -0.99339926779878274,
 0.0,
 -0.030372836961539348,
 -0.7559289460184544,
 -0.43355498476205995]

It is more convenient to see those in an array:

In [99]: np.array([pearsonr(x, y)[0] for x in list1 for y in list2]).reshape(len(list1), len(list2))
Out[99]: 
array([[ 0.99339927,  0.98945695,  0.56362148, -0.94491118],
       [ 1.        ,  0.99953864,  0.65465367, -0.90112711],
       [ 0.94491118,  0.93453339,  0.37115374, -0.99339927],
       [ 0.        , -0.03037284, -0.75592895, -0.43355498]])

Here's the same result computed using cdist:

In [100]: from scipy.spatial.distance import cdist

In [101]: 1 - cdist(list1, list2, metric='correlation')
Out[101]: 
array([[ 0.99339927,  0.98945695,  0.56362148, -0.94491118],
       [ 1.        ,  0.99953864,  0.65465367, -0.90112711],
       [ 0.94491118,  0.93453339,  0.37115374, -0.99339927],
       [ 0.        , -0.03037284, -0.75592895, -0.43355498]])

Using cdist is much faster than calling pearsonr in a nested loop. Here I'll use two arrays, data1 and data2, each with shape (100, 10000):

In [102]: data1 = np.random.randn(100, 10000)

In [103]: data2 = np.random.randn(100, 10000)

I'll use the convenient %timeit command in IPython to measure the execution time:

In [104]: %timeit c1 = [pearsonr(x, y)[0] for x in data1 for y in data2]
1 loop, best of 3: 836 ms per loop

In [105]: %timeit c2 = 1 - cdist(data1, data2, metric='correlation')
100 loops, best of 3: 4.35 ms per loop

That's 836 ms for the nested loop, and 4.35 ms for cdist.
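
As a quick sanity check, the two approaches should agree to floating point precision, since the correlation distance is exactly 1 minus Pearson's r (this check is a sketch; it reuses the data1 and data2 arrays from above):

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

c1 = np.array([pearsonr(x, y)[0] for x in data1 for y in data2]).reshape(len(data1), len(data2))
c2 = 1 - cdist(data1, data2, metric='correlation')
print(np.allclose(c1, c2))  # expected: True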

Warren Weckesser

Doing the same calculation, but collecting the values in a 4x4 array:

In [1791]: res=np.zeros((4,4))
In [1792]: for i in range(4):
      ...:     for j in range(4):
      ...:         res[i,j]=stats.pearsonr(list1[i],list2[j])[0]
      ...:         
In [1793]: res
Out[1793]: 
array([[ 1.        ,  1.        ,  1.        , -0.90112711],
       [ 1.        ,  1.        ,  1.        , -0.90112711],
       [ 1.        ,  1.        ,  1.        , -0.90112711],
       [ 1.        ,  1.        ,  1.        , -0.90112711]])

All the sublists are perfectly linearly related (each has the form [m, m+k, m+2*k]), except the last sublist of list2, which is why every coefficient is 1 apart from the final column.

I can see where this will slow down as the two lists become longer: the number of pearsonr calls grows as the product of their lengths. I don't know how sensitive the pearsonr calculation itself is to the length of the inputs.

The pearsonr code looks straightforward, without loops to slow it down. It might be faster if you could skip the p value; converting each sublist to a zero-mean array beforehand might also reduce calculation time, as in the sketch below.
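
Here is a minimal sketch of that idea (the helper name pearson_matrix is mine, not scipy's): centering and normalizing each row up front turns every pairwise Pearson coefficient into a plain dot product, so NumPy can compute the whole matrix at once and no p values are ever calculated.

import numpy as np

def pearson_matrix(A, B):
    # Center each row, then scale it to unit length; Pearson's r between
    # two rows is then just the dot product of the prepared rows.
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T  # shape (len(A), len(B)) matrix of coefficients

pearson_matrix(list1, list2) should reproduce the 4x4 array above.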

If the two lists are actually the same data (so the result matrix is symmetric), changing the j iteration to avoid recalculating the lower triangle would also help:

for j in range(i,4):
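
A minimal sketch of that symmetric case, assuming list1 is being correlated against itself (the mirror assignment is the only addition):

res = np.zeros((4,4))
for i in range(4):
    for j in range(i,4):
        r = stats.pearsonr(list1[i], list1[j])[0]
        res[i,j] = r
        res[j,i] = r  # copy into the lower triangle instead of recomputing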
hpaulj