The module scipy.spatial.distance includes a metric called the correlation distance (sometimes referred to as Pearson's distance), which is simply 1 minus the Pearson correlation coefficient. By passing metric='correlation' to scipy.spatial.distance.cdist and subtracting the result from 1, you can efficiently compute Pearson's correlation coefficient for each pair of vectors in two inputs.
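For a single pair of vectors, the relationship between the distance and the coefficient can be checked directly. This is a minimal sketch using scipy.spatial.distance.correlation; the vectors u and v are small illustrative inputs chosen here, not part of the question's data.

```python
import numpy as np
from scipy.spatial.distance import correlation
from scipy.stats import pearsonr

# Small illustrative vectors (an assumption for this sketch).
u = np.array([1.0, 2.0, 3.5])
v = np.array([10.0, 20.0, 30.0])

r = pearsonr(u, v)[0]   # Pearson correlation coefficient
d = correlation(u, v)   # correlation distance between u and v

# The distance should equal 1 - r up to floating-point rounding.
print(np.isclose(1 - d, r))
```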
Here's an example. I'll modify your data so the coefficients are more varied:
In [96]: list1 = [[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]]
In [97]: list2 = [[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]]
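So that we know what to expect, here are the correlation coefficients computed using scipy.stats.pearsonr (the session assumes import numpy as np and from scipy.stats import pearsonr have already been run):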
In [98]: [pearsonr(x, y)[0] for x in list1 for y in list2]
Out[98]:
[0.99339926779878296,
0.98945694873927104,
0.56362148019067804,
-0.94491118252306794,
1.0,
0.99953863896044937,
0.65465367070797709,
-0.90112711377916588,
0.94491118252306805,
0.93453339271427294,
0.37115374447904509,
-0.99339926779878274,
0.0,
-0.030372836961539348,
-0.7559289460184544,
-0.43355498476205995]
It is more convenient to see those in an array:
In [99]: np.array([pearsonr(x, y)[0] for x in list1 for y in list2]).reshape(len(list1), len(list2))
Out[99]:
array([[ 0.99339927,  0.98945695,  0.56362148, -0.94491118],
       [ 1.        ,  0.99953864,  0.65465367, -0.90112711],
       [ 0.94491118,  0.93453339,  0.37115374, -0.99339927],
       [ 0.        , -0.03037284, -0.75592895, -0.43355498]])
Here's the same result computed using cdist:
In [100]: from scipy.spatial.distance import cdist
In [101]: 1 - cdist(list1, list2, metric='correlation')
Out[101]:
array([[ 0.99339927,  0.98945695,  0.56362148, -0.94491118],
       [ 1.        ,  0.99953864,  0.65465367, -0.90112711],
       [ 0.94491118,  0.93453339,  0.37115374, -0.99339927],
       [ 0.        , -0.03037284, -0.75592895, -0.43355498]])
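Outside an interactive session, the agreement between the two approaches can be checked in a short script. The following is a sketch that reuses the example data; the names loop_result and cdist_result are introduced here purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

list1 = np.array([[1, 2, 3.5], [4, 5, 6], [7, 8, 12], [10, 7, 10]], dtype=float)
list2 = np.array([[10, 20, 30], [41, 51, 60], [77, 80, 79], [80, 78, 56]], dtype=float)

# Reference result: one pearsonr call per (row of list1, row of list2) pair.
loop_result = np.array([[pearsonr(x, y)[0] for y in list2] for x in list1])

# Same coefficients from a single cdist call: 1 minus the correlation distance.
cdist_result = 1 - cdist(list1, list2, metric='correlation')

# The two matrices should agree up to floating-point rounding.
print(np.allclose(loop_result, cdist_result))
```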
Using cdist is much faster than calling pearsonr in a nested loop. Here I'll use two arrays, data1 and data2, each with shape (100, 10000):
In [102]: data1 = np.random.randn(100, 10000)
In [103]: data2 = np.random.randn(100, 10000)
I'll use the convenient %timeit command in IPython to measure the execution time:
In [104]: %timeit c1 = [pearsonr(x, y)[0] for x in data1 for y in data2]
1 loop, best of 3: 836 ms per loop
In [105]: %timeit c2 = 1 - cdist(data1, data2, metric='correlation')
100 loops, best of 3: 4.35 ms per loop
That's 836 ms for the nested loop versus 4.35 ms for cdist, a speedup of roughly 190 times.
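Much of the speedup comes from replacing many small Python-level calls with a few array operations. To illustrate the idea, here is a NumPy-only sketch that computes the same matrix of coefficients by centering each row, scaling it to unit length, and taking a matrix product. The helper name pairwise_pearson is hypothetical, and this is not necessarily how cdist is implemented internally. Rows with zero variance would produce nan because of the division by a zero norm.

```python
import numpy as np

def pairwise_pearson(a, b):
    """Hypothetical helper: Pearson r between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each row on its own mean.
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    # Scale each centered row to unit Euclidean length.
    a_n = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_n = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    # The dot product of unit-length centered rows is the correlation coefficient.
    return a_n @ b_n.T

# Small random inputs, chosen only to exercise the helper.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 50))
y = rng.standard_normal((7, 50))
r = pairwise_pearson(x, y)
print(r.shape)
```

The result has one row per row of the first input and one column per row of the second, matching the layout returned by 1 - cdist(..., metric='correlation').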