
I tried to calculate Pearson's correlation coefficient between every pair of rows from two 2D arrays, and then to sort the rows/columns of the correlation matrix based on its diagonal elements. First, the correlation coefficient matrix (i.e., 'ccmtx') was calculated from one random matrix (i.e., 'randmtx') in the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

def correlation_map(x, y):
    # correlation between every row of x and every row of y via scipy's pearsonr
    n_row_x = x.shape[0]
    n_row_y = y.shape[0]
    ccmtx_xy = np.empty((n_row_x, n_row_y))
    for n in range(n_row_x):
        for m in range(n_row_y):
            ccmtx_xy[n, m] = pearsonr(x[n, :], y[m, :])[0]

    return ccmtx_xy

randmtx = np.random.randn(100, 1000) # generating random matrix
#ccmtx = np.corrcoef(randmtx, randmtx) # cc matrix based on numpy.corrcoef
ccmtx = correlation_map(randmtx, randmtx) # cc matrix based on scipy pearsonr
#
ccmtx_diag = np.diagonal(ccmtx)
#
ids, vals = np.argsort(ccmtx_diag, kind = 'mergesort'), np.sort(ccmtx_diag, kind = 'mergesort')
#ids, vals = np.argsort(ccmtx_diag, kind = 'quicksort'), np.sort(ccmtx_diag, kind = 'quicksort')

plt.plot(ids)
plt.show()

plt.plot(ccmtx_diag[ids])
plt.show()

vals[0]

The issue here is that when 'pearsonr' was used, the diagonal elements of 'ccmtx' are exactly 1.0, which makes sense. However, when 'corrcoef' was used, the diagonal elements of 'ccmtx' are not exactly one (slightly less than 1 for some diagonal elements), seemingly due to floating-point precision error.

I found it annoying that the auto-correlation matrix of a single matrix has diagonal elements that are not 1.0, since this resulted in the shuffling of rows/columns of the correlation matrix when it is sorted based on the diagonal elements.
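For illustration, the discrepancy can be reproduced with a small check along these lines (the shape (5, 1000) here is arbitrary):

import numpy as np
from scipy.stats import pearsonr

a = np.random.randn(5, 1000)

# pearsonr of each row with itself
print([pearsonr(row, row)[0] for row in a])

# diagonal of np.corrcoef on the same rows; tiny deviations from 1.0
# (on the order of 1e-16) may show up, depending on the NumPy version
print(np.diagonal(np.corrcoef(a)) - 1.0)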

My questions are:

[1] Is there any good way to accelerate the computation if I stick with the 'pearsonr' function? (e.g., a vectorized pearsonr?)

[2] Is there any good way/practice to prevent this precision error when using 'corrcoef' in numpy? (e.g., the 'decimals' option in np.around?)

I have searched for correlation coefficient calculations between all pairs of rows or columns of two matrices. However, as the algorithms contain some sort of "cov / variance" operation, this kind of precision issue seems to always exist.

Minor point: the 'mergesort' option seems to provide more reliable results than 'quicksort', as quicksort shuffled a 1D array of exactly-1.0 values into random order.
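As a small illustration of this point: 'mergesort' is a stable sort, so tied values keep their original index order, whereas 'quicksort' is not stable (the array of ones below is just a stand-in for a diagonal of exact 1.0 values):

import numpy as np

# a 1D array where every element is exactly 1.0, i.e. all comparisons are ties
ties = np.ones(8)

# stable sort: tied elements keep their original index order
print(np.argsort(ties, kind='mergesort'))   # [0 1 2 3 4 5 6 7]

# quicksort is not stable: the index order of tied elements is not guaranteed
print(np.argsort(ties, kind='quicksort'))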

Any thoughts/comments would be greatly appreciated!

JHL
  • BTW, here are the versions: "np.__version__ '1.12.1' " "scipy.__version__ '0.19.0' " – JHL Jul 19 '17 at 20:00
  • [Vectorized `pearsonr`](https://stackoverflow.com/a/33651442). – Divakar Jul 19 '17 at 20:03
  • This code also seems to have a precision error when the autocorrelation of a single vector/array is calculated (i.e., cc != 1.0). – JHL Jul 19 '17 at 20:43
  • Add an axis. So, use `np.atleast_2d` probably. – Divakar Jul 19 '17 at 20:49
  • FYI: Have you looked closely at what `ccmtx = np.corrcoef(randmtx, randmtx)` computes? Passing the second `randmtx` results in redundant calculations. Double-check the docstring, and take a look at `np.corrcoef(a)` and `np.corrcoef(a, a)` for a small array `a` (e.g. shape (4, 3)). – Warren Weckesser Jul 20 '17 at 01:53
  • For vectorized Pearson: https://stackoverflow.com/questions/38943055/efficiency-issues-with-finding-correlations-between-lists-inside-lists/38946645#38946645 – Warren Weckesser Jul 20 '17 at 02:00

1 Answer


For question 1 (a vectorized pearsonr), see the comments to the question.
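For completeness, one way such a vectorized version can be sketched (following the approach in the linked answers; the name pearsonr_rows is just for this sketch) is to center and normalize each row and take a matrix product:

import numpy as np

def pearsonr_rows(x, y):
    # subtract each row's mean, then scale to unit norm, so the dot
    # product of two rows is their Pearson correlation coefficient
    xm = x - x.mean(axis=1, keepdims=True)
    ym = y - y.mean(axis=1, keepdims=True)
    xn = xm / np.linalg.norm(xm, axis=1, keepdims=True)
    yn = ym / np.linalg.norm(ym, axis=1, keepdims=True)
    return np.dot(xn, yn.T)

Note that this shares the same floating-point caveat on the diagonal as np.corrcoef, so the discussion below still applies.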

I will answer only question 2: how to improve the precision of np.corrcoef.

The correlation matrix R is computed from the covariance matrix C according to

R_ij = C_ij / sqrt(C_ii * C_jj).

The implementation is optimized for performance and memory usage. It computes the covariance matrix and then performs two divisions, by sqrt(C_ii) and by sqrt(C_jj). This separate square-rooting is where the imprecision comes from. For example:

np.sqrt(3 * 3) - 3 == 0.0

np.sqrt(3) * np.sqrt(3) - 3  == -4.4408920985006262e-16

We can fix this by implementing our own simple corrcoef routine:

def corrcoef(a, b):
    c = np.cov(a, b)   # covariance of the stacked rows of a and b
    d = np.diag(c)     # the variances C_ii
    return c / np.sqrt(d[:, None] * d[None, :])   # divide by the sqrt of the product C_ii * C_jj

Note that this implementation requires more memory than the numpy implementation because it needs to store a temporary matrix of size n * n, and it is slightly slower because it needs to do n^2 square roots instead of only 2n.
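A quick sanity check of this idea, using a random matrix shaped like the one in the question (as noted in the comments, passing the same matrix twice yields a 200 x 200 result):

import numpy as np

randmtx = np.random.randn(100, 1000)

cc = corrcoef(randmtx, randmtx)   # the corrcoef defined above; 200 x 200, like np.corrcoef(randmtx, randmtx)
# each diagonal entry is C_ii / np.sqrt(C_ii * C_ii), which should come out as exactly 1.0
print((np.diagonal(cc) == 1.0).all())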

MB-F