
Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I want to efficiently compute, for each relation category, the mean correlation coefficient between the local-level data of all location pairs sharing that relation.

For example, I assume 100 sites and randomize their pairwise relations using category values 1 to 25, yielding the upper-triangular matrix relations:

import numpy as np

sites = 100
categ = 25

relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations) # keep only the upper triangle so each unordered site pair appears once
np.fill_diagonal(relations, 0) # ignore self-relation

I also have 5000 replicates of simulation results on each site:

sims = 5000
res = np.round(np.random.rand(sites, sims),1)

To compute the mean pairwise correlation for each relation category, I first calculate, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j, and then take the average across all pairs with relation i:

rho_list = np.ones(categ)*99 # mean correlation per relation category (99 is just a placeholder)

for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i)) # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()]) # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1)) # reshape above into pairwise comparisons of simulation results between unique site pairs

    rho = np.ones(len(idr))*99 # correlation coefficients of all unique site pairs of current relation category

    for j in range(len(idr)): # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T # shorten comparisons by removing pairs with zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]

    rho_list[i-1] = np.nanmean(rho)

Although this script works, once I increase sites = 400 the entire computation takes more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for this poor performance, and how can I optimize the algorithm?

neither-nor
    See if these help - https://stackoverflow.com/questions/30143417, https://stackoverflow.com/questions/33650188 – Divakar Jul 30 '17 at 18:13
  • @Divakar Thanks. I took a few days to study those links, which taught me a lot about vectorizing the process using `einsum`. However, I still don't see how I can apply them to my problem given that I need to compute the correlation coefficients for array pairs after I remove zero-valued results from them (see above script `comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T`). – neither-nor Aug 03 '17 at 19:12
  • `np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]` is not the same value as `np.corrcoef(comp_uniq[j][0], comp_uniq[j][1])[0,1]`. Is that intended? Your comments make it seem like removing zero values is just meant to reduce computation, not change the outcome. – Daniel F Aug 04 '17 at 10:55
  • Biggest time-saver I can find off the top is changing `np.all(comp_uniq!=0, axis=2)[j]` to `np.all(comp_uniq[j].astype(bool), axis=-1)`, so you 1) don't create the big boolean matrix each iteration and 2) save some time on useless comparisons, since `([0]).astype(bool) = ([False])`. – Daniel F Aug 04 '17 at 11:25

1 Answer


We can vectorize the innermost loop over the j iterator, using masking to take care of the ragged nature of the data processed at each iteration of that loop. We can also take out the slowish np.corrcoef and compute the correlations directly (inspired by this post). Additionally, we can optimize a few steps at the start of the outer loop, especially the stacking steps, which could be the bottlenecks.

Thus, the complete code would reduce to something like this -

for i in range(1, categ+1):
    r,c = np.where(relations==i) # site indices of all pairs with relation category i

    A = res[r] # fancy indexing copies, so res itself is never modified
    B = res[c]

    mask0 = ~((A!=0) & (B!=0)) # positions where either site has a zero-valued result
    A[mask0] = 0
    B[mask0] = 0

    count = mask0.shape[-1] - mask0.sum(-1,keepdims=1) # number of valid (both nonzero) results per pair
    A_mA = A - A.sum(-1, keepdims=1)/count # center each row on the mean of its valid results
    B_mB = B - B.sum(-1, keepdims=1)/count

    A_mA[mask0] = 0 # re-zero masked positions so they drop out of the sums below
    B_mB[mask0] = 0

    ssA = np.einsum('ij,ij->i',A_mA, A_mA) # sum of squares per pair
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)
    rho = np.einsum('ij,ij->i',A_mA, B_mB)/np.sqrt(ssA*ssB) # Pearson's r per pair

    rho_list[i-1] = np.nanmean(rho)
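
As a quick sanity check of the masking, for any single site pair the computation above should match np.corrcoef applied only to the simulation columns where both sites have a nonzero result, which is the same zero-filtering the question's inner loop performs. A minimal check along those lines (the choice of category and pair is purely illustrative):

i = 1
r, c = np.where(relations == i)
a, b = res[r[0]], res[c[0]]     # first site pair of category i
valid = (a != 0) & (b != 0)     # keep only columns where both results are nonzero
ref = np.corrcoef(a[valid], b[valid])[0, 1]
# ref should match rho[0] as computed inside the vectorized loop for i = 1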

Runtime tests

Case # 1: On given sample data, with sites = 100

In [381]: %timeit loopy_app()
1 loop, best of 3: 7.45 s per loop

In [382]: %timeit vectorized_app()
1 loop, best of 3: 479 ms per loop

15x+ speedup.

Case # 2: With sites = 200

In [387]: %timeit loopy_app()
1 loop, best of 3: 1min 56s per loop

In [388]: %timeit vectorized_app()
1 loop, best of 3: 1.86 s per loop

In [390]: 116/1.86
Out[390]: 62.36559139784946

62x+ speedup.

Case # 3: Finally with sites = 400

In [392]: %timeit vectorized_app()
1 loop, best of 3: 7.64 s per loop

This took 6hrs+ at OP's end with their loopy method.

From the timings, it became clear that vectorizing the inner loop was the key to getting noticeable speedups for large sites.
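
For reference, the einsum lines implement the mean-centered form of Pearson's correlation directly, rho = sum(A_mA*B_mB) / sqrt(sum(A_mA^2) * sum(B_mB^2)), evaluated row-wise. A scalar sketch of that same formula for a single pair of 1-D arrays (the function name is illustrative; a and b are assumed to be already restricted to the valid, nonzero columns):

def pearson(a, b):
    # mean-centered Pearson's r for two 1-D arrays
    a_m = a - a.mean()
    b_m = b - b.mean()
    return (a_m @ b_m) / np.sqrt((a_m @ a_m) * (b_m @ b_m))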

Divakar
  • Please don't bother benchmarking my (deleted now) answer. It was . . . real bad once I got a chance to check the timings. – Daniel F Aug 04 '17 at 15:50
  • @Divakar I studied your code and honestly still don't fully understand it, yet I'm amazed at how well/accurately it works. For example, the `masking` you used seems to simply replace certain pairs with 0 values, and then compute correlations from these 0-inflated arrays. However, given `a=[1,2,3,0,4,0]` and `b=[0,4,0,3,2,1]`, `np.corrcoef(a,b)[0,1]` != the zero-inflated `np.corrcoef(a0,b0)[0,1]`, where `a0 = [0,2,0,0,4,0]` and `b0 = [0,4,0,0,2,0]`. What am I misunderstanding here? – neither-nor Aug 17 '17 at 21:54
  • @Divakar I also noticed that the [older version](https://stackoverflow.com/questions/33650188/efficient-pairwise-correlation-for-two-matrices-of-features) of your script for calculating Pearson's correlation coefficient uses a slightly different 4-part breakdown of the formula than here and [here](https://stackoverflow.com/questions/30143417/computing-the-correlation-coefficient-between-two-multi-dimensional-arrays). Do you have a preference on how to numerically deconstruct the calculation? – neither-nor Aug 17 '17 at 22:52