Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I want to efficiently compute, for each relation category, the mean correlation coefficient between the local-level data of the location pairs in that category.
For example, I assume 100 sites and assign their pairwise relations at random from the categories 1 to 25, which yields the upper-triangular matrix relations:
import numpy as np
sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations) # keep only the upper triangle: relation_ij = relation_ji, so each unordered pair is stored once
np.fill_diagonal(relations, 0) # ignore self-relation
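As a quick, optional sanity check on this setup, the matrix should now carry category values only in its strict upper triangle, and the number of unordered pairs per category can be counted directly (the name pair_counts below is just for illustration):
assert np.all(np.tril(relations) == 0) # the lower triangle (incl. diagonal) is all zeros
pair_counts = np.bincount(relations[np.triu_indices(sites, k=1)], minlength=categ + 1)[1:] # pairs per category 1..25
print(pair_counts.sum()) # 4950 unordered pairs in total for 100 sites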
I also have 5000 replicates of simulation results for each site:
sims = 5000
res = np.round(np.random.rand(sites, sims),1)
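Because the results are rounded to one decimal place, roughly 5% of the entries are exactly zero; these are the values that get filtered out pair-wise further down:
print(np.mean(res == 0)) # fraction of exact zeros, about 0.05 with this rounding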
To compute the mean pairwise correlation for each relation category, I first calculate, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j with that relation, and then take the average across all such pairs:
rho_list = np.ones(categ)*99 # placeholder for the mean correlation of each relation category
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i)) # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()]) # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1)) # reshape the above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr))*99 # correlation coefficients of all unique site pairs of the current relation category
    for j in range(len(idr)): # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T # drop replicates where either site has a zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]
    rho_list[i-1] = np.nanmean(rho)
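To make the reshape step above concrete: after the loop, idr and comp_uniq still refer to the last category, and each block comp_uniq[j] is just the two sites' full result vectors laid side by side (a and b below are only illustrative names):
a, b = idr[0] # site indices of the first pair in the last category processed
assert np.array_equal(comp_uniq[0], np.column_stack([res[a], res[b]]))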
Although this script works, once I increase sites to 400 the entire computation takes more than 6 hours to finish, which makes me question my use of the array functions. What is the reason for this poor performance, and how can I optimize the algorithm?
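For reference, the number of unique site pairs fed through the inner np.corrcoef loop grows quadratically with sites, which I assume is at least part of the cost:
for n in (100, 400):
    print(n, n * (n - 1) // 2) # 4950 vs 79800 unique site pairs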