I'm trying to follow the tutorial in xarray's documentation: http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization

My ultimate goal is to compute a pairwise Spearman correlation matrix from a dataset with ~100,000 attributes, which is what led me to the tutorial above. I'm testing implementations on the iris dataset from sklearn, but I'm running into trouble because this style of parallelization syntax is quite different from joblib's.
I can't figure out how to get the code below to produce a pairwise Spearman correlation matrix with a resulting shape of (150, 150). I've shown an example of doing it in pandas (sketched below), but that version is not parallel and will take forever on my actual dataset. Does anyone know how to adapt this xarray code to compute symmetric pairwise correlation measures? If not, can someone point me to a better method for pairwise similarity measures? I'm aware of sklearn's pairwise_distances, but I'm wondering whether that is the only implementation.
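For reference, the non-parallel pandas approach I have in mind is essentially the following (a minimal sketch: transpose so the 150 samples become columns and call DataFrame.corr; X_iris is built in the code further below):

# Non-parallel baseline: pairwise Spearman correlation between the 150 samples.
# Transposing makes each sample a column, so .corr() compares samples pairwise.
rho_pandas = X_iris.T.corr(method="spearman")  # shape: (150, 150)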
import bottleneck
import pandas as pd
import xarray as xr
from sklearn.datasets import load_iris

# Iris as a (150 samples x 4 attributes) DataFrame with readable labels,
# then wrapped in an xarray DataArray with named dimensions.
X_iris = pd.DataFrame(
    load_iris().data,
    index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
    columns=[x.split(" (cm)")[0].replace(" ", "_") for x in load_iris().feature_names],
)
da_iris = xr.DataArray(X_iris, dims=["samples", "attributes"])

def covariance_gufunc(x, y):
    # Covariance along the last (core) axis.
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    # Spearman correlation = Pearson correlation of the ranks.
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    # Apply the gufunc over the core dimension `dim`, letting dask
    # parallelize over the remaining (broadcast) dimensions.
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask="parallelized",
        output_dtypes=[float],
    )
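To make the goal concrete, this is roughly the call I imagine, although I'm not sure it is the right way to get broadcasting to a (150, 150) result (a sketch; the renamed "samples_2" dimension name is just something I made up so that the two copies of the sample dimension broadcast against each other rather than aligning):

# Sketch of the pairwise call I'm aiming for (unsure whether this is correct):
# rename "samples" on one copy so apply_ufunc broadcasts the two sample
# dimensions against each other instead of matching them elementwise.
rho_xr = spearman_correlation(
    da_iris,
    da_iris.rename({"samples": "samples_2"}),
    dim="attributes",
)
print(rho_xr.shape)  # hoping for (150, 150)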