I'm trying to follow the tutorial in xarray's documentation: http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization

My ultimate goal is to compute a pairwise Spearman correlation matrix from a dataset with ~100,000 attributes, which is what led me to the tutorial above. I'm testing implementations on the iris dataset from sklearn, but I'm running into trouble because this style of parallelization syntax is quite different from joblib's.
I can't figure out how to get the code below to produce a pairwise Spearman correlation matrix with a resulting shape of (150, 150). I've shown an example of doing it in pandas (sketched below), but that version is not parallel and will take forever on my actual dataset. Does anyone know how to adapt this xarray code to compute symmetric pairwise correlation measures? If not, can someone point me to a better method for pairwise similarity measures? I'm aware of sklearn's pairwise_distances, but I'm wondering whether that is the only implementation.
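For reference, the non-parallel pandas approach I have in mind is essentially the following (a minimal sketch: transpose so the 150 samples become columns and call DataFrame.corr; X_iris is built in the code further below):

# Non-parallel baseline: pairwise Spearman correlation between the 150 samples.
# Transposing makes each sample a column, so .corr() compares samples pairwise.
rho_pandas = X_iris.T.corr(method="spearman")  # shape: (150, 150)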
import bottleneck
import pandas as pd
import xarray as xr
from sklearn.datasets import load_iris

# Iris as a (150 samples x 4 attributes) DataFrame with readable labels,
# then wrapped in an xarray DataArray with named dimensions.
X_iris = pd.DataFrame(
    load_iris().data,
    index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
    columns=[x.split(" (cm)")[0].replace(" ", "_") for x in load_iris().feature_names],
)
da_iris = xr.DataArray(X_iris, dims=["samples", "attributes"])

def covariance_gufunc(x, y):
    # Covariance along the last (core) axis.
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    # Spearman correlation = Pearson correlation of the ranks.
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    # Apply the gufunc over the core dimension `dim`, letting dask
    # parallelize over the remaining (broadcast) dimensions.
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask="parallelized",
        output_dtypes=[float],
    )
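To make the goal concrete, this is roughly the call I imagine, although I'm not sure it is the right way to get broadcasting to a (150, 150) result (a sketch; the renamed "samples_2" dimension name is just something I made up so that the two copies of the sample dimension broadcast against each other rather than aligning):

# Sketch of the pairwise call I'm aiming for (unsure whether this is correct):
# rename "samples" on one copy so apply_ufunc broadcasts the two sample
# dimensions against each other instead of matching them elementwise.
rho_xr = spearman_correlation(
    da_iris,
    da_iris.rename({"samples": "samples_2"}),
    dim="attributes",
)
print(rho_xr.shape)  # hoping for (150, 150)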