I am trying to calculate a correlation between two datasets in xarray along the time dimension. My dataset are both lat x lon x time. One of my datasets has enough data missing that is isn't reasonable to interpolate and eliminate gaps, instead I would like to just ignore missing values. I have some simple bits of code that are working somewhat, but none that fits my exact use case. For example:
def covariance(x,y,dims=None):
return xr.dot(x-x.mean(dims), y-y.mean(dims), dims=dims) / x.count(dims)
def correlation(x,y,dims=None):
return covariance(x,y,dims) / (x.std(dims) * y.std(dims))
works well if no data is missing but of course can't work with nans. While there is a good example written for xarray here, even with this code I am struggling to calcuate the pearson's correlation not the spearman's.
import numpy as np
import xarray as xr
import bottleneck
def covariance_gufunc(x, y):
return ((x - x.mean(axis=-1, keepdims=True))
* (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)
def pearson_correlation_gufunc(x, y):
return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))
def spearman_correlation_gufunc(x, y):
x_ranks = bottleneck.rankdata(x, axis=-1)
y_ranks = bottleneck.rankdata(y, axis=-1)
return pearson_correlation_gufunc(x_ranks, y_ranks)
def spearman_correlation(x, y, dim):
return xr.apply_ufunc(
spearman_correlation_gufunc, x, y,
input_core_dims=[[dim], [dim]],
dask='parallelized',
output_dtypes=[float])
Finally there was a useful discussion on github of adding this as a feature to xarray but it has yet to be implemented. Is there an efficient way to do this on datasets with data gaps?