Calculate correlation in xarray with missing data

Question

I am trying to calculate a correlation between two datasets in xarray along the time dimension. My dataset are both lat x lon x time. One of my datasets has enough data missing that is isn't reasonable to interpolate and eliminate gaps, instead I would like to just ignore missing values. I have some simple bits of code that are working somewhat, but none that fits my exact use case. For example:

def covariance(x,y,dims=None):
    return xr.dot(x-x.mean(dims), y-y.mean(dims), dims=dims) / x.count(dims)

def correlation(x,y,dims=None):
    return covariance(x,y,dims) / (x.std(dims) * y.std(dims))

works well if no data is missing but of course can't work with nans. While there is a good example written for xarray here, even with this code I am struggling to calcuate the pearson's correlation not the spearman's.

import numpy as np
import xarray as xr
import bottleneck

def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])

Finally there was a useful discussion on github of adding this as a feature to xarray but it has yet to be implemented. Is there an efficient way to do this on datasets with data gaps?

score 5 · Answer 1 · answered Feb 22 '20 at 13:39

I've been following this Github discussion and the subsequent attempts to implement a .corr() method, seems like we're pretty close but it's still not there yet.

In the meantime, the basic code which most are attempting to merge is outlined pretty well in this other answer (How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?). It's a good solution which leverages vectorized operations in NumPy and with some small tweaking (see accepted answer in the link) can be made to account for NaNs along the time axis.

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time,lat,lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time = -lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x,y = xr.align(x,y)

if lagy!=0:
    y   = y.shift(time = -lagy).dropna(dim='time')
    x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor   = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope     = cov/(xstd**2)
intercept = ymean - xmean*slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval   = t.sf(tstats, n-2)*2
pval   = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr

Hope this helps! Fingers crossed the merge comes soon for this.

The PR for `xr.cov` and `xr.corr` has been merged at https://github.com/pydata/xarray/pull/4089, should be out in the upcoming xarray v0.16.0. — weiji14, May 28 '20 at 03:32
Does this function lag_linregress_3d account for lagged correlation too? Say if you want to do a lagged correlations analysis with a time series a multidimensional array? Is there a built in xarray function to perform lagged correlations? Thanks! — wabash, Sep 07 '21 at 18:42

score 0 · Answer 2 · answered Jul 31 '20 at 18:09

The solution is in the github thread https://github.com/pydata/xarray/issues/1115

def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)

    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)
    
    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))

Calculate correlation in xarray with missing data

2 Answers2

The solution is in the github thread https://github.com/pydata/xarray/issues/1115