xr.DataArray.where sets valid points to nan when using several dask chunks

Question

I am trying to randomly delete a fraction of an xr.DataArray (as described and with the help of the answers in this question) and subsequently access only those values of the original data that were deleted.

This works fine as long as the data is not stored in dask arrays, or is stored in a single dask chunk. As soon as I define chunks smaller than the total size of the data, the original values are set to nan.

import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(5*5*5.).reshape(5,5,5), dims=('time','latitude','longitude'))
data.to_netcdf('/path/to/file.nc')
#data = xr.open_dataarray('/path/to/file.nc', chunks={'time':5}) # creates expected output
data = xr.open_dataarray('/path/to/file.nc', chunks={'time':2}) # creates observed output 

def set_fraction_randomly_to_nan(data, frac_missing):
    np.random.seed(0)
    data[np.random.rand(*data.shape) < frac_missing] = np.nan
    return data

data_lost = xr.apply_ufunc(
    set_fraction_randomly_to_nan,
    data.copy(deep=True),
    input_core_dims=[['latitude','longitude']],
    output_core_dims=[['latitude','longitude']],
    dask='parallelized',
    output_dtypes=[data.dtype],
    kwargs={'frac_missing': 0.5},
)

print(data[0,-4:,-4:].values)
# >>
# [[ 6.  7.  8.  9.]
# [11. 12. 13. 14.]
# [16. 17. 18. 19.]
# [21. 22. 23. 24.]]

print(data.where(np.isnan(data_lost),0)[0,-4:,-4:].values)

Expected output of the last line: all values where np.isnan(data_lost) is True are kept and the rest are set to zero

[[ 6.  0.  0.  9.]
[ 0.  0.  0. 14.]
[16.  0.  0.  0.]
[ 0. 22.  0. 24.]]

Observed output of the last line: all values where np.isnan(data_lost) is True are set to nan and the rest are set to zero

[[nan  0.  0. nan]
[ 0.  0.  0. nan]
[nan  0.  0.  0.]
[ 0. nan  0. nan]]

Any help in how to get the expected result while still being able to divide my (originally much larger) data into chunks is highly appreciated.

score 0 · Accepted Answer · answered Aug 18 '19 at 20:40

There isn't really a notion of "deep copying" a dask array. Dask assumes that everything you apply to a dask array is a pure function (though this isn't directly enforced), so if you map a mutating function over the blocks of a dask array you are relying on undefined behavior.
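The difference between a mutating and a pure function can be seen with plain NumPy, without dask involved. The following is only a minimal sketch; the names `mutate_in_place` and `pure_version` are hypothetical and chosen for illustration. When dask hands a block to the applied function, that block may share memory with the source data, so the first variant can silently alter the input.

```python
import numpy as np


def mutate_in_place(block, frac_missing):
    # Writes NaN directly into the array it was given (not a pure function).
    np.random.seed(0)
    block[np.random.rand(*block.shape) < frac_missing] = np.nan
    return block


def pure_version(block, frac_missing):
    # Works on a private copy, so the caller's array is left untouched.
    np.random.seed(0)
    block = block.copy()
    block[np.random.rand(*block.shape) < frac_missing] = np.nan
    return block


source = np.arange(25.0).reshape(5, 5)
reference = source.copy()

result_pure = pure_version(source, 0.5)
source_untouched = np.array_equal(source, reference)

result_mutating = mutate_in_place(source, 0.5)
source_corrupted = bool(np.isnan(source).any())
```

After the pure call, `source` still equals `reference`; after the mutating call, `source` itself contains NaN values, which is the same effect seen in the question.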

The fix is to do the copy inside the applied function, e.g.,

def set_fraction_randomly_to_nan(data, frac_missing):
    np.random.seed(0)
    data = data.copy()
    data[np.random.rand(*data.shape) < frac_missing] = np.nan
    return data
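
For reference, here is a sketch of the whole pipeline from the question with this fix applied. It assumes xarray and dask are installed, and it chunks an in-memory array with `.chunk()` instead of reading from a netCDF file so that the example is self-contained; the variable names `original` and `deleted_only` are made up for illustration.

```python
import numpy as np
import xarray as xr


def set_fraction_randomly_to_nan(block, frac_missing):
    np.random.seed(0)
    block = block.copy()  # copy inside the function so the input block is never mutated
    block[np.random.rand(*block.shape) < frac_missing] = np.nan
    return block


original = xr.DataArray(
    np.arange(5 * 5 * 5.0).reshape(5, 5, 5),
    dims=("time", "latitude", "longitude"),
)
# Several chunks along 'time'; the core dimensions stay in a single chunk.
data = original.chunk({"time": 2})

data_lost = xr.apply_ufunc(
    set_fraction_randomly_to_nan,
    data,
    input_core_dims=[["latitude", "longitude"]],
    output_core_dims=[["latitude", "longitude"]],
    dask="parallelized",
    output_dtypes=[data.dtype],
    kwargs={"frac_missing": 0.5},
)

# Keep the original values where data_lost is NaN, set everything else to zero.
deleted_only = data.where(np.isnan(data_lost), 0).compute()
print(deleted_only[0, -4:, -4:].values)
```

Because the function no longer writes into its input, the outer `data.copy(deep=True)` from the question is not needed here.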