I am trying to randomly delete a fraction of an xr.DataArray (as described, and with the help of the answers, in this question) and to subsequently access only those values of the original data that were deleted.
This works fine as long as the data is not backed by dask, or is stored in a single dask chunk. As soon as I define chunks smaller than the total size of the data, the corresponding values in the original array are set to nan as well.
import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(5*5*5.).reshape(5,5,5), dims=('time','latitude','longitude'))
data.to_netcdf('/path/to/file.nc')

#data = xr.open_dataarray('/path/to/file.nc', chunks={'time':5})  # creates expected output
data = xr.open_dataarray('/path/to/file.nc', chunks={'time':2})  # creates observed output

def set_fraction_randomly_to_nan(data, frac_missing):
    np.random.seed(0)
    data[np.random.rand(*data.shape) < frac_missing] = np.nan
    return data

data_lost = xr.apply_ufunc(set_fraction_randomly_to_nan, data.copy(deep=True),
                           input_core_dims=[['latitude','longitude']],
                           output_core_dims=[['latitude','longitude']],
                           dask='parallelized', output_dtypes=[data.dtype],
                           kwargs={'frac_missing': 0.5})
print(data[0,-4:,-4:].values)
# >>
# [[ 6.  7.  8.  9.]
#  [11. 12. 13. 14.]
#  [16. 17. 18. 19.]
#  [21. 22. 23. 24.]]
print(data.where(np.isnan(data_lost),0)[0,-4:,-4:].values)
Expected output of the last line: keep all values where np.isnan(data_lost) is True and set the rest to zero:
# [[ 6.  0.  0.  9.]
#  [ 0.  0.  0. 14.]
#  [16.  0.  0.  0.]
#  [ 0. 22.  0. 24.]]
Observed output of the last line: the values where np.isnan(data_lost) is True come back as nan instead, and the rest are set to zero:
# [[nan  0.  0. nan]
#  [ 0.  0.  0. nan]
#  [nan  0.  0.  0.]
#  [ 0. nan  0. nan]]
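Since the only difference between the two runs is the chunk size, I suspect that the in-place assignment inside set_fraction_randomly_to_nan writes into the same numpy blocks that back data itself, though I am not sure. For what it's worth, a variant that builds a fresh array instead of mutating its input might look like the following untested sketch (the function name is my own); I do not know whether this actually addresses the root cause:

def set_fraction_randomly_to_nan_copy(data, frac_missing):
    np.random.seed(0)
    # build a boolean mask and return a new array instead of assigning into `data`
    mask = np.random.rand(*data.shape) < frac_missing
    return np.where(mask, np.nan, data)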
Any help on how to get the expected result while still being able to divide my (originally much larger) data into chunks would be highly appreciated.