
I have a huge (~2 billion data points) xarray.DataArray. I would like to randomly delete (either mask or replace by np.nan) a given percentage of the data, where every data point has the same probability of being chosen for deletion/masking across all coordinates. I could convert the array to a numpy.array, but I would prefer to keep it in dask chunks for speed.

My data looks like this:

>> data
<xarray.DataArray 'stack-820860ba63bd07adc355885d96354267' (variable: 8, time: 228, latitude: 721, longitude: 1440)>
dask.array<stack, shape=(8, 228, 721, 1440), dtype=float64, chunksize=(1, 6, 721, 1440)>
Coordinates:
* latitude   (latitude) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 ...
* variable   (variable) <U5 u'fal' u'swvl1' u'swvl3' u'e' u'swvl2' u'es' 
* longitude  (longitude) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 
* time       (time) datetime64[ns] 2000-01-01 2000-02-01 2000-03-01 ...

I defined

frac_missing = 0.2
k = int(frac_missing*data.size)
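# with data.size = 8 * 228 * 721 * 1440 ≈ 1.9e9, this gives k ≈ 3.8e8 points to delete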

This is what I have already tried:

  • This solution works with np.ndindex, but the np.ndindex object is converted to a list, which is very slow. I tried circumventing the conversion and simply iterating over the np.ndindex object as described here and here, but iterating over the whole iterator is still too slow for ~2 billion data points.
  • np.random.choice(data.stack(newdim=('latitude','variable','longitude','time')),k,replace=False) returns the desired subset of data points, but does not set them to np.nan in the original array.

The expected output would be the xarray.DataArray with the given percentage of datapoints either set to np.nan or masked, preferably in the same shape and the same dask chunks.
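For a toy 1-D example (illustrative only; the values and the positions of the deleted points are made up), the transformation I am after is:

before = xr.DataArray([1., 2., 3., 4., 5.])
# after randomly deleting 40% of the points (k = 2), e.g.:
# <xarray.DataArray (dim_0: 5)>
# array([ 1., nan,  3., nan,  5.])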

climachine
    Does `data[np.random.rand(*data.shape) < frac_missing] = np.nan` work? I haven't used dask, but this is how you would do it in numpy. – user545424 May 22 '19 at 17:39
  • @user545424 this is an elegant solution, however, it creates a `numpy.array` of the same size as `data`, which is too slow – climachine May 23 '19 at 14:30

1 Answer


The suggestion by user545424 is an excellent start. To avoid running into memory issues, you can put it in a small user-defined function and map it over the DataArray using xr.apply_ufunc.

import xarray as xr
import numpy as np

testdata = xr.DataArray(np.empty((100, 1000, 1000)), dims=['x', 'y', 'z'])

def set_random_fraction_to_nan(data):
    # Each element is independently selected with probability 0.8.
    data[np.random.rand(*data.shape) < 0.8] = np.nan
    return data

# Set 80% of data randomly to nan.
# output_dtypes is required when dask='parallelized'.
testdata = xr.apply_ufunc(set_random_fraction_to_nan, testdata,
                          input_core_dims=[['x', 'y', 'z']], output_core_dims=[['x', 'y', 'z']],
                          dask='parallelized', output_dtypes=[testdata.dtype])
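As a quick sanity check (a sketch; testdata here is numpy-backed, so the call above evaluates eagerly), the observed fraction of nans should be close to the requested 0.8:

print(float(testdata.isnull().mean()))  # ~0.8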

For some more explanation on wrapping custom functions to work with xarray, see here.

Basileios
  • When `testdata` is used for several final data products that are only lazily evaluated at the end, their patterns of missingness do not agree, although the command was only executed once. This is quite unexpected behavior; it can be solved with a `np.random.seed(0)` in the definition of the function (see my edit). – climachine Aug 17 '19 at 11:26
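Following up on that comment, a minimal seeded variant of the function (a sketch; it assumes a fixed seed of 0 is acceptable, and note that if the function is applied once per chunk, equally-shaped chunks will draw identical masks):

def set_random_fraction_to_nan(data):
    np.random.seed(0)  # fixes the mask across repeated lazy evaluations
    data[np.random.rand(*data.shape) < 0.8] = np.nan
    return data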