
I am currently trying to load a big multi-dimensional array (>5 GB) into a Python script. Since I use the array as training data for a machine learning model, it is important to load the data efficiently in mini-batches while avoiding loading the whole data set into memory at once.
My idea was to use the xarray library. I load the data set with X = xarray.open_dataset("Test_file.nc"). To the best of my knowledge, this command does not load the data set into memory - so far, so good. However, I then want to convert X to an array with the command X = X.to_array().

My first question is: Does X=X.to_array() load it into memory or not?

Assuming it does not, I wonder how to best load mini-batches into memory. The shape of the array is (variable, datetime, x1_position, x2_position). I want to load mini-batches per datetime, which would lead to:

import numpy as np

ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)
mini_batch = X[:, ind]   # select BATCH_SIZE random datetime slices along the second axis

The other approach would be to transpose the array first with X.transpose("datetime", "variable", "x1_position", "x2_position") and then sample via:

ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)
mini_batch = X[ind, :]   # datetime is now the leading axis

My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?

Peter

1 Answer


My first question is: Does X=X.to_array() load it into memory or not?

xarray makes use of dask to chunk (lazily load) parts of the data into memory. You can compare the X you get from

import xarray

X = xarray.open_dataset("Test_file.nc")
# or
X = xarray.open_dataset("Test_file.nc",
                        chunks={'datetime': 1, 'x1_position': x1_count, 'x2_position': x2_count})

and look at the differences between the opened datasets (e.g. with print(X)), then specify the chunks accordingly.

The latter call chunks (loads) only one datetime slice of the data into memory at a time. I don't think you need X = X.to_array(), but you can also compare the results after to_array(). My experience is that to_array() does not change the actual chunking (loading) but only the view of the data.
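As a minimal sketch of that comparison (assuming the dimension names from the question and that the file opens as a dask-backed dataset), you could inspect the chunking before and after to_array() yourself:

import xarray

# lazily open the file with one datetime slice per chunk
# (dimension names are assumed from the question)
ds = xarray.open_dataset("Test_file.nc", chunks={"datetime": 1})
print(ds.chunks)   # mapping of dimension name -> chunk sizes

# stack the data variables into a single DataArray with a new "variable" dimension
da = ds.to_array()
print(da.dims)     # e.g. ('variable', 'datetime', 'x1_position', 'x2_position')
print(da.chunks)   # chunk sizes per dimension; the chunking itself is unchanged

If to_array() had loaded everything into numpy, da.chunks would be None; with a dask-backed dataset it should stay chunked.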

My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?

I think one goal of xarray is to let users forget the details of the underlying implementation (based on numpy). Transposing usually only modifies the view rather than the underlying layout of the data. There can certainly be some efficiency difference between the two indexing approaches, depending on which one accesses data along contiguous memory, but that difference should not be a significant overhead. Feel free to use both.
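If you want to stop worrying about axis order altogether, a sketch along these lines (reusing the da from the snippet above and the hypothetical BATCH_SIZE and n_times from the question) selects the mini-batch by dimension name instead of by position:

import numpy as np

BATCH_SIZE = 32                  # assumed batch size
n_times = da.sizes["datetime"]   # number of datetime steps in the (lazy) array

# pick BATCH_SIZE random datetime steps by name, so the selection is the
# same whether or not the array was transposed beforehand
ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)
mini_batch = da.isel(datetime=ind).load()   # .load() pulls only this batch into memory

Because isel() selects by dimension name, transposing beforehand only changes the order of the output dimensions, not which values are read from disk.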

MiniUFO