I am currently trying to load a big multi-dimensional array (>5 GB) into a Python script. Since I use the array as training data for a machine learning model, I need to efficiently load the data in mini-batches while avoiding loading the whole data set into memory at once.
My idea was to use the xarray library.
I load the data set with X = xarray.open_dataset("Test_file.nc"). To the best of my knowledge, this command does not load the data set into memory - so far, so good. However, I want to convert X to an array with the command X = X.to_array().
My first question is: Does X = X.to_array() load it into memory or not?
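For reference, this is roughly the setup I have been experimenting with; opening the file with chunks={} so that the variables are dask-backed, and inspecting the type of X.data, are my own additions and may not be the intended way to check this:

import xarray
import dask.array

X = xarray.open_dataset("Test_file.nc", chunks={})  # chunks={} keeps the variables dask-backed
X = X.to_array()  # stacks all variables along a new "variable" dimension

# If X.data is still a dask array, to_array() has presumably not loaded everything;
# a plain numpy.ndarray would mean the full data set is now in memory.
print(type(X.data), isinstance(X.data, dask.array.Array))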
Once that is done, I wonder how best to load mini-batches into memory. The shape of the array is (variable, datetime, x1_position, x2_position). I want to load mini-batches per datetime, which would lead to:
import numpy as np

ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)
mini_batch = X[:, ind]
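What I ultimately want is to pull only that batch into memory, so (using the X and ind from above, and assuming the data has not been fully loaded yet) I would do something like:

# Positional selection along axis 1 (the datetime axis); .values should only
# materialise the selected batch as a numpy array, not the whole data set
mini_batch = X[:, ind].values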
The other approach would be to transpose the array first with X = X.transpose("datetime", "variable", "x1_position", "x2_position") (transpose returns a new object rather than modifying X in place) and then sample via:
ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)
mini_batch = X[ind, :]
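An alternative I have considered (my own guess, I am not sure it is the recommended pattern) is to select by the dimension name with isel, which should return the same batch whether or not the array has been transposed:

ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)
# Select along the named "datetime" dimension, independent of the axis order
mini_batch = X.isel(datetime=ind).values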
My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind, :] on the transposed array take as long as X[:, ind] on the original array?
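If it is easier to answer with numbers: I would compare the two variants with a small timing sketch like the one below (written against the original, untransposed X plus the n_times and BATCH_SIZE placeholders from above; the exact figures will of course depend on the file layout and chunking):

import timeit

import numpy as np

X_t = X.transpose("datetime", "variable", "x1_position", "x2_position")
ind = np.random.randint(low=0, high=n_times, size=BATCH_SIZE)

# Time indexing the datetime axis in second vs. first position
t_original = timeit.timeit(lambda: X[:, ind].values, number=10)
t_transposed = timeit.timeit(lambda: X_t[ind, :].values, number=10)
print(t_original, t_transposed)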