In my experimentation so far, I've tried:

- `xr.open_dataset` with the `chunks` arg, and it loads the data into memory.
- Set up a `NetCDF4DataStore`, and call `ds['field'].values`, and it loads the data into memory.
- Set up a `ScipyDataStore` with `mmap='r'`, and `ds['field'].values` loads the data into memory.
From what I have seen, the design seems to center not around applying numpy functions directly to memory-mapped arrays, but rather around loading small chunks into memory (sometimes using memory-mapping to do so). See, for example, this comment, and a somewhat related comment here about xarray not being able to determine whether a numpy array is mmapped or not.
I'd like to be able to represent and slice data as an `xarray.Dataset`, and be able to call `.values` (or `.data`) to get an `ndarray`, but have it remain mmapped (for purposes of shared memory and so on).
It would also be nice if chunked dask operations could at least operate on the memory-mapped array until something actually needs to be mutated, which seems feasible since dask is designed around immutable arrays.
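For instance, something like the following (a rough sketch; `'file.npy'` and the chunk size are placeholders) appears to let dask slice chunks out of the memory map as needed, rather than reading the whole array up front:

```python
import numpy as np
import dask.array as da

mm = np.load('file.npy', mmap_mode='r')        # read-only memory map
darr = da.from_array(mm, chunks=(1000, 1000))  # dask wraps it without copying

# A reduction like this only touches one chunk's worth of pages at a time;
# the underlying data stays memory-mapped on disk.
total = darr.sum().compute()
```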
I did find a trick with xarray, though, which is to do the following:

```python
import numpy as np
import xarray as xr

data = np.load('file.npy', mmap_mode='r')
ds = xr.Dataset({'foo': (['dim1', 'dim2'], data)})
```
At this point, things like the following work without loading anything into memory:

```python
np.sum(ds['foo'].values)
np.sum(ds['foo'][::2, :].values)
```
...xarray apparently doesn't know that the array is mmapped, and can't afford to impose an `np.copy` for cases like these.
Is there a "supported" way to do read-only memmapping (or copy-on-write, for that matter) in xarray or dask?
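(By copy-on-write I mean the plain numpy behaviour, e.g.:)

```python
import numpy as np

# mmap_mode='c' gives copy-on-write semantics: assignments modify the
# in-memory pages but are never written back to 'file.npy'
data = np.load('file.npy', mmap_mode='c')
```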