
In my experimentation so far, I've tried:

  • xr.open_dataset with the chunks argument, but it loads the data into memory.
  • Setting up a NetCDF4DataStore and calling ds['field'].values, which loads the data into memory.
  • Setting up a ScipyDataStore with mmap='r', where ds['field'].values again loads the data into memory (a rough sketch of these attempts follows below).
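For reference, here is a rough sketch of those attempts. The file name is a placeholder, and the store constructors' arguments vary across xarray versions, so treat the exact calls as illustrative:

import xarray as xr

# (1) chunks= gives a dask-backed dataset, but .values still realizes the array
ds = xr.open_dataset('file.nc', chunks={'dim1': 1000})
vals = ds['field'].values            # ends up fully in memory

# (2)/(3) explicit backend stores (similarly for ScipyDataStore with mmap enabled)
store = xr.backends.NetCDF4DataStore.open('file.nc')
ds = xr.open_dataset(store)
vals = ds['field'].values            # ends up fully in memory again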

From what I have seen, the design seems to center not on applying numpy functions to memory-mapped arrays directly, but on loading small chunks into memory (sometimes using memory-mapping to do so). For example, this comment, and a somewhat related comment here about xarray not being able to determine whether a numpy array is mmapped or not.

I'd like to be able to represent and slice data as an xarray.Dataset, and be able to call .values (or .data) to get an ndarray, but have it remain mmapped (for purposes of shared-memory and so on).

It would also be nice if chunked dask operations could at least operate on the memory-mapped array until they actually need to mutate something, which seems possible since dask appears to be designed around immutable arrays.
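As a point of comparison, plain dask.array can already be pointed at a memmap and will read through it chunk by chunk. A minimal sketch, assuming a 2-D file.npy:

import numpy as np
import dask.array as da

mm = np.load('file.npy', mmap_mode='r')         # read-only memory map
darr = da.from_array(mm, chunks=(1000, 1000))   # lazy, chunked wrapper
total = darr.sum().compute()                    # each chunk is read through the memmap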

I did find a trick with xarray, though, which is to do the following:

import numpy as np
import xarray as xr

data = np.load('file.npy', mmap_mode='r')    # read-only memory map
ds = xr.Dataset({'foo': (['dim1', 'dim2'], data)})

At this point, things like the following work without loading anything into memory:

np.sum(ds['foo'].values)
np.sum(ds['foo'][::2,:].values)

...presumably because xarray doesn't know that the array is mmapped, and can't afford to impose an np.copy for cases like these.
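One way I can check that no copy is being made is to confirm that the result still shares memory with the memmap (np.shares_memory is a plain numpy check, nothing xarray-specific):

vals = ds['foo'].values
np.shares_memory(data, vals)    # True, i.e. .values did not copy the data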

Is there a "supported" way to do read-only memmapping (or copy-on write for that matter) in xarray or dask?

1 Answer


xr.open_dataset with chunks= should not immediately load data into memory; rather, it creates a dask.array, which evaluates lazily.

testfile = '/Users/mdurant/data/smith_sandwell_topo_v8_2.nc'
arr = xr.open_dataset(testfile, chunks={'latitude': 6336//11, 'longitude': 10800//15}).ROSE
arr 

<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>
dask.array</Users/mdurant/data/smith_sandwell_topo_v8_2.nc:/ROSE, shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>
Coordinates:
  * longitude  (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...
  * latitude   (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...
Attributes:
    long_name: Topography and Bathymetry ( 8123m -> -10799m)
    units: meters
    valid_range: [-32766 32767]
    unpacked_missing_value: -32767.0

(note the dask.array in the above)

Many xarray operations on this may be lazy and work chunkwise (and if you slice, only the required chunks are loaded):

arr.sum()

<xarray.DataArray 'ROSE' ()> dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>

arr.sum().values    # evaluates
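Slicing stays lazy in the same way, and only the chunks overlapping the selection are read once you evaluate (the bounds below are arbitrary):

sub = arr[:600, :720]     # still a lazy, dask-backed DataArray
sub.mean()                # builds a graph, reads nothing yet
sub.mean().values         # evaluates, reading only the needed chunks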

This is not the same as memory mapping, however, so I appreciate that this may not answer your question.

With dask's threaded scheduler, in-memory values are available to all the workers, so sharing can be quite efficient. The distributed scheduler, for its part, is quite good at recognising when results can be reused within a computation graph or between graphs.
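For example, with a recent dask you can pin the scheduler explicitly (dask.config.set is the newer configuration interface; older releases used dask.set_options), and intermediate chunks then stay within one process where every thread can see them:

import dask

# run the whole computation on the threaded scheduler; chunks are shared in-process
with dask.config.set(scheduler='threads'):
    result = ((arr > 0.75) * arr).sum().values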

mdurant
  • I was thinking it would be cool if xarray/dask supported memory-mapping through and through. But on the other hand, lazy loading of chunks is almost functionally equivalent. I assume that xarray is using the dask.threaded scheduler by default. I'm building a server around an xarray/CF-style data model (for web visualization apps to access & query via websocket), and am trying to decide whether to rely on the memmapping hack I described above along with my own parallelization, versus going all-in on dask. – chrisbarber Jun 26 '17 at 19:39
  • The data itself is read-only, but for some queries it may be useful to compute a dynamic mask array based on some parameters/filters. This is where a mutable sharedmem array might come in handy, though using dask to compute the mask with each request might be fine as well. I have more investigating to do on my own. I will probably accept your answer, though, since my question seems to be asking about unsupported/undocumented aspects of xarray, which is a tall order. – chrisbarber Jun 26 '17 at 20:09
  • Yes, dask uses the threaded scheduler by default, and therefore so does xarray, unless you created a distributed Client. For HDF files, this is desirable, since there can be inter-process file locking issues otherwise. For interactive visualisation of large datasets via xarray/dask, you may wish to look into [datashader examples](https://github.com/bokeh/datashader/tree/master/examples). – mdurant Jun 26 '17 at 21:23
  • I was worried that a task graph such as `((arr>0.75)*arr).sum()` would eat up a lot of memory, since the `(arr>0.75)` intermediate calculation is the size of the full array. But now it seems obvious to me that the footprint will just be chunksize*Nthreads. I think this will work great for my application, without any need for memmapping. I am doing server-side statistical queries using functions such as `da.einsum`, while the visualization component is quite simple as of now. Datashader looks amazing, and thanks for the HDF locking tip. – chrisbarber Jun 27 '17 at 06:49