3

I would like to produce a zarr array pointing to part of a zarr array on disk, similar to how sliced = np_arr[5] gives me a view into np_arr, such that modifying the data in sliced modifies the data in np_arr. Example code:

import matplotlib.pyplot as plt
import numpy as np
import zarr


arr = zarr.open(
    'temp.zarr',
    mode='a',
    shape=(4, 32, 32),
    chunks=(1, 16, 16),
    dtype=np.float32,
)
arr[:] = np.random.random((4, 32, 32))

fig, ax = plt.subplots(1, 2)
arr[2, ...] = 0  # works fine, "wipes" slice 2
ax[0].imshow(arr[2])  # all 0s

arr_slice = arr[1]  # returns a NumPy array — loses ties to zarr on disk
arr_slice[:] = 0
ax[1].imshow(arr[1])  # no surprises — shows original random data

plt.show()

Is there anything I can write instead of arr_slice = arr[1] that will make arr_slice be a (writeable) view into the arr array on disk?

Juan

2 Answers

5

The TensorStore library is specifically designed to do this. All indexing operations produce lazy views:

import tensorstore as ts
import numpy as np
arr = ts.open({
  'driver': 'zarr',
  'kvstore': {
    'driver': 'file',
    'path': '.',
  },
  'path': 'temp.zarr',
  'metadata': {
    'dtype': '<f4',
    'shape': [4, 32, 32],
    'chunks': [1, 16, 16],
    'order': 'C',
    'compressor': None,
    'filters': None,
    'fill_value': None,
  },
}, create=True).result()
arr[1] = 42  # Overwrites, just like numpy/zarr library
view = arr[1] # Returns a lazy view, no I/O performed
np.array(view) # Reads from the view
# Returns JSON spec that can be passed to `ts.open` to reopen the view.
view.spec().to_json()

You can read more about the "index transform" mechanism that underlies these lazy views here: https://google.github.io/tensorstore/index_space.html#index-transform and https://google.github.io/tensorstore/python/indexing.html

Disclaimer: I'm an author of TensorStore.

jbms
4

One way to do this would be with a custom store object. You could subclass DirectoryStore, or whatever other base store your data are in, and override the `__getitem__` / `__setitem__` methods. This is probably harder than you wish it were.

A better option would be to copy Xarray's LazilyIndexedArray types, which are a piece of magic written by Stephan Hoyer: https://github.com/pydata/xarray/blob/master/xarray/core/indexing.py#L516. I think these do exactly what you want. They are not part of Xarray's public API, but IMO they are so useful they should actually be in a standalone package.
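To give a rough idea of what such a lazy wrapper does, here is a minimal sketch. It is not Xarray's implementation: `LazyView` is a hypothetical name, and it only handles the simple case from the question (one or more integer indices on the leading axes). It stores the parent array and the index, and forwards reads and writes to the parent only when the view itself is indexed. A NumPy array stands in for the zarr array here, on the assumption that the parent accepts tuple keys in `__getitem__` / `__setitem__` the way `arr[2, ...] = 0` does in the question.

```python
import numbers

import numpy as np


class LazyView:
    """Hypothetical write-through view of parent[index].

    Nothing is read or copied until the view itself is indexed.
    Only integer indices on the leading axes are supported.
    """

    def __init__(self, parent, index):
        index = index if isinstance(index, tuple) else (index,)
        if not all(isinstance(i, numbers.Integral) for i in index):
            raise TypeError("this sketch only supports integer indices")
        self.parent = parent
        self.index = index

    def _full_key(self, key):
        # Prepend the stored leading indices to the key given by the caller.
        key = key if isinstance(key, tuple) else (key,)
        return self.index + key

    def __getitem__(self, key):
        # The read happens here, against the parent array.
        return self.parent[self._full_key(key)]

    def __setitem__(self, key, value):
        # The write goes straight through to the parent array.
        self.parent[self._full_key(key)] = value


# Stand-in for the on-disk array from the question.
backing = np.ones((4, 32, 32), dtype=np.float32)

view = LazyView(backing, 1)  # no data touched yet
view[:] = 0                  # equivalent to backing[1, :] = 0
```

With the array from the question, `arr_slice = LazyView(arr, 1)` followed by `arr_slice[:] = 0` should then behave like `arr[1, :] = 0`. Composing slices or fancy indices lazily is the hard part, and that is what the Xarray classes handle.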

There is also a nice related blog post on this topic here: https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671

Ryan
  • Thanks! That is indeed harder than I wished. Jeremy's library is probably what I'll use for now. Do you think there is scope to revisit the eager getitem design of zarr-python? If so I would open an issue, but I realise this is a massive design change. But it could be an attribute set at creation-time, `lazy=False` (default)... But anyway the blog post and links are super useful. – Juan Nov 23 '20 at 23:08
  • For future code archeologists, more discussion on Twitter: https://twitter.com/shoyer/status/1331020787155828738 – Juan Nov 24 '20 at 00:23