How do I add a new DataArray to an existing Dataset without overwriting the whole thing? The new DataArray shares some coordinates with the existing one, but also has new ones. In my current implementation, the Dataset gets completely overwritten, instead of just adding the new stuff.

The existing Dataset is chunked and stored in a zarr-backed DirectoryStore (though I have the same problem with an S3 store).

import numpy as np
import xarray as xr
import zarr

arr1 = xr.DataArray(np.random.randn(2, 3),
                   [('x', ['a', 'b']), ('y', [10, 20, 30])],
                   name='arr1')

ds = arr1.chunk({'x': 1, 'y': 3}).to_dataset()

ds looks like this:

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
Data variables:
    arr1     (x, y) float64 dask.array<shape=(2, 3), chunksize=(1, 3)>

I write it to a directory store:

store = zarr.DirectoryStore('test.zarr')
z = ds.to_zarr(store, group='arr', mode='w')

It looks good:

$ ls -l test.zarr/arr
total 0
drwxr-xr-x  6 myuser  mygroup  204 Sep 21 11:03 arr1
drwxr-xr-x  5 myuser  mygroup  170 Sep 21 11:03 x
drwxr-xr-x  5 myuser  mygroup  170 Sep 21 11:03 y

I create a new DataArray that shares some coordinates with the existing one, and add it to the existing Dataset. I'll read the existing Dataset first, since that's what I'm doing in practice.

ds2 = xr.open_zarr(store, group='arr')
arr2 = xr.DataArray(np.random.randn(2, 3),
                   [('x', arr1.x), ('z', [1, 2, 3])],
                   name='arr2')
ds2['arr2'] = arr2

The updated Dataset looks fine:

<xarray.Dataset>
Dimensions:  (x: 2, y: 3, z: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
  * z        (z) int64 1 2 3
Data variables:
    arr1     (x, y) float64 dask.array<shape=(2, 3), chunksize=(1, 3)>
    arr2     (x, z) float64 0.4728 1.118 0.7275 0.4971 -0.3398 -0.3846

...but I can't write to it without a complete overwrite.

# I think I'm "appending" to the group `arr`
z2 = ds2.to_zarr(store, group='arr', mode='a')

This gives me a ValueError: The only supported options for mode are 'w' and 'w-'.

# I think I'm "creating" the new arr2 array in the arr group
z2 = ds2.to_zarr(store, group='arr', mode='w-')

This gives me ValueError: path 'arr' contains a group.

The only thing that worked is z2 = ds2.to_zarr(store, group='arr', mode='w'), but this completely overwrites the group.

The original DataArray is actually quite large in my problem, so I really don't want to re-write it. Is there a way to only write the new DataArray?

Thank you!

jkmacc

2 Answers

It's been a while since this question was posted, but maybe this is still relevant and helpful to someone (it was to me!).

Version 0.16.2 of xarray introduced the region keyword to to_zarr, which lets you write to a limited region of a zarr store. This appears to let you add a new variable to an existing dataset without overwriting it entirely.

My solution picks up after you've written ds to the zarr and created the new ds2 in memory, just before writing it back.

First, I record the modification time of each entry in the zarr group in a dictionary, so that after the second write I can check that nothing has changed:

import os
import glob

# Record the modification time of every top-level entry in the group.
contents = glob.glob("test.zarr/arr/*")
mtimes = {c: os.path.getmtime(c) for c in contents}

Now I can write the new variable. To use the region keyword, I first drop everything except the new variable: arr1 and the x and y coordinates already exist in the store, and dropping z as well means it is not written as a coordinate:

ds2_dropped = ds2.drop_vars(["x", "y", "z", "arr1"])

Then I write the new variable and confirm from the modification times that the existing entries were not touched:

ds2_dropped.to_zarr("test.zarr/", mode="a", group='arr', region={"x": slice(0, ds2.x.size), "z": slice(0, ds2.z.size)})

for c in contents:
    assert os.path.getmtime(c) == mtimes[c]

# all good!
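
One caveat about this check: it only looks at the top-level entries of the group, and a directory's modification time generally changes when entries are added to or removed from it, not when a file deeper inside is edited in place. A more thorough variant, sketched here with the standard library only, walks the whole tree. The helper names snapshot_mtimes and changed_paths are hypothetical, and the directory layout in the demo merely imitates a store.

```python
import os
import tempfile


def snapshot_mtimes(root):
    """Map every file and directory below root to its mtime in nanoseconds."""
    mtimes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mtimes[path] = os.stat(path).st_mtime_ns
    return mtimes


def changed_paths(before, after):
    """Return the sorted paths that were added, removed, or modified."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))


# Demo on a throwaway directory that mimics a group with one array.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "arr1"))
with open(os.path.join(root, "arr1", "0.0"), "wb") as f:
    f.write(b"existing chunk")

before = snapshot_mtimes(root)

# Simulate adding a second array without touching the first one.
os.makedirs(os.path.join(root, "arr2"))
with open(os.path.join(root, "arr2", "0.0"), "wb") as f:
    f.write(b"new chunk")

after = snapshot_mtimes(root)
changed = changed_paths(before, after)
print(changed)
```

In the answer's setting, root would be "test.zarr/arr", with one snapshot taken before the to_zarr call and one after it.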

And if we load the dataset from zarr again, we can see that the new variable has been added successfully:

print(xr.open_zarr("test.zarr/", group="arr"))

<xarray.Dataset>
Dimensions:  (x: 2, y: 3, z: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
Dimensions without coordinates: z
Data variables:
    arr1     (x, y) float64 dask.array<chunksize=(1, 3), meta=np.ndarray>
    arr2     (x, z) float64 dask.array<chunksize=(2, 3), meta=np.ndarray>

Val

Unfortunately, to the best of my knowledge this is currently not possible. Append mode in to_zarr is implemented to add new entries along a dimension, not to add new variables to a dataset that has already been written.

@davidbrochart wrote a good example of the intended use case in the original pull request:

import xarray as xr
import pandas as pd

ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
ds1 = xr.Dataset({'temperature': (['time'],  [53, 54, 55])}, coords={'time': pd.date_range('2000-01-04', periods=3)})

ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')

ds2 = xr.open_zarr('temp')

You will see that ds2 is the concatenated version of ds0 and ds1 over the time dimension.

The good news is that there is an option to interact with the zarr store directly. If you look at the implementation xarray uses, you can see that adding new variables is actually a possibility in the underlying zarr library. This is however not implemented in the xarray API.

Jendrik