
For later processing convenience, I want to be able to create a very large view onto an xarray DataArray. Here's a small example that works:

import numpy as np
import xarray as xr

data = xr.DataArray([list(i + np.arange(10, 20)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray([[i] * 20 for i in range(len(data))], dims=["y", "i"])

print(data)
print(indices)
selection = data.isel(t=indices)
print("selection:")
print(selection)

Output:

<xarray.DataArray (t: 5, x: 10)>
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
       [12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
       [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
       [14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
Dimensions without coordinates: t, x
<xarray.DataArray (y: 5, i: 20)>
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]])
Dimensions without coordinates: y, i
selection:
<xarray.DataArray (y: 5, i: 20, x: 10)>
array([[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]],
...
       [[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]])
Dimensions without coordinates: y, i, x

So we have a data array and an indexing array, and we use vectorised indexing to select from the first array into a structure that is convenient for later use. So far, so good. But it looks like xarray isn't very smart about this: it copies a lot of data rather than creating views back onto the original data (many of the output rows reference the same source row, so building the result shouldn't take much RAM).
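One way to see the copying directly (a quick diagnostic sketch on the small example above, not part of the original code) is `np.shares_memory`:

```python
import numpy as np
import xarray as xr

data = xr.DataArray([list(i + np.arange(10, 20)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray([[i] * 20 for i in range(len(data))], dims=["y", "i"])
selection = data.isel(t=indices)

# Vectorised (fancy) indexing materialises a new array rather than a view:
print(np.shares_memory(selection.values, data.values))  # False
print(selection.nbytes, "bytes in the result vs", data.nbytes, "in the source")
```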

To demonstrate the problem, here's a scaled up version:

import numpy as np
import xarray as xr

data = xr.DataArray([list(i + np.arange(10, 20000)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray([[i] * 50000 for i in range(len(data))], dims=["y", "i"])
selection = data.isel(t=indices)
print("big selection")
print(selection)

Output:

Traceback (most recent call last):
  File "xarray_test.py", line 15, in <module>
    selection = data.isel(t=indices)
  File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1183, in isel
    ds = self._to_temp_dataset()._isel_fancy(
  File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/dataset.py", line 2389, in _isel_fancy
    new_var = var.isel(indexers=var_indexers)
  File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/variable.py", line 1156, in isel
    return self[key]
  File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/variable.py", line 777, in __getitem__
    data = as_indexable(self._data)[indexer]
  File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/indexing.py", line 1159, in __getitem__
    return array[key]
  File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/nputils.py", line 126, in __getitem__
    return np.moveaxis(self._array[key], mixed_positions, vindex_positions)
numpy.core._exceptions.MemoryError: Unable to allocate 37.2 GiB for an array with shape (5, 50000, 19990) and data type int64

This shouldn't take ~40 GB of RAM, since it's just lots of views onto the same data. Sure, there is some overhead in the indexing, but we should only need one index per row of ~20,000 values in data; we shouldn't have to copy each row of 20,000 values into a new array.

Is there a way to make xarray do this in a smarter way? Can I more explicitly tell it to use Dask or something, or structure things differently somehow?

Actually I'm also totally happy just doing this with straight up numpy if that's easier, or any other library that can do this efficiently. I just used xarray because I thought it would do something smart with this operation, but I guess not, at least not automatically.

Edit: OK, I just found this question suggesting it is impossible with numpy: Can I get a view of a numpy array at specified indexes? (a view from "fancy indexing"). Not sure whether that implies xarray can't do it either, though...
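That said, for the specific pattern in my example (each row of indices repeats a single t value), plain numpy can produce a genuine zero-copy view via broadcasting; this is a workaround for this special case only, not general fancy indexing:

```python
import numpy as np

data = np.array([i + np.arange(10, 20000) for i in range(5)])  # shape (5, 19990)

# The desired result is just each row of `data` repeated 50000 times along
# a new axis, which broadcasting expresses without copying (read-only view):
view = np.broadcast_to(data[:, None, :], (5, 50000, data.shape[1]))

print(view.shape)                    # (5, 50000, 19990)
print(np.shares_memory(view, data))  # True
```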

Edit 2: OK, the documentation for xarray.Dataset.sel (https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html) says this:

Returns obj (Dataset) – A new Dataset with the same contents as this dataset, except each variable and dimension is indexed by the appropriate indexers. If indexer DataArrays have coordinates that do not conflict with this object, then these coordinates will be attached. In general, each array’s data will be a view of the array’s data in this dataset, unless vectorized indexing was triggered by using an array indexer, in which case the data will be a copy.

So I guess xarray does try to be smart about selections in general, but not in the vectorised indexing case I want, which is rather annoying... I guess I have to think of a way to do this without vectorised indexing...

Ben Farmer

1 Answer


OK, so grabbing just one index at a time, we can do it without blowing out the RAM:

import numpy as np
import xarray as xr

data = xr.DataArray([list(i + np.arange(10, 20000)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray([[i] * 50000 for i in range(len(data))], dims=["y", "i"])

# Ram blowout
#selection = data.isel(t=indices)
#print("big selection")
#print(selection)

# Can do it one index at a time I guess?
ilen, jlen = indices.shape
out_data = [[data.isel(t=indices[i,j]) for i in range(ilen)] for j in range(jlen)]
print("out_data[0][0]:")
print(out_data[0][0])
print("out_data[0][1]:")
print(out_data[0][1])

Output:

out_data[0][0]:
<xarray.DataArray (x: 19990)>
array([   10,    11,    12, ..., 19997, 19998, 19999])
Dimensions without coordinates: x
out_data[0][1]:
<xarray.DataArray (x: 19990)>
array([   11,    12,    13, ..., 19998, 19999, 20000])
Dimensions without coordinates: x

But of course this loses rather a lot of the xarray convenience I was looking for...
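To keep the xarray interface for this particular repeated-index pattern, the broadcast view can be built in numpy and wrapped back into a DataArray. This is a sketch for this special case only (the broadcast result is read-only, and it doesn't generalise to arbitrary index arrays):

```python
import numpy as np
import xarray as xr

data = xr.DataArray([list(i + np.arange(10, 20000)) for i in range(5)], dims=["t", "x"])

# Each row of `indices` repeats one t value, so the selection is just `data`
# repeated 50000 times along a new "i" axis; broadcast_to gives a zero-copy view.
view = np.broadcast_to(data.values[:, None, :], (data.sizes["t"], 50000, data.sizes["x"]))
selection = xr.DataArray(view, dims=["y", "i", "x"])

print(selection.sizes)
```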

Ben Farmer
  • You could do it one *dimension* at a time, using a numpy array as an index (rather than a DataArray). But yeah, a 2D view into one dimension of a 2D array when numpy’s advanced indexing doesn’t support this is asking a lot. Xarray doesn’t implement *any* array backend - it’s purely coordinate management on top of numpy, dask, or another backend. So I think your issue is with numpy. – Michael Delgado Feb 16 '23 at 04:20
  • Yeah looks like that's the case... it's a shame since doing this kind of thing with views would be very powerful. I can see that allowing it in the general case of fancy indexing is basically impossible and not even always better than copying, but there are certainly many special cases where it's the obvious thing to do (but maybe not so easy to detect algorithmically...). Maybe there is something that can be done to manually stack a bunch of views together? Like I can get the individual views, just need to stack them together perhaps? – Ben Farmer Feb 16 '23 at 22:44
  • Ok seems like that's probably not possible either due to the thing where numpy arrays have to be contiguous in memory. A bunch of stacked up views will definitely not be contiguous. Ugh, maybe I should just write some C code to do this... – Ben Farmer Feb 16 '23 at 22:54
  • Yeah I don’t think so. This would have to be a change upstream in numpy. What is it you’re ultimately trying to do? – Michael Delgado Feb 17 '23 at 00:29
  • I have a gridded rainfall timeseries, like (time, lat, lon), over a roughly 1000x1000 grid, at daily frequency over 20 years. I also have point timeseries from ~1400 rainfall gauges in the same area, but these are at hourly frequency. I am trying to insert distance-weighted averages of the gauge timeseries into the gridded datasets, in the cells with gauges sufficiently nearby (to make them hourly frequency), and just using the daily average values where there isn't gauge data. So there is a bunch of selecting grid cells and inserting values into them over a long time. – Ben Farmer Feb 17 '23 at 03:20
  • Also the averaging changes with time slice since the gauge data has gaps in it, i.e. gauges drop in and out of the averaging step. I think I'm pretty much just settling on doing it hour by hour, vectorised over the grid as efficiently as possible. Might not be able to get much faster than that even if xarray/numpy were smarter with the indexing. – Ben Farmer Feb 17 '23 at 03:22
  • What are you doing with the result? Are you using the full hour * lat * lon dataset, extracting other points, or saving it? You should be able to do this with dask - have you considered selecting with a chunked index array? – Michael Delgado Feb 17 '23 at 04:08
  • Saving it. Currently to netcdf, though ultimately probably to a weird custom format which may hinder use of dask etc. Also how will dask help? Do you think the fancy indexing thing could work there without doing copies? I assumed if it didn't work in numpy it probably won't work in dask either. – Ben Farmer Feb 17 '23 at 05:20
  • So you’re going to create a huge netcdf with all these repeats? Or are you combining it with another dataset first, and if so, how? This is where dask might come in, if you were to e.g. add it to a chunked dataset… also I’d recommend zarr if saving a chunked array – Michael Delgado Feb 17 '23 at 06:24
  • Nah I'll save them as one per day, so maybe 7000 or so netcdf files. Or maybe monthly. But that's mostly just to make checking them easier, they actually have to be written to a custom binary format, one per day, because that's the input format for some simulation software. But the whole thing is embarrassingly parallel so I can split it into as many jobs as fit on a machine. So I guess dask can't really improve on that. – Ben Farmer Feb 17 '23 at 09:19
  • Yeah - I’d split this reshape job up too if you’re really determined to save the full reshaped array. If you’re going to write it out it needs to exist in memory as a contiguous array at some point, so you’ll need to partition it somehow. – Michael Delgado Feb 17 '23 at 09:43
  • Yeah the weird shape is because I take the gauge timeseries and stack 8 of them into each grid cell, to get a (time, 8, lat, long) array. Then I do the weighted average by summing over that 8 dimension. So it can go back to being contiguous at that point i figured. But I've gone and just optimised this at the grid level, iterating over time slices and making sure to precompute some stuff and only deal with the grid cells that are within range of gauges etc, and it's reasonably fast now. Probably nothing could do much better unless I write some C code dedicated to it, maybe Cython. – Ben Farmer Feb 17 '23 at 19:56
  • What’s the 8 dimension? Could you do this with a dot product rather than a select and then sum? – Michael Delgado Feb 17 '23 at 20:32
  • The 8 dim are the 8 gauges assigned to each grid cell. And sure, it's a weighted average that gets done to reduce the dim so that's a dot product of the weights and values. But this is happening after the selection, we do the selection to construct the big (time, 8, latxlong) array in the first place. Actually it's smaller than latxlon since we can preselect which grid cells are involved in this calculation. – Ben Farmer Feb 18 '23 at 06:19