For later processing convenience, I want to be able to create a very large view onto an xarray DataArray. Here's a small example that works:
data = xr.DataArray([list(i+np.arange(10,20)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray(list([i]*20 for i in range(len(data))), dims=["y", "i"])
print(data)
print(indices)
selection = data.isel(t=indices)
print("selection:")
print(selection)
Output:
<xarray.DataArray (t: 5, x: 10)>
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
[13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
Dimensions without coordinates: t, x
<xarray.DataArray (y: 5, i: 20)>
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]])
Dimensions without coordinates: y, i
selection:
<xarray.DataArray (y: 5, i: 20, x: 10)>
array([[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]],
...
[[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]])
Dimensions without coordinates: y, i, x
So we have a data array, an indexing array, and we use vectorised indexing to select from the first array into a structure that is convenient for later use. So far, great. But it looks like xarray isn't very smart about this, and is copying a lot of data rather than using views back on to the original data (where we have many views on to the same data so it shouldn't take so much RAM to create the view).
To demonstrate the problem, here's a scaled up version:
data = xr.DataArray([list(i+np.arange(10,20000)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray(list([i]*50000 for i in range(len(data))), dims=["y", "i"])
selection = data.isel(t=indices)
print("big selection")
print(selection)
Output:
Traceback (most recent call last):
File "xarray_test.py", line 15, in <module>
selection = data.isel(t=indices)
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1183, in isel
ds = self._to_temp_dataset()._isel_fancy(
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/dataset.py", line 2389, in _isel_fancy
new_var = var.isel(indexers=var_indexers)
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/variable.py", line 1156, in isel
return self[key]
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/variable.py", line 777, in __getitem__
data = as_indexable(self._data)[indexer]
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/indexing.py", line 1159, in __getitem__
return array[key]
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/nputils.py", line 126, in __getitem__
return np.moveaxis(self._array[key], mixed_positions, vindex_positions)
numpy.core._exceptions.MemoryError: Unable to allocate 37.2 GiB for an array with shape (5, 50000, 19990) and data type int64
This shouldn't take 40 GB of RAM since its just lots of views on to the same data. Yeah there is some overhead in the indexing, but we should only need one index per row of 20000 in data
. We shouldn't have to copy over that row of 20000 into a new array.
Is there a way to make xarray do this in a smarter way? Can I more explicitly tell it to use Dask or something, or structure things differently somehow?
Actually I'm also totally happy just doing this with straight up numpy if that's easier, or any other library that can do this efficiently. I just used xarray because I thought it would do something smart with this operation, but I guess not, at least not automatically.
Edit: Ok I just found this question suggesting it is impossible with numpy: Can I get a view of a numpy array at specified indexes? (a view from "fancy indexing"). Not sure if this implies that xarray can't do it either though...
Edit 2: Ok the documentations for xarray.DataArray.sel (https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html) says this:
Returns obj (Dataset) – A new Dataset with the same contents as this dataset, except each variable and dimension is indexed by the appropriate indexers. If indexer DataArrays have coordinates that do not conflict with this object, then these coordinates will be attached. In general, each array’s data will be a view of the array’s data in this dataset, unless vectorized indexing was triggered by using an array indexer, in which case the data will be a copy.
So I guess xarray does try to be smart about selections in general, but not in the vectorised indexing case I want, which is rather annoying... I guess I have to think of a way to do this without vectorised indexing...