
I'm having trouble getting this to work properly, and I'm also open to other suggestions, as I'm not 100% sure I'm going about this the right way.

Here is some simple dummy data:

import numpy as np
import pandas as pd
import xarray as xr
from numba import guvectorize

times = pd.date_range(start='2012-01-01', freq='1W', periods=25)
x = np.arange(0, 20)
y = np.arange(20, 40)

data = np.random.randint(3, size=(25, 20, 20))

ds = xr.DataArray(data, dims=['time', 'y', 'x'],
                  coords={'time': times, 'y': y, 'x': x})

For each x,y coordinate, I want to return the length of the longest sequence of 1s or 2s over time. So my input array is 3D (time, x, y) and my output is 2D (x, y). The code in 'seq_gufunc' is inspired by this thread. My actual dataset is much larger (with land-use classes instead of 1s, 2s, etc.) and this is only a small part of a bigger workflow, in which I'm also using dask for parallel processing. So in the end this should run fast and efficiently, which is why I ended up trying to figure out how to get numba's @guvectorize and xarray's apply_ufunc to work together:


@guvectorize(
    "(int64[:], int64[:])",
    "(n) -> (n)", target='parallel', nopython=True
)
def seq_gufunc(x, out):

    f_arr = np.array([False])

    bool_stack = np.hstack((f_arr, (x == 1) | (x == 2), f_arr))

    # Get start, stop index pairs for sequences
    idx_pairs = np.where(np.diff(bool_stack))[0].reshape(-1, 2)

    # Get length of longest sequence
    longest_seq = np.max(np.diff(idx_pairs))

    out[:] = longest_seq


## Input for dim would be: 'time' 
def apply_seq_gufunc(data, dim):

    return xr.apply_ufunc(seq_gufunc,
                          data,
                          input_core_dims=[[dim]],
                          exclude_dims=set((dim,)), 
                          dask="allowed") 

There are probably some very obvious mistakes that hopefully someone can point out. I have a hard time understanding what actually goes on in the background, and how I should set up the layout string of @guvectorize and the parameters of apply_ufunc so that it does what I want.
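Part of the confusion is what the wrapped function actually receives. This can be checked without numba by passing a throwaway probe function (a hypothetical stand-in for seq_gufunc) to apply_ufunc and recording the shape it sees:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same dummy data as above
times = pd.date_range(start='2012-01-01', freq='1W', periods=25)
ds = xr.DataArray(np.random.randint(3, size=(25, 20, 20)),
                  dims=['time', 'y', 'x'],
                  coords={'time': times,
                          'y': np.arange(20, 40),
                          'x': np.arange(20)})

seen_shapes = []

def probe(arr):
    # apply_ufunc hands over the whole array in one call,
    # with the core dim ('time') moved to the last axis
    seen_shapes.append(arr.shape)
    return arr.sum(axis=-1)  # stand-in reduction over the core dim

res = xr.apply_ufunc(probe, ds, input_core_dims=[['time']])
print(seen_shapes[0])  # (20, 20, 25)
print(res.dims)        # ('y', 'x')
```

So input_core_dims moves 'time' to the last axis but does not split the array into per-pixel slices; it is the "(n) -> ()" layout of the gufunc that makes the function broadcast over the remaining dimensions.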


EDIT2: This is the working solution. See @OriolAbril's answer for more information about the parameters of apply_ufunc and guvectorize. It was also necessary to add the if...else clause to handle the case where no values match, which would otherwise raise a ValueError.

@guvectorize(
    "(int64[:], int64[:])",
    "(n) -> ()", nopython=True
)
def seq_gufunc(x, out):

    f_arr = np.array([False])
    bool_stack = np.hstack((f_arr, (x == 1) | (x == 2), f_arr))

    if np.sum(bool_stack) == 0:
        longest_seq = 0

    else:
        # Get start, stop index pairs for sequences
        idx_pairs = np.where(np.diff(bool_stack))[0].reshape(-1, 2)

        # Get length of longest sequence
        longest_seq = np.max(np.diff(idx_pairs))   

    out[:] = longest_seq


def apply_seq_gufunc(data, dim):

    return xr.apply_ufunc(seq_gufunc,
                          data,
                          input_core_dims=[[dim]],
                          dask="parallelized",
                          output_dtypes=['uint8']
                         )
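Outside numba, the run-length logic inside seq_gufunc can be sanity-checked on a hand-made 1-D pixel history (hypothetical data, not from the question):

```python
import numpy as np

# Hypothetical 1-D pixel history of classes 0-2 over nine time steps
x = np.array([0, 1, 1, 2, 0, 0, 1, 2, 2])

# Pad with False so runs touching either end still yield start/stop pairs
f_arr = np.array([False])
bool_stack = np.hstack((f_arr, (x == 1) | (x == 2), f_arr))

# np.diff on a boolean array marks every False<->True transition,
# so each run of 1s/2s contributes one (start, stop) index pair
idx_pairs = np.where(np.diff(bool_stack))[0].reshape(-1, 2)
longest_seq = np.max(np.diff(idx_pairs))
print(longest_seq)  # 3 (the runs are [1, 1, 2] and [1, 2, 2])
```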
  • Doesn't solve the problem, but might be still interesting for you: `guvectorize` won't work with `dask.Client` in case you were planning to use a cluster or so. At least it didn't (not sure if they adapted it - I would need to search for the conversation about it somewhere) – Val May 27 '20 at 13:06
  • Hi Val! I already have another `@guvectorize` function implemented and initially used a local cluster via `dask.Client` until some random segfaults occurred. Now I'm using `.compute(scheduler='single-threaded')` as an interim solution, which seems (?) to work. But good to know what actually caused those segfaults. I was losing my mind for a little while. – maawoo May 27 '20 at 13:21
  • Yeah, I spent a good day googling what could be wrong until I found a small conversation somewhere saying that `guvectorize` functions can't be passed to the scheduler and workers ... I wish I remembered where that thread was; maybe it will come back to me. Anyway, using `dask` on a single machine works fine with it, but you're stuck with that. – Val May 27 '20 at 13:32

1 Answer


I'd point you to How to apply a xarray u_function over NetCDF and return a 2D-array (multiple new variables) to the DataSet. The immediate goal there is not the same, but the detailed description and examples should clarify the issue.

In particular, you are right to use time as input_core_dims (to make sure it is moved to the last dimension), and it is correctly formatted as a list of lists. However, you do not need exclude_dims, but rather output_core_dims=[["time"]].

The output has the same shape as the input; however, as explained in the link above, apply_ufunc expects the output to have the same shape as the broadcast dimensions. output_core_dims is needed to get apply_ufunc to expect an output with dims y, x, time.
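As the comments below work out, the reducing variant uses the "(n) -> ()" layout, which consumes the core dimension and broadcasts over the rest. That behaviour can be previewed without numba, since np.vectorize accepts the same gufunc signature strings (longest_run here is a plain-Python stand-in for seq_gufunc):

```python
import numpy as np

def longest_run(x):
    # Longest run of 1s or 2s in a 1-D series; same logic as seq_gufunc
    bool_stack = np.hstack(([False], (x == 1) | (x == 2), [False]))
    if bool_stack.sum() == 0:
        return 0
    idx_pairs = np.where(np.diff(bool_stack))[0].reshape(-1, 2)
    return np.max(np.diff(idx_pairs))

# "(n)->()" : consume the last axis, broadcast over the leading ones
reduce_runs = np.vectorize(longest_run, signature='(n)->()')

data = np.random.randint(3, size=(20, 20, 25))  # core dim last
out = reduce_runs(data)
print(out.shape)  # (20, 20)
```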

OriolAbril
  • Thanks for your answer. Are you sure about `output_core_dims` ? Because when I use `apply_ufunc` like this: `xr.apply_ufunc(seq_gufunc, data, input_core_dims=[[dim]], dask="allowed")` then it will return an array of shape `(20, 20)` (which is what I want). But if I also set `output_core_dims=[["time"]]`, it'll return an array of shape `(20, 20, 25)`. But even though I achieved a reduction of my array, I got some weird numbers. – maawoo May 27 '20 at 12:24
  • After some digging I found out that my function `seq_gufunc` receives the entire array at once, with shape `(20, 20, 25)`, instead of one array per x,y coordinate with shape `(1, 1, 25)` as I expected. I also made some minor changes to my code that I included in my edited original post. – maawoo May 27 '20 at 12:39
  • 1
    To apply the function on a 1d basis (each `(1,1,25)` piece is a 1D array) and get guvectorize to broadcast over the extra dimensions, you have to use `(n) -> ()`, using `(m, m, n) -> (m, m)` is actually preventing the desired broadcast. You'd then have to use `time` as input core dim and nothing as output core dim. – OriolAbril May 27 '20 at 16:50
  • 1
    Thank you so much! I realized the mistake I made by using `(m, m, n) -> (m, m)` yesterday after following [this guide](http://xarray.pydata.org/en/stable/examples/apply_ufunc_vectorize_1d.html), but I still couldn't figure out how to get a 2D array out instead of a 3D one. – maawoo May 28 '20 at 07:03