
This is the first time I am asking a question here, so let me know if you require more information to suggest a solution.

I have a three-dimensional boolean data array (time, lat, lon), which I have processed using the xarray library in Python 3. The data array I am working with covers one year at daily timesteps (365 or 366 days, depending on whether it is a leap year).

The dimensions of a sample data array are shown here, and a sample netcdf file can be downloaded here and loaded with da = xr.open_dataset('data.nc'). This sample includes five time steps only.

I would like to know how long the longest sequence of True values is for each cell (or pixel). The output should be a two-dimensional data array or data frame containing the length of the longest sequence. For example, if a cell has the values [True, True, True, False, True], the result I want for that pixel is 3, since the longest run of consecutive True values is three long.

I have tried using cumulative sums across time with da.cumsum('time'), but this adds up all True values even when they are not consecutive, which is not what I want.
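To show the problem concretely, here is a minimal sketch on an invented one-pixel array (the dimension sizes and values are made up for the example):

```python
import numpy as np
import xarray as xr

# Toy pixel with values [True, True, True, False, True] along time
da = xr.DataArray(
    np.array([True, True, True, False, True]).reshape(5, 1, 1),
    dims=('time', 'lat', 'lon'),
)

# A plain cumulative sum keeps counting across the False gap,
# so the final value is 4 rather than the longest run of 3
print(da.cumsum('time').values.ravel())
```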

Similar questions have been asked before for two-dimensional dataframes, for example here and here. But I have not been able to implement those solutions successfully for a three-dimensional data array.

Since I am fairly new to Python and xarray in particular, I cannot figure out how I could achieve this. Any ideas would be appreciated.

lidefi87

2 Answers


In case anyone needs a solution, I found one here. There, @tda suggests resetting the cumulative sum to zero every time a zero is encountered in the original data array, using this line of code:

cumulative = data.cumsum(dim='time') - data.cumsum(dim='time').where(data.values == 0).ffill(dim='time').fillna(0)

where data is the original data array on which the cumulative sum is calculated.
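To illustrate the reset trick end to end, here is a runnable sketch on the five-step toy pixel from the question (the array values are invented for the example); taking the maximum over time afterwards gives the longest run:

```python
import numpy as np
import xarray as xr

# Toy boolean array: one pixel with [True, True, True, False, True] along time
data = xr.DataArray(
    np.array([True, True, True, False, True]).reshape(5, 1, 1),
    dims=('time', 'lat', 'lon'),
)

# Cumulative sum that resets to zero wherever the original data is False:
# subtracting the forward-filled cumsum value at each zero cancels the
# count accumulated before the gap
cumulative = (
    data.cumsum(dim='time')
    - data.cumsum(dim='time').where(data == 0).ffill(dim='time').fillna(0)
)

# Longest run of True values per (lat, lon) cell
longest = cumulative.max(dim='time')
print(int(longest))  # 3
```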

lidefi87

Thank you very much! I had to use this after a resample so I embedded your code in a function:

def n_longest_consecutive(ds, dim='time'):
    # Cumulative sum that resets wherever the original array is False/zero
    ds = ds.cumsum(dim=dim) - ds.cumsum(dim=dim).where(ds == 0).ffill(dim=dim).fillna(0)
    return ds.max(dim=dim)

I just deleted the .values inside the where: accessing .values forces evaluation of the data, so the workflow is no longer lazy.
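Assuming the helper is meant to be applied per resample window, here is a runnable sketch (the ten-day series and the 5-day window are invented for the example):

```python
import numpy as np
import pandas as pd
import xarray as xr

def n_longest_consecutive(ds, dim='time'):
    # Cumulative sum that resets wherever the original array is False/zero
    ds = ds.cumsum(dim=dim) - ds.cumsum(dim=dim).where(ds == 0).ffill(dim=dim).fillna(0)
    return ds.max(dim=dim)

# Hypothetical daily boolean series for a single pixel over ten days
times = pd.date_range('2020-01-01', periods=10)
da = xr.DataArray(
    np.array([1, 1, 1, 1, 0, 1, 1, 1, 0, 1], dtype=bool).reshape(10, 1, 1),
    dims=('time', 'lat', 'lon'),
    coords={'time': times},
)

# Longest run within each 5-day window: [1,1,1,1,0] -> 4 and [1,1,1,0,1] -> 3
result = da.resample(time='5D').map(n_longest_consecutive)
print(result.values.ravel())
```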

cyril