I have rolled my own code to find all the windows of a DataFrame based on a time offset, so that I can later apply a function to each window as a whole DataFrame (unlike the built-in .rolling() in pandas, which operates on a single column at a time). I drew inspiration from this helpful answer to another question.

This solution works as expected in one of my environments (a Hex notebook using pandas 1.3.0), but not in another (PyCharm using pandas 1.4.2).

Edit: After rolling pandas back to 1.3.0 in my PyCharm environment, it works as expected, so the cause is either something introduced between 1.3.0 and 1.4.2 or a corrupted installation on my machine.

The key piece of code is:

def perform_rolling(df: pd.DataFrame, my_windows: list[tuple]):
    group_id = df[grouping_field_name].unique()[0]
    dfc = df.reset_index(drop=True)
    dfc.drop([grouping_field_name], inplace=True, axis=1)
    dfc.rolling(time_offset, on=time_field_name).apply(assign_windows, kwargs={'my_df': dfc, 'my_windows': my_windows, 'group_id': group_id})

For reference the assign_windows function is:

def assign_windows(ser: pd.Series, my_df: pd.DataFrame, my_windows: list[tuple], group_id):
    my_uids = list(my_df.loc[ser.index, 'uid'].values)
    # pandas' rolling implementation will execute assign_windows() on each column, so we
    # restrict action to a single column to avoid duplicating windows.
    if -1 in ser.values:
        my_windows.append((group_id, my_uids))
    return 1  # This is a dummy return because pd.DataFrame.rolling expects numerical return values.

This all happens inside a closure that holds the my_windows list, which is returned to the calling code.

The problem is that in one of my environments the series passed to assign_windows() has been re-indexed to the time_field_name column, so the my_df.loc[ser.index, 'uid'].values line breaks, because my_df still has the default range index.

In my other environment, everything works as expected, and the series coming into assign_windows() still has the same index it had when .rolling() was applied.

Any help preventing pandas from re-indexing the series to the on column would be appreciated.
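
To make the difference easy to check in a given environment, here is a minimal sketch that records the index of each series handed to the rolling callback. The column names (`ts`, `uid`), the sample data, and the helper `record_index` are made up for illustration and are not from my real code.

```python
import pandas as pd

# Hypothetical sample data: four rows, time column "ts", numeric "uid".
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 00:01",
        "2022-01-01 00:02", "2022-01-01 00:10",
    ]),
    "uid": [10, 11, 12, 13],
})

seen_indexes = []

def record_index(ser: pd.Series) -> float:
    # Store the index labels of the window so they can be inspected later.
    seen_indexes.append(list(ser.index))
    return 1.0  # dummy numeric return, as rolling().apply() requires

# raw=False is the default, so the callback receives a Series per window.
df.rolling("2min", on="ts").apply(record_index)

print(pd.__version__)
for labels in seen_indexes:
    print(labels)
```

Going by the behaviour described above, I would expect the printed labels to be positions from the range index in the working environment and timestamps from the `ts` column in the failing one.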

David R
  • maybe `dfc = df.set_index(time_field_name)` and skip `on=...` would help. – Quang Hoang Jun 11 '22 at 18:56
  • I think that could work, but I could not use the index values then because there may be duplicates. I'm going to try just removing everything from the dataframe except the dateTime column and my uid column and then notate the uid column when it comes through, as it should be the only column then... [That may have been what you had in mind to begin with.] – David R Jun 24 '22 at 11:51
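
For what it's worth, the duplicate-timestamp concern can be sidestepped by not depending on the callback's index at all. The following is only a sketch of a different technique (computing window bounds positionally with `numpy.searchsorted` instead of `rolling().apply()`), assuming the frame is already sorted by the time column. The function name `window_uids` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def window_uids(df: pd.DataFrame, time_col: str, uid_col: str, offset: str):
    """Return the uids in each trailing window (t - offset, t], one list per row.

    Assumes df is sorted ascending by time_col.
    """
    times = df[time_col].to_numpy()
    uids = df[uid_col].to_numpy()
    lower = times - pd.Timedelta(offset).to_timedelta64()
    # First position whose time is strictly greater than t - offset.
    starts = np.searchsorted(times, lower, side="right")
    # Each window ends at the current row (inclusive), by position.
    return [uids[s:i + 1].tolist() for i, s in enumerate(starts)]

df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 00:01",
        "2022-01-01 00:02", "2022-01-01 00:10",
    ]),
    "uid": [10, 11, 12, 13],
})
result = window_uids(df, "ts", "uid", "2min")
print(result)
```

This is meant to mirror the default right-closed time window, but it is worth comparing against `.rolling()` on real data, particularly where timestamps repeat. Because it works on positions, the uid column does not need to be numeric.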

1 Answer

It turns out that this is a bug (or at least an undocumented change) in pandas. Code added to rolling().apply() in version 1.4.1 forces a re-indexing:

    def apply_func(values, begin, end, min_periods, raw=raw):
        if not raw:
            # GH 45912
            values = Series(values, index=self._on)
        return window_func(values, begin, end, min_periods)

I've opened an issue in the hopes that they will find a different solution to the issue that provoked this change.
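
In the meantime, one possible workaround follows the idea from the comments: keep only the time column and the uid column, and read the uids from the window's values instead of its index. The snippet above re-indexes only when `raw` is false, so passing `raw=True` hands the callback a plain NumPy array and avoids the index altogether. This is a sketch that assumes the uids are numeric with no missing values; `collect_windows`, `record`, and the sample data are hypothetical names for illustration.

```python
import pandas as pd

def collect_windows(df: pd.DataFrame, time_col: str, uid_col: str, offset: str):
    """Collect the uids in each time-based rolling window, one list per row."""
    windows = []

    def record(values) -> float:
        # With raw=True, values is a float ndarray holding the uid column.
        windows.append([int(u) for u in values])
        return 1.0  # dummy numeric return, as rolling().apply() requires

    # Only the uid column is rolled over, so record() runs once per window.
    df[[time_col, uid_col]].rolling(offset, on=time_col).apply(record, raw=True)
    return windows

df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 00:01",
        "2022-01-01 00:02", "2022-01-01 00:10",
    ]),
    "uid": [10, 11, 12, 13],
})
windows = collect_windows(df, "ts", "uid", "2min")
print(windows)
```

The collected uids can then be used to look up the full rows of the original DataFrame for each window.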

David R