I have rolled my own code to find all the windows of a dataframe based on a time offset, so that I can later apply a function to a whole dataframe (unlike the vanilla .rolling() function in pandas, which operates on a single column at a time). I drew inspiration from this helpful answer to another question.
This solution works as expected in one of my environments (a Hex notebook using pandas 1.3.0), but not in another (PyCharm using pandas 1.4.2).
Edit: After rolling pandas back to 1.3.0 in my PyCharm environment, it works as expected, so the cause is either a change introduced between 1.3.0 and 1.4.2 or a corrupted installation on my machine.
The key piece of code is:
def perform_rolling(df: pd.DataFrame, my_windows: list[tuple]):
    group_id = df[grouping_field_name].unique()[0]
    dfc = df.reset_index(drop=True)
    dfc.drop([grouping_field_name], inplace=True, axis=1)
    dfc.rolling(time_offset, on=time_field_name).apply(assign_windows, kwargs={'my_df': dfc, 'my_windows': my_windows, 'group_id': group_id})
For reference the assign_windows function is:
def assign_windows(ser: pd.Series, my_df: pd.DataFrame, my_windows: list[tuple], group_id):
    my_uids = list(my_df.loc[ser.index, 'uid'].values)
    # pandas' rolling implementation will execute assign_windows() on each column, so we
    # restrict action to a single column to avoid duplicating windows.
    if -1 in ser.values:
        my_windows.append((group_id, my_uids))
    return 1  # This is a dummy return because pd.DataFrame.rolling expects numerical return values.
This is all happening in a closure that contains a my_windows list which gets returned to the calling code.
The problem is that in one of my environments the series sent into assign_windows() has had its index changed back to the time_field_name column, so the my_df.loc[ser.index, 'uid'].values line breaks because my_df is indexed to the default range index.
In my other environment, everything works as expected, and the series coming into assign_windows() still has the same index it had when .rolling() was applied.
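To make the difference between the two environments easy to compare, here is a minimal sketch that records which kind of index pandas attaches to the series passed into the applied function. The sample data and the column names ts and value are placeholders for illustration, not my real schema, and record_index is a hypothetical helper.

```python
import pandas as pd

# Hypothetical sample data; "ts" and "value" are placeholder column names.
df = pd.DataFrame({
    "ts": pd.date_range("2022-01-01", periods=4, freq="D"),
    "value": [1.0, 2.0, 3.0, 4.0],
})

seen_index_types = []

def record_index(ser: pd.Series) -> float:
    # Record the type of index on the window that pandas passes in.
    seen_index_types.append(type(ser.index).__name__)
    return 1.0  # dummy numeric return, as rolling().apply() requires

df.rolling("2D", on="ts").apply(record_index)

# Running this in each environment shows whether the window carries the
# original range index or an index built from the "on" column.
print(pd.__version__, set(seen_index_types))
```

Running the same snippet in both environments should show whether the pandas version alone accounts for the difference.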
Any help preventing pandas from re-indexing the series to the on parameter would be appreciated.
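One possible way to sidestep the dependence on ser.index altogether, sketched here under assumptions rather than offered as a confirmed fix: roll over a helper column of row positions with raw=True, so the applied function receives a plain NumPy array of positions regardless of how pandas indexes the window. The function name collect_windows, the helper column _pos, and the sample data are all hypothetical, and the sketch leaves out the grouping field and group_id handling from my real code.

```python
import numpy as np
import pandas as pd

def collect_windows(df: pd.DataFrame, time_col: str, offset: str, uid_col: str = "uid"):
    """Return the list of uids in each time-based window (illustrative sketch)."""
    dfc = df.reset_index(drop=True)
    # Helper column of row positions; float because rolling works on floats.
    dfc["_pos"] = np.arange(len(dfc), dtype="float64")
    uids = dfc[uid_col].to_numpy()
    windows = []

    def record(pos: np.ndarray) -> float:
        # With raw=True, pos is an ndarray of row positions in this window.
        windows.append(list(uids[pos.astype(int)]))
        return 1.0  # dummy numeric return

    # Only the time column and "_pos" are passed, so apply runs once per row.
    dfc[[time_col, "_pos"]].rolling(offset, on=time_col).apply(record, raw=True)
    return windows

# Hypothetical sample data for illustration.
sample = pd.DataFrame({
    "ts": pd.date_range("2022-01-01", periods=4, freq="D"),
    "uid": ["a", "b", "c", "d"],
})
result = collect_windows(sample, "ts", "2D")
print(result)
```

Because the window contents are looked up by position, this does not rely on the index of the object passed to the applied function, which is the part that differs between my two environments.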