
Could you use a window function on groups with something in feature-engine? I have been reading the docs trying to find some clarity on how to do this; it seems like something that should exist, but I can't find how it's implemented.

import pandas as pd

# create a sample dataframe with groups
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})

# group the data by the 'group' column and apply a rolling window mean of size 2
rolling_mean = df.groupby('group')['value'].rolling(window=2).mean()

print(rolling_mean)

I am guessing it would look something like this:

from feature_engine.timeseries.forecasting import WindowFeatures

wf = WindowFeatures(
    window_size=3,
    variables=["value"],
    operation=["mean"],
    groupby_cols=["group"]
)

transformed_df = wf.fit_transform(df)

However, I can't find a group_by (or groupby_cols) parameter anywhere in feature-engine.

It would be great to see other ways of standardising feature engineering for time series data like this, perhaps from sktime or any other framework too.
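
For comparison, this is the kind of grouped, lagged window feature I mean, written in plain pandas (the shift(1) and the new column name are just my own convention, not something from a library):

# continuing from the df above: grouped rolling mean, shifted by one row
# so each feature value only uses past observations within its group
df["value_window_2_mean"] = (
    df.groupby("group")["value"]
      .transform(lambda s: s.rolling(window=2).mean().shift(1))
)

print(df)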

user4933

1 Answer


Since you want to apply this operation individually to each group, you can combine groupby with apply:

from feature_engine.timeseries.forecasting import WindowFeatures

wf = WindowFeatures(window=3, variables=["value"], functions=["mean"])

# same as pd.concat([wf.fit_transform(X) for _, X in df.groupby('group')])
out = df.groupby('group', group_keys=False).apply(wf.fit_transform)

Output:

>>> out
   group  value  value_window_3_mean
0      A      1                  NaN
1      A      2                  NaN
2      A      3                  NaN
3      B      4                  NaN
4      B      5                  NaN
5      B      6                  NaN
6      B      7                  5.0
7      C      8                  NaN
8      C      9                  NaN
9      C     10                  NaN
10     C     11                  9.0
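
Since each group gets its own independent fit_transform, the apply call is only a convenience; you can also write the per-group loop out explicitly (the comprehension from the comment above), which makes it easier to parallelise later, e.g. with multiprocessing, if you end up with many groups:

import pandas as pd
from feature_engine.timeseries.forecasting import WindowFeatures

wf = WindowFeatures(window=3, variables=["value"], functions=["mean"])

# fit and transform each group separately, then stack the pieces back together
pieces = [wf.fit_transform(group_df) for _, group_df in df.groupby("group")]
out = pd.concat(pieces)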
Corralien
  • Amazing :D This is great, but apply takes a long time; could there be an optimized approach? – user4933 Mar 09 '23 at 20:05
  • No more time than a comprehension. It depends on the number of groups (not the number of rows). You can also use numba or multiprocessing to speed up the operations. – Corralien Mar 09 '23 at 20:07
  • Do you think I could use numba on a dataframe without numpy? – user4933 Mar 09 '23 at 20:19
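
On the numba question from the comments: recent pandas versions can JIT-compile rolling aggregations themselves when numba is installed, so you don't have to drop down to NumPy yourself. A minimal sketch, assuming numba is available (this uses pandas' own engine="numba" option, not feature-engine):

import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})

# pandas compiles the rolling mean with numba under the hood; no explicit numpy needed
rolling_mean = (
    df.groupby('group')['value']
      .rolling(window=2)
      .mean(engine='numba')
)

print(rolling_mean)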