
I want to create a time-dependent mean encoding and time-dependent average values computed from previous history within the same dataset. Consider the following reproducible example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer_id': list(np.arange(0, 3)) * 5,
    'date': pd.date_range('2019-01-01', '2019-01-15'),
    'feature_1': np.linspace(0, 100, 15),
    'feature_2': ['hi', 'hi', 'hi', 'bye', 'bye'] * 3,
    'target': [0, 1, 0, 1, 1] * 3
})
df.head()
     customer_id          date   feature_1  feature_2   target
0              0    2019-01-01    0.000000         hi        0
1              1    2019-01-02    7.142857         hi        1
2              2    2019-01-03   14.285714         hi        0
3              0    2019-01-04   21.428571        bye        1
4              1    2019-01-05   28.571429        bye        1

I create a previous-date column grouped by customer_id:

# Previous date within each customer's own history
df['prev_date'] = df.groupby('customer_id')['date'].shift()
df[df['customer_id'] == 2]
   customer_id           date      feature_1    feature_2   target     prev_date
2            2     2019-01-03      14.285714           hi        0           NaT
5            2     2019-01-06      35.714286           hi        0    2019-01-03
8            2     2019-01-09      57.142857          bye        1    2019-01-06
11           2     2019-01-12      78.571429           hi        1    2019-01-09
14           2     2019-01-15     100.000000          bye        1    2019-01-12

Here's where I get stuck. I want to efficiently compute the output below on a reasonably big dataset. Bonus points if you could do this well in vaex / dask or some other library that's good for data that's too big to fit into memory.

Expected output for customer_id == 2: [image not preserved: previous-history grouped aggregation output]

I've tried things similar to `df.groupby('customer_id').apply(lambda x: x[x['date'] > x['prev_date']]['target'].mean())` but I have not had success. Thanks in advance!
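For concreteness, here is a minimal sketch of the kind of previous-history aggregate described above. It assumes that "previous history" means strictly earlier rows of the same customer_id, and the output column names (`prev_feature_1_mean`, `prev_target_mean`) are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer_id': list(np.arange(0, 3)) * 5,
    'date': pd.date_range('2019-01-01', '2019-01-15'),
    'feature_1': np.linspace(0, 100, 15),
    'feature_2': ['hi', 'hi', 'hi', 'bye', 'bye'] * 3,
    'target': [0, 1, 0, 1, 1] * 3,
})

# Assumption: history is ordered by date within each customer.
df = df.sort_values(['customer_id', 'date'])


def prev_expanding_mean(s):
    # shift() excludes the current row, so each value only sees earlier rows;
    # expanding().mean() then averages everything seen so far.
    return s.shift().expanding().mean()


for col in ['feature_1', 'target']:
    df[f'prev_{col}_mean'] = (
        df.groupby('customer_id')[col].transform(prev_expanding_mean)
    )

print(df[df['customer_id'] == 2])
```

The first row of each customer has no history, so its value is NaN. Whether to leave that as NaN or fill it with a global prior is a modelling choice this sketch does not make.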

Matt Elgazar
  • `df['prev_feature_1_mean'] = df.groupby('customer_id')['feature_1'].apply(lambda x:x.shift().expanding().mean())` – Nick Jul 02 '22 at 04:32
  • Unless your dataframe is not sorted by date, I don't understand why you check `x['date'] > x['prev_date']`, since you shift the date per customer_id? – Corralien Jul 02 '22 at 04:33
  • I actually figured the first part out right after I posted the question. I didn't know about the `expanding` function. I don't know a good way to go about the second part of the question. I was hoping to have the same functionality as an sklearn `TimeDepTargetMeanEncoder` class that takes params `X`, `y`, and either `prev_history_series` or a dict of the column and the mean values (with the number of rows used). That way new transformations will reference the prev_history for that column. – Matt Elgazar Jul 02 '22 at 13:48
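
The encoder described in the last comment could look roughly like the sketch below. `TimeDepTargetMeanEncoder` is hypothetical and not part of scikit-learn; the constructor arguments, the `fit_transform` method, and the `{category: (mean, n_rows)}` layout of `prev_history` are assumptions made for illustration. It also assumes each batch is already sorted by time, has a unique index, and that the encoding is computed per category across all customers.

```python
import pandas as pd


class TimeDepTargetMeanEncoder:
    """Hypothetical time-dependent target mean encoder (not an sklearn class)."""

    def __init__(self, col, prev_history=None):
        self.col = col
        # Assumed layout: {category: (mean_of_target, n_rows_used)}
        self.prev_history = dict(prev_history or {})

    def fit_transform(self, X, y):
        keys = X[self.col]
        y = pd.Series(y, index=X.index)
        grp = y.groupby(keys)

        # Carry-over from earlier batches, expressed as a sum and a row count.
        prior_n = keys.map(lambda k: self.prev_history.get(k, (0.0, 0))[1])
        prior_mean = keys.map(lambda k: self.prev_history.get(k, (0.0, 0))[0])
        prior_sum = prior_mean * prior_n

        # Running totals over strictly earlier rows of the same category.
        seen_sum = grp.cumsum() - y + prior_sum
        seen_n = grp.cumcount() + prior_n
        encoded = (seen_sum / seen_n).where(seen_n > 0)  # NaN when no history

        # Fold this batch into the stored history for later calls.
        for k, row in grp.agg(['sum', 'count']).iterrows():
            old_mean, old_n = self.prev_history.get(k, (0.0, 0))
            new_n = old_n + int(row['count'])
            new_mean = (old_mean * old_n + row['sum']) / new_n
            self.prev_history[k] = (new_mean, new_n)
        return encoded


first = pd.DataFrame({'feature_2': ['hi', 'hi', 'bye', 'hi'],
                      'target': [0, 1, 1, 1]})
later = pd.DataFrame({'feature_2': ['bye', 'hi'], 'target': [0, 1]})

enc = TimeDepTargetMeanEncoder('feature_2')
out1 = enc.fit_transform(first, first['target'])

# A new encoder seeded with the stored history continues where the first ended.
enc2 = TimeDepTargetMeanEncoder('feature_2', prev_history=enc.prev_history)
out2 = enc2.fit_transform(later, later['target'])
print(out1.tolist(), out2.tolist())
```

Storing a mean together with its row count is enough to merge new batches exactly, which is why the history is kept in that form rather than as a mean alone.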
