I want to create a time-dependent mean encoding and time-dependent average values computed from each row's previous history within the same dataset. Consider the following reproducible example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer_id': list(np.arange(0, 3)) * 5,
    'date': pd.date_range('2019-01-01', '2019-01-15'),
    'feature_1': np.linspace(0, 100, 15),
    'feature_2': ['hi', 'hi', 'hi', 'bye', 'bye'] * 3,
    'target': [0, 1, 0, 1, 1] * 3
})
df.head()
   customer_id       date  feature_1 feature_2  target
0            0 2019-01-01   0.000000        hi       0
1            1 2019-01-02   7.142857        hi       1
2            2 2019-01-03  14.285714        hi       0
3            0 2019-01-04  21.428571       bye       1
4            1 2019-01-05  28.571429       bye       1
I create a previous-date column grouped by customer_id (shown here for customer_id == 2):
df['prev_date'] = df.groupby('customer_id')['date'].shift()
df[df['customer_id'] == 2]
    customer_id       date   feature_1 feature_2  target  prev_date
2             2 2019-01-03   14.285714        hi       0        NaT
5             2 2019-01-06   35.714286        hi       0 2019-01-03
8             2 2019-01-09   57.142857       bye       1 2019-01-06
11            2 2019-01-12   78.571429        hi       1 2019-01-09
14            2 2019-01-15  100.000000       bye       1 2019-01-12
Here's where I get stuck. I want to efficiently compute the output below on a reasonably big dataset. Bonus points if you could do this well in vaex / dask or some other library that's good for data that's too big to fit into memory.
Expected output for customer_id == 2
I've tried doing things similar to df.groupby('customer_id').apply(lambda x: x[x['date'] > x['prev_date']]['target'].mean())
but I have not had success. Thanks in advance!
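To make the goal more concrete, here is a minimal sketch of one possible reading of it, not a confirmed solution. It assumes "previous history" means all rows with a strictly earlier date in the same group, that dates are unique within a group (ties would need extra handling), and that the mean encoding of feature_2 is taken over all customers. The helper name prior_mean and the output column names are hypothetical. It uses only grouped cumulative sums and counts, so it avoids apply:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer_id': list(np.arange(0, 3)) * 5,
    'date': pd.date_range('2019-01-01', '2019-01-15'),
    'feature_1': np.linspace(0, 100, 15),
    'feature_2': ['hi', 'hi', 'hi', 'bye', 'bye'] * 3,
    'target': [0, 1, 0, 1, 1] * 3,
})

# Grouped cumulative operations follow row order, so sort by date first.
df = df.sort_values('date')


def prior_mean(frame, key, col):
    """Mean of `col` over earlier rows of the same `key` group (hypothetical helper)."""
    grouped = frame.groupby(key)[col]
    # Cumulative sum minus the current value = sum over strictly earlier rows.
    prior_sum = grouped.cumsum() - frame[col]
    # cumcount() is the number of earlier rows in the group (0 for the first).
    prior_cnt = grouped.cumcount()
    # Mask zero counts so the first row of each group becomes NaN.
    return prior_sum / prior_cnt.where(prior_cnt > 0)


# Per-customer averages from previous history.
df['target_prev_mean'] = prior_mean(df, 'customer_id', 'target')
df['feature_1_prev_mean'] = prior_mean(df, 'customer_id', 'feature_1')

# Time-dependent mean encoding of feature_2 (assumed to be across all customers).
df['feature_2_enc'] = prior_mean(df, 'feature_2', 'target')

print(df[df['customer_id'] == 2])
```

The first row of every group is NaN because it has no history. Since this relies only on a sort plus grouped cumulative operations, it may be easier to port to an out-of-core library than an apply-based version, though I have not verified that in vaex or dask.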