I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
--- | --- | --- | --- | ---
1 | A | 1,000 | 1 | 2020-01-01 11:00:00
2 | A | 2,000 | 2 | 2019-11-01 10:00:00
3 | A | 5,000 | 3 | 2019-12-01 08:00:00
4 | B | 1,000 | 4 | 2020-01-01 05:00:00
5 | B | 6,000 | 5 | 2019-01-01 01:00:00
6 | B | 7,000 | 6 | 2019-04-01 11:00:00
7 | A | 8,000 | 7 | 2019-11-30 07:00:00
8 | B | 500 | 8 | 2020-01-01 05:00:00
9 | B | 1,000 | 9 | 2020-01-01 03:00:00
10 | B | 2,000 | 10 | 2020-01-01 02:00:00
11 | A | 1,000 | 11 | 2019-05-02 01:00:00
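For reference, the sample frame above can be rebuilt roughly like this (a sketch; column names as in the table, with column_b stored as plain integers since the commas above are only thousands separators):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "column_a": ["A", "A", "A", "B", "B", "B", "A", "B", "B", "B", "A"],
        "column_b": [1000, 2000, 5000, 1000, 6000, 7000, 8000, 500, 1000, 2000, 1000],
        "column_c": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        "datetime_column": pd.to_datetime([
            "2020-01-01 11:00:00", "2019-11-01 10:00:00", "2019-12-01 08:00:00",
            "2020-01-01 05:00:00", "2019-01-01 01:00:00", "2019-04-01 11:00:00",
            "2019-11-30 07:00:00", "2020-01-01 05:00:00", "2020-01-01 03:00:00",
            "2020-01-01 02:00:00", "2019-05-02 01:00:00",
        ]),
    },
    index=range(1, 12),  # index_example runs from 1 to 11
)
```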
Purpose:
For each row, compute rolling statistics for column_b over a window of time in datetime_column, defined as the last N months relative to that row's timestamp. The window is additionally restricted to rows that share the current row's value in column_a.
A code example using a for loop, which is not feasible given the size of the data:

from datetime import timedelta

mean_dict = {}
for index, value in df.datetime_column.items():
    test_date = value
    test_column_a = df.column_a[index]
    # rows in the same column_a group, inside the trailing window
    # (approximated here as 180 days)
    subset_df = df[(df.datetime_column < test_date)
                   & (df.datetime_column >= test_date - timedelta(days=180))
                   & (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
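The intent is one value per row; if the results were to be attached back to the frame, it would be something along these lines (a sketch; the new column name is made up):

```python
import pandas as pd

# mean_dict is keyed by the dataframe index, so it maps straight back onto df.
df["column_b_rolling_mean"] = pd.Series(mean_dict)
```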
For example, for row #1:
- Target date = 2020-01-01 11:00:00
- Target value in column_a = A
- Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
- Average would be the mean of rows 2, 3, and 7 (a quick check of this case is sketched after these examples)
If I wanted the average for row #2, it would be:
- Target date = 2019-11-01 10:00:00
- Target value in column_a = A
- Date Range: from 2019-05-01 10:00:00 to 2019-11-01 10:00:00
- Average would be the mean of row 11
and so on...
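To make the expected output concrete, here is a manual check of the row #1 case (a sketch; it uses a 6-month DateOffset to match the dates above, whereas the loop above approximates the window with 180 days):

```python
import pandas as pd

target_date = df.loc[1, "datetime_column"]      # 2020-01-01 11:00:00
target_a = df.loc[1, "column_a"]                # "A"
start = target_date - pd.DateOffset(months=6)   # 2019-07-01 11:00:00

window = df[(df.column_a == target_a)
            & (df.datetime_column >= start)
            & (df.datetime_column < target_date)]

print(window.index.tolist())   # expected: [2, 3, 7]
print(window.column_b.mean())  # mean of rows 2, 3, and 7 -> 5000.0
```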
I cannot use pd.Grouper, since in reality I do not have dates but datetimes.
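For clarity, the kind of pd.Grouper call I am referring to is roughly the following (the frequency here is arbitrary, purely to illustrate):

```python
# Bins the rows of each column_a group into fixed 6-month calendar buckets,
# rather than giving a trailing per-row window as described above.
df.groupby(["column_a", pd.Grouper(key="datetime_column", freq="6MS")]).column_b.mean()
```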
Has anyone encountered this before?
Thanks!
EDIT: The dataframe is big (~2M rows), which means that looping is not an option. I have already tried looping and creating a subset based on conditional values, but it takes too long.