Get rolling average without every timestamp

Question

I have data about how many messages each account sends aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days messages. I know I can groupby account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average because there isn't a row in the data if the account didn't send any messages that day (and I'd like to not balloon my data by adding these in if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?

Note: For any day not in the data, assume 0 messages sent.

Raw Data:

Account | Messages | Date       | Hour
12        5          2022-07-11   09:00:00
12        6          2022-07-13   10:00:00
12        10         2022-07-13   11:00:00
12        9          2022-07-15   16:00:00
12        1          2022-07-19   13:00:00
15        2          2022-07-12   10:00:00
15        13         2022-07-13   11:00:00
15        3          2022-07-17   16:00:00
15        4          2022-07-22   13:00:00

Desired Output:

Account | Messages | Date       | Hour      | Rolling Previous 7 Day Average
12        5          2022-07-11   09:00:00    0
12        6          2022-07-13   10:00:00    0.714
12        10         2022-07-13   11:00:00    0.714
12        9          2022-07-15   16:00:00    3
12        1          2022-07-19   13:00:00    3.571
15        2          2022-07-12   10:00:00    0
15        13         2022-07-13   11:00:00    0.286
15        3          2022-07-17   16:00:00    2.143
15        4          2022-07-22   13:00:00    0.429

score 1 · Accepted Answer · answered Nov 15 '22 at 21:46

I hope I've understood your question right:

df["Date"] = pd.to_datetime(df["Date"])
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
    "sum"
)

df["Rolling Previous 7 Day Average"] = (
    df.set_index("Date")
    .groupby("Account")["Messages_tmp"]
    .rolling("7D")
    .apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values

df = df.drop(columns="Messages_tmp")

print(df)

Prints:

   Account  Messages       Date      Hour  Rolling Previous 7 Day Average
0       12         5 2022-07-11  09:00:00                        0.000000
1       12         6 2022-07-13  10:00:00                        0.714286
2       12        10 2022-07-13  11:00:00                        0.714286
3       12         9 2022-07-15  16:00:00                        3.000000
4       12         1 2022-07-19  13:00:00                        3.571429
5       15         2 2022-07-12  10:00:00                        0.000000
6       15        13 2022-07-13  11:00:00                        0.285714
7       15         3 2022-07-17  16:00:00                        2.142857
8       15         4 2022-07-22  13:00:00                        0.428571

Thanks, this is good progress I think. When I try to create the "Rolling 7 Day Average" column, I get an error "ValueError: cannot handle a non-unique multi-index!" Not sure exactly what I need to change though. — ErrorJordan, Nov 15 '22 at 22:19
@ErrorJordan Did you use the `.values` at the end? Anyway, if the error persists try to create small sample input that throws the error and put it into question. — Andrej Kesely, Nov 15 '22 at 22:20
Ah, that was it! Sorry - a copy/paste error. Thank you again! — ErrorJordan, Nov 15 '22 at 22:21

Get rolling average without every timestamp

1 Answers1