
This answer provides a solution for getting a rolling sum of one column, grouped by another column, over a date window. To reproduce it here:

import pandas as pd
import datetime

df = pd.DataFrame(
    {
        'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
        'Date': {
            0: datetime.datetime(2019, 7, 1),
            1: datetime.datetime(2019, 5, 1),
            2: datetime.datetime(2019, 6, 25),
            3: datetime.datetime(2019, 5, 27),
            4: datetime.datetime(2019, 6, 29),
            5: datetime.datetime(2019, 7, 18),
            6: datetime.datetime(2019, 7, 15)
        },
        'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
    }
)
# 28-day rolling sum of Amount within each ID, ordered by Date
amounts = df.groupby(["ID"]).apply(lambda g: g.sort_values('Date').rolling('28d', on='Date').sum())
# look up each row's rolling sum via Date and write it back onto df
df['amount_4wk_rolling'] = df["Date"].map(amounts.set_index('Date')['Amount'])

Output:

+-------+------------+--------+--------------------+
|  ID   |    Date    | Amount | amount_4wk_rolling |
+-------+------------+--------+--------------------+
| 10001 | 01/07/2019 |     50 |                 60 |
| 10001 | 01/05/2019 |     15 |                 15 |
| 10001 | 25/06/2019 |     10 |                 10 |
| 10001 | 27/05/2019 |     20 |                 35 |
| 10002 | 29/06/2019 |     25 |                 25 |
| 10002 | 18/07/2019 |     35 |                100 |
| 10002 | 15/07/2019 |     40 |                 65 |
+-------+------------+--------+--------------------+

However, if two of the dates are the same then I get the error:

pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

This makes sense, as I can see from the final line that Date is being used to set an index, which is now no longer unique. However, as I don't really understand what that final line does, I'm a little stumped trying to develop an alternative solution.
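
To illustrate the error itself, here's a minimal toy reproduction with made-up values (separate from my data):

import pandas as pd

# toy lookup Series whose index contains a duplicate date
lookup = pd.Series(
    [60, 100],
    index=pd.to_datetime(['2019-07-18', '2019-07-18']),
)

# mapping through a non-unique index raises the same error
pd.Series(pd.to_datetime(['2019-07-18'])).map(lookup)
# pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects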

Could someone help out?

Jossy
  • The last line is using the `amounts` dataframe as a pseudo-dictionary to look up values in the "Amount" column and populate the "amount_4wk_rolling" column (see the sketch after these comments) – Paul H Dec 17 '21 at 20:22
  • I think my [answer here](https://stackoverflow.com/questions/69979727/how-can-i-fill-a-column-with-values-that-are-computed-between-two-dates-in-panda) should work. You just need to aggregate the sum instead of the mean. (And make sure to `sort_values` before doing the aggregation) – ALollz Dec 17 '21 at 20:22
  • Well, what answer do you want to get? It's ambiguous whether you'd want to group the values for the same day in the rolling and then have one value, or still do a rolling sum where different rows for the same day would have different sums – ALollz Dec 17 '21 at 20:39
  • @ALollz - thanks. So I've tried `df['amount_4wk_rolling'] = df.reset_index().groupby(["ID"]).apply(lambda g: g.sort_values('Date').rolling("28d", on="Date").agg({'Amount': 'sum', 'index': 'max'}).reset_index(drop=True).set_index('index'))` but I get the error: `ValueError: cannot handle a non-unique multi-index!`. On closer inspection the code seems to return a multi-index dataframe rather than a series, so I've well and truly cocked up. I'm just trying to recreate the exact answer without duplicate dates at this stage. – Jossy Dec 17 '21 at 20:55
  • I've also tried `df['amount_4wk_rolling'] = df.sort_values('Date').reset_index().groupby(["ID"]).rolling("28d", on="Date").agg({'Amount': 'sum', 'index': 'max'}).reset_index(drop=True).set_index('index')` but then I get `ValueError: cannot reindex from a duplicate axis` – Jossy Dec 17 '21 at 20:57
  • @ALollz - apologies if I confused things. Is there any way you'd be able to take a look at my two attempts at adapting your code to my scenario? To answer your question about what answer I'm looking for: I'm just trying to replicate the output in my question at this stage, i.e. without duplicate dates. I'll then adapt the dataframe to include duplicate dates and test that. – Jossy Dec 20 '21 at 02:59
  • Is it important to use `rolling`? – Danish Bansal Dec 21 '21 at 19:21
  • @DanishBansal - hey. Not if I can get the same functionality with comparable speed on a dataframe of circa 3m rows! – Jossy Dec 21 '21 at 19:29
  • @DanishBansal - I've actually now managed to find a question that covers this. Typical! https://stackoverflow.com/questions/41642320/efficient-pandas-rolling-aggregation-over-date-range-by-group-python-2-7-windo. Prob best to stand down - I'll close this question when the bounty ends – Jossy Dec 21 '21 at 19:48
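
For reference, here's a minimal sketch of the "pseudo-dictionary" lookup described in the first comment, with made-up values:

import pandas as pd

# a Series with a unique index behaves like a dict for .map(): index -> value
mapping = pd.Series({'a': 10, 'b': 20})
print(pd.Series(['b', 'a', 'b']).map(mapping))
# 0    20
# 1    10
# 2    20
# dtype: int64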

2 Answers


Based on comments to the question, it appears that the OP has already found a solution. However, here is another way to resolve this, one that addresses the root cause of the error: duplicate date values.

To resolve it, we can aggregate by date within the apply. In the snippet below, the Amount values are aggregated with sum, but in some contexts another aggregation may be more appropriate, e.g. min or max. This is the relevant part:

    .apply(
        lambda g: (
            g
            .groupby('Date', as_index=False)
            .agg({'Amount': 'sum'})
            .rolling('28d', on='Date')
            .sum()
        )
    )

The full snippet is below:

import pandas as pd
import datetime

df = pd.DataFrame(
    {
        'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
        'Date': {
            0: datetime.datetime(2019, 7, 1),
            1: datetime.datetime(2019, 5, 1),
            2: datetime.datetime(2019, 6, 25),
            3: datetime.datetime(2019, 5, 27),
            4: datetime.datetime(2019, 6, 29),
            5: datetime.datetime(2019, 7, 18),
            6: datetime.datetime(2019, 7, 18)
        },
        'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
    }
)

# within each ID: collapse duplicate dates, then take the 28-day rolling sum
amounts = (
    df
    .groupby(["ID"])
    .apply(
        lambda g: (
            g
            .groupby('Date', as_index=False)
            .agg({'Amount': 'sum'})
            .rolling('28d', on='Date')
            .sum()
        )
    )
)

# map the per-date rolling sums back onto the original rows via Date
df['amount_4wk_rolling'] = df["Date"].map(amounts.set_index('Date')['Amount'])

# this yields
#       ID       Date  Amount  amount_4wk_rolling
# 0  10001 2019-07-01      50                60.0
# 1  10001 2019-05-01      15                15.0
# 2  10001 2019-06-25      10                10.0
# 3  10001 2019-05-27      20                35.0
# 4  10002 2019-06-29      25                25.0
# 5  10002 2019-07-18      35               100.0
# 6  10002 2019-07-18      40               100.0
SultanOrazbayev
  • Thanks for this! If I had two `Amount` columns; `Amount1` and `Amount2` - how would I apply the function to both of these without using 2x `map`? – Jossy Dec 29 '21 at 20:31
  • Thank you for the bonus! :) Hmm, I'm not sure that's possible. If you are dealing with large dataframes, it might be possible to speed it up a bit with `dask`, but it would still involve two mappings... – SultanOrazbayev Dec 30 '21 at 03:03
  • Ok thanks! I opted to `pd.merge` `amounts` back into `df` (see the sketch below) – Jossy Dec 31 '21 at 04:42
  • That makes sense, especially if it's possible to subset dfs by meaningful non-overlapping date ranges (e.g. only dates within a particular month), because then the dataframes are small. – SultanOrazbayev Dec 31 '21 at 05:15
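
A minimal sketch of the merge-based alternative mentioned in the comments above, assuming the amounts frame from the snippet in this answer; the merge replaces the final map line and can attach several aggregated columns at once (the renamed column is illustrative):

# bring the ID group key out of the index so it can serve as a merge key
rolled = (
    amounts
    .reset_index(level=0)
    .rename(columns={'Amount': 'amount_4wk_rolling'})
)

# one left merge attaches the rolling sums (and any other aggregated columns)
df = df.merge(rolled, on=['ID', 'Date'], how='left')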

The problem is the first-level index of amounts:

>>> df
      ID       Date  Amount
0  10001 2019-07-01      50
1  10001 2019-05-01      15
2  10001 2019-06-25      10
3  10001 2019-05-27      20
4  10002 2019-06-29      25
5  10002 2019-07-18      35  # <- dup date
6  10002 2019-07-18      40  # <- dup date

>>> amounts
         Amount       Date       ID
ID                                 
10001 1    15.0 2019-05-01  10001.0
      3    35.0 2019-05-27  20002.0
      2    10.0 2019-06-25  10001.0
      0    60.0 2019-07-01  20002.0
10002 4    25.0 2019-06-29  10002.0
      5    60.0 2019-07-18  20004.0
      6   100.0 2019-07-18  30006.0

If you map amounts on the Date column to merge the data back into df, you get this error because Pandas doesn't know which of the two values to use for 2019-07-18. If you look carefully, the second level of the index of amounts is the index of your original dataframe.

So if you drop the first-level index set by groupby, you can use direct assignment:

df['amount_4wk_rolling'] = amounts.droplevel(0)['Amount']
print(df)

# Output:
      ID       Date  Amount  amount_4wk_rolling
0  10001 2019-07-01      50                60.0
1  10001 2019-05-01      15                15.0
2  10001 2019-06-25      10                10.0
3  10001 2019-05-27      20                35.0
4  10002 2019-06-29      25                25.0
5  10002 2019-07-18      35                60.0
6  10002 2019-07-18      40               100.0
Corralien