
I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, i.e., evaluating a 'type' field, but I'm not there just yet. I've written something that gets the job done, but I feel there could be a more direct way of arriving at the desired result.

My pandas data frame currently looks like this, with the desired output shown in the last column, 'rolling_sales_180'.

    name       date  amount  rolling_sales_180
0  David 2015-01-01     100              100.0
1  David 2015-01-05     500              600.0
2  David 2015-05-30      50              650.0
3  David 2015-07-25      50              100.0
4   Ryan 2014-01-04     100              100.0
5   Ryan 2015-01-19     500              500.0
6   Ryan 2016-03-31      50               50.0
7    Joe 2015-07-01     100              100.0
8    Joe 2015-09-09     500              600.0
9    Joe 2015-10-15      50              650.0

My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.

import pandas as pd
import numpy as np 

def trans_date_to_dist_matrix(date_col):  # builds a pairwise distance matrix (in days)
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
        del tmp

    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)


def lower_tri(x_col, date_col, win):  # x_col = column to roll-sum, date_col = dates, win = time window in days
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # keep only distances within the window
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(bool))  # lt = lower triangle, so each row sums only the current and earlier dates
    lt[lt >= 0.0] = 1.0  # convert kept distances to 1 so events on the evaluated day are counted too
    lt = lt.fillna(0)  # replace NaN with 0 for the multiplication below
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()


def flatten(x):
    try:  # handles doubly-nested lists
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except TypeError:  # falls back for singly-nested lists
        return [v for sl in x for v in sl]


data = [
['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]

list_of_vals = []

dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
    lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))

new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data

print(dates_df)
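
As an aside, the nested loops in trans_date_to_dist_matrix can also be written with numpy broadcasting. A minimal sketch of a vectorized equivalent (it reuses the imports above and assumes the column has already been through pd.to_datetime):

def trans_date_to_dist_matrix_vec(date_col):
    # Broadcast the pairwise date differences in one shot instead of looping.
    d = date_col.values  # numpy datetime64[ns] array
    days = np.abs((d[:, None] - d[None, :]) / np.timedelta64(1, 'D'))
    return pd.DataFrame(days, index=date_col.values, columns=date_col.values)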

Your time and feedback are appreciated.

  • Are you sure your sample output for 'rolling_sales_180' is correct? That column is supposed to be a 180 rolling sum, right? All of the dates for Ryan are over a year apart, but they're still being summed? And all dates for Joe are within 180 days and aren't being summed? Did you somehow switch the two? – root Jan 13 '17 at 20:41
  • @root - I may have transposed incorrectly - I apologize. – rs311 Jan 13 '17 at 20:56
  • No worries, just making sure my understanding of the problem is correct. – root Jan 13 '17 at 20:56
  • @JohnE - I had done some research related to groupby and rolling, but everything seemed related to time-series data with even sequences. I hadn't seen anything like the solution suggested below. – rs311 Jan 13 '17 at 21:01
  • The rolling_sales_180 values for Ryan should be 100, 500 & 50. – gibbz00 Mar 15 '18 at 17:49

1 Answer


Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:

def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')

The resulting output:

    name       date  amount  rolling_sales_180
0  David 2015-01-01     100              100.0
1  David 2015-01-05     500              600.0
2  David 2015-05-30      50              650.0
3  David 2015-07-25      50              100.0
4   Ryan 2014-01-04     100              100.0
5   Ryan 2015-01-19     500              500.0
6   Ryan 2016-03-31      50               50.0
7    Joe 2015-07-01     100              100.0
8    Joe 2015-09-09     500              600.0
9    Joe 2015-10-15      50              650.0
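
Since the question mentions eventually adding conditions, such as evaluating a 'type' field, one way to layer that on is to zero out non-qualifying amounts before taking the rolling sum. This is only a sketch: the 'type' column and the value 'sale' are hypothetical and not part of the sample data.

def get_conditional_rolling_amount(grp, freq, col):
    return grp.rolling(freq, on='date')[col].sum()

# 'type' == 'sale' is a placeholder condition; non-qualifying rows contribute
# 0 to the 180-day window, but every row still receives a result.
df['sale_amount'] = df['amount'].where(df['type'] == 'sale', 0)
df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_conditional_rolling_amount, '180D', 'sale_amount')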
  • this is a great solution. thank you for it and insight into the time-aware rolling. I've used the rolling method in the past, but never in this way. – rs311 Jan 13 '17 at 21:03
  • mutating inside of an applied function is really non idiomatic – Jeff Jan 13 '17 at 21:10
  • you can also use a more direct syntax: http://pandas-docs.github.io/pandas-docs-travis/groupby.html#new-syntax-to-window-and-resample-operations – Jeff Jan 13 '17 at 21:11
  • specifying default parameters to .groupby is a bit distracting – Jeff Jan 13 '17 at 21:13
  • @Jeff: I tried using the direct method: `df.groupby('name').rolling('180D', on='date')['amount'].sum()` but get an error `ValueError: date must be monotonic`. Is that expected? Or is my syntax wrong? I realize the dates as a whole aren't monotonic, but within each group they are. Sorting by date first doesn't seem to help. – root Jan 13 '17 at 21:14
  • hmm, I think that is a bug; we don't need to validate when grouping. – Jeff Jan 13 '17 at 21:19
  • Thanks, beat me to it! – root Jan 13 '17 at 21:27
  • what a great answer – user88484 Nov 16 '22 at 08:33
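
Following up on the comment thread above, the more direct groupby-rolling syntax Jeff mentions would look like the sketch below, assuming a pandas release in which the "date must be monotonic" validation bug discussed in the comments has been fixed:

rolled = df.sort_values('date') \
           .groupby('name') \
           .rolling('180D', on='date')['amount'] \
           .sum()
df['rolling_sales_180'] = rolled.reset_index(level=0, drop=True)  # drop the 'name' level to realign with df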