
I'm having trouble working out a rolling count of transactions for each individual buyer in a dataset structured as follows:

userID  itemID      transaction_ts
3229    4493320     2016-01-02 14:55:00
3229    4492492     2016-01-02 14:57:02
3229    4496756     2016-01-04 09:01:18
3229    4493673     2016-01-04 09:11:10
3229    4497531     2016-01-04 11:05:25
3229    4495006     2016-01-05 07:25:11
4330    4500695     2016-01-02 09:17:21
4330    4500656     2016-01-03 09:19:28
4330    4503087     2016-01-04 07:42:15
4330    4501846     2016-01-04 08:55:24
4330    4504105     2016-01-04 09:59:35

Ideally, the output would look like the following for a rolling transaction count window of e.g. 24 hours:

userID  itemID      transaction_ts        rolling_count
3229    4493320     2016-01-02 14:55:00         1
3229    4492492     2016-01-02 14:57:02         2
3229    4496756     2016-01-04 09:01:18         1
3229    4493673     2016-01-04 09:11:10         2
3229    4497531     2016-01-04 11:05:25         3
3229    4495006     2016-01-05 07:25:11         4
4330    4500695     2016-01-02 09:17:21         1
4330    4500656     2016-01-03 09:19:28         1
4330    4503087     2016-01-04 07:42:15         2 
4330    4501846     2016-01-04 08:55:24         3
4330    4504105     2016-01-04 09:59:35         3

There is an excellent answer to a similar problem here: pandas rolling sum of last five minutes

However, that answer depends solely on the timestamp field, whereas here the rolling count must also reset to 1 whenever a row's user differs from that of the row above. A solution via slicing is possible, but given the size of this dataset (potentially 1m+ rows) it is not feasible.

Crucially, the window should reflect the 24-hour period prior to the transaction_ts of the respective row, which is why I think a custom df.apply or rolling-window method is appropriate; I just can't figure out how to make it conditional on the userID.
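
For reference, the sample frame above can be built directly in pandas so the problem is reproducible (a sketch; column names match the table):

import pandas as pd

df = pd.DataFrame({
    'userID': [3229] * 6 + [4330] * 5,
    'itemID': [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    'transaction_ts': pd.to_datetime([
        '2016-01-02 14:55:00', '2016-01-02 14:57:02', '2016-01-04 09:01:18',
        '2016-01-04 09:11:10', '2016-01-04 11:05:25', '2016-01-05 07:25:11',
        '2016-01-02 09:17:21', '2016-01-03 09:19:28', '2016-01-04 07:42:15',
        '2016-01-04 08:55:24', '2016-01-04 09:59:35']),
})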

tompiler
  • You can use `df.groupby('userID')` and then apply (using e.g. `.transform()`) your custom rolling function. Aggregation/transformation can still take a while with a large data frame. BTW, are you going to resample your df with 1 min frequency? – ptrj May 13 '16 at 22:48
  • You're right, I think the answer must involve a df.groupby('userID') and some custom function like that in the linked solution. I think resampling to 1min would just make the dataframe humongous, probably better to resample to 1 day and see what I can do. I kinda wanted it to work off each individual transaction ts as opposed to some arbitrary time interval chosen for computational brevity though. Thanks! – tompiler May 15 '16 at 09:00
  • I just don't know how to upsample without distorting the data. But look at my answer for a different approach. – ptrj May 15 '16 at 16:02

2 Answers


A part of the solution (a rolling cumsum) may already be here. (I only changed the type of lag):

import numpy as np
import pandas as pd
from datetime import timedelta

def msum(s, lag):
    # start of each row's look-back window (lag days before the row's timestamp)
    lag = s.index - timedelta(days=lag)
    # position of the first row that falls inside each window
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    # windowed sum = cumulative sum at the row minus cumulative sum just before the window start
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)

The function requires the index to be of datetime type. Moreover, within each userID group the index should already be sorted (as it is in your example).

df = df.set_index('transaction_ts')
df['rolling_count'] = 1  # each transaction counts as one
# sum the ones over the trailing 1-day (24-hour) window, separately per user
df['rolling_count'] = df.groupby('userID', sort=False)['rolling_count'].transform(lambda x: msum(x, 1))

The groupby option sort=False may give some speed-up; it only controls whether the group keys are sorted.
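
As an aside, newer pandas versions (0.19+) can express the same windowed sum with a time-based rolling window, under the same assumptions (datetime index, sorted within each group). Note the boundary differs slightly: rolling excludes a row exactly 24 hours earlier, whereas msum includes it. A sketch:

# with df indexed by transaction_ts and rolling_count initialised to 1 as above
df['rolling_count'] = (df.groupby('userID', sort=False)['rolling_count']
                         .transform(lambda x: x.rolling('24H').sum()))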

ptrj
  • how would you tweak this method to get the number of unique values within a window instead of the cumulative sum? – Lauren May 10 '17 at 01:08
  • this might answer my question: df.groupby('UserID', sort=False)['column_name'].transform(lambda x : x.rolling('3D').apply(lambda x: len(np.unique(x)))) – Lauren May 11 '17 at 20:28

I managed to get a solution, which works on the test set at least. ptrj got there first though! The first solution to this question, Pandas Rolling Computations on Sliding Windows (Unevenly spaced), helped out a lot.

As ptrj pointed out earlier, using df.groupby('userID') is the key.

import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.read_excel('velocity.xlsx')  # reading dataframe in
df = df.sort_values(['userID', 'transaction_ts'])
df = df.reset_index(drop=True)  # ensure index is sorted according to userID|transaction_ts
df['ones'] = 1  # helper column: each transaction counts as one

def add_rolling_count(x, number_of_hours):
    # start of each row's look-back window
    x['lag'] = x['transaction_ts'] - timedelta(hours=number_of_hours)
    # position of the first transaction inside each window
    inds = np.searchsorted(np.array(x['transaction_ts'].astype(np.int64)),
                           np.array(x['lag'].astype(np.int64)))
    cs = x['ones'].reset_index(drop=True).cumsum()
    # rows from the window start up to and including the current row
    x['count'] = cs.values - cs[inds].values + 1
    return x

df = df.groupby('userID').apply(lambda x: add_rolling_count(x, 24))
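
If the helper columns are not needed afterwards they can be dropped; on the sample data the resulting count column should match the rolling_count column shown in the question. A small follow-up sketch:

# drop the helper columns added along the way
df = df.drop(['ones', 'lag'], axis=1)
print(df)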
tompiler