R: Rolling/sliding Window and Distinct count in R for sliding number of days

Question

I've following dataset

set.seed(1)
transaction_date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/02/01'), by="day"), 24)
set.seed(1)
df <- data.frame("categ" = paste0("Categ",rep(1:2,12)), "prod" = sample(paste0("Prod",rep(seq(1:3),8))), customer_id = paste0("customer ",seq(1:24)),transaction_date=transaction_date)
df_ordered <- df[order(df$cate,df$prod,df$transaction_date,df$customer_id),]
df_ordered

categ  prod customer_id transaction_date
1  Categ1 Prod1  customer 1       2016-01-09
3  Categ1 Prod1  customer 3       2016-01-18
19 Categ1 Prod1 customer 19       2016-01-28
7  Categ1 Prod1  customer 7       2016-01-29
5  Categ1 Prod2  customer 5       2016-01-06
23 Categ1 Prod2 customer 23       2016-01-07
13 Categ1 Prod2 customer 13       2016-01-14
9  Categ1 Prod2  customer 9       2016-01-16
15 Categ1 Prod2 customer 15       2016-01-20
21 Categ1 Prod2 customer 21       2016-01-24
11 Categ1 Prod3 customer 11       2016-01-05
17 Categ1 Prod3 customer 17       2016-01-31
10 Categ2 Prod1 customer 10       2016-01-02
20 Categ2 Prod1 customer 20       2016-01-11
24 Categ2 Prod1 customer 24       2016-01-23
16 Categ2 Prod1 customer 16       2016-02-01
12 Categ2 Prod2 customer 12       2016-01-04
4  Categ2 Prod2  customer 4       2016-01-27
22 Categ2 Prod3 customer 22       2016-01-03
14 Categ2 Prod3 customer 14       2016-01-08
2  Categ2 Prod3  customer 2       2016-01-12
18 Categ2 Prod3 customer 18       2016-01-15
8  Categ2 Prod3  customer 8       2016-01-17
6  Categ2 Prod3  customer 6       2016-01-25

I've to do count over unique customer in a window of 12 days from the first(minimum) observed transaction_date for a group on categ, prod.

Sliding of window for 12 days ahead of current transaction date and count of all transaction coming in that bucket. Below is the output i'm trying to create. I want to avoid for loop for this task.

Possible duplicate of [relative windowed running sum through data.table non-equi join](http://stackoverflow.com/questions/41007099/relative-windowed-running-sum-through-data-table-non-equi-join) — ExperimenteR, Dec 16 '16 at 12:08

mpjdem · Accepted Answer · 2016-12-16T12:36:05.603

Using dplyr and rollapply from zoo this can be achieved. First we fill in all missing dates for all groups so we have a contiguous series, using expand.grid and merge. Then we group by category and product, arrange by date, and apply a rolling window to the values in customer ids. The function we define to be applied at each step takes the length of the vector of unique values, with NAs removed. At the end we filter out the added dates again, where customer_id was not available.

library(dplyr)
library(zoo)

set.seed(1)
transaction_date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/02/01'), by="day"), 24)
set.seed(1)
df <- data.frame("categ" = paste0("Categ",rep(1:2,12)), "prod" = sample(paste0("Prod",rep(seq(1:3),8))), customer_id = paste0("customer ",seq(1:24)),transaction_date=transaction_date)

all_combinations <- expand.grid(categ=unique(df$categ), 
        prod=unique(df$prod), 
        transaction_date=seq(min(df$transaction_date), max(df$transaction_date), by="day"))

df <- merge(df, all_combinations, by=c('categ','prod','transaction_date'), all=TRUE)

res <- df %>% 
       group_by(categ, prod) %>% 
       arrange(transaction_date) %>% 
       mutate(ucust=rollapply(customer_id, width=12, FUN=function(x) length(unique(x[!is.na(x)])), partial=TRUE, align='left')) %>%
       filter(!is.na(customer_id))

Sorry I was a bit too quick there; existing dates were duplicated. I corrected it now. — mpjdem, Dec 16 '16 at 12:38

R: Rolling/sliding Window and Distinct count in R for sliding number of days

1 Answers1