Counting number of times Activity Types appear in the last 21 day Window

Question

This is what my dataframe looks like. The two rightmost columns are my desired columns.

dt <- read.table(text='

    Name      ActivityType     ActivityDate  EmailLast21(Desired)     WebinarLast21(Desired)             
    John       Email            1/1/2014           1                  0  
    John       Webinar          1/5/2014           1                  1
    John       Webinar          1/20/2014          1                  2
    John       Webinar          3/25/2014          0                  1
    John       Email            4/1/2014           1                  1
    John       Email            4/20/2014          2                  0
    Tom        Email            1/1/2014           1                  0  
    Tom       Webinar           1/5/2014           1                  1
    Tom       Webinar           1/20/2014          1                  2
    Tom       Webinar           3/25/2014          0                  1
    Tom       Email             4/1/2014           1                  1
    Tom       Email             4/20/2014          2                  0

    ', header=T, row.names = NULL)

Based on a previous solution here:Events in last 21 days for every row by Name I am trying the following

require(data.table)
    setDT(dt)
    dt[, ActivityDate := as.Date(ActivityDate, format="%m/%d/%Y")]
    setkey(dt, Name, ActivityDate)

    roll_index <- function(x, types, roll=21) {
      lapply(types, function(type) {
        idx = x[ActivityType == type][x, roll=roll, which=TRUE]
        as.numeric(idx)
      })
    }
    dt[, c("Email_21", "Webinar_21") := roll_index(dt, c("Email", "Webinar"))]

But the output is not what I want as I used as.numeric function instead of something that will count the number of times webinar or email appears in the last 21 days window. Any help is much appreciated!

have a look at `cumsum`. Make a column/index of one each time you need to count and run `cumsum` on that. *Then* apply it in a moving average type situation. — MikeRSpencer, Dec 14 '15 at 19:23
@MikeRSpencer Thanks. I would use cumsum if I did not have that rolling 21 day restriction. I dont know how to incorporate time window restriction in the context of using cumsum. — gibbz00, Dec 14 '15 at 19:31
Once you've got the first part working apply it to a subset based on the date of that line minus 21. — MikeRSpencer, Dec 14 '15 at 19:33

score 2 · Answer 1 · edited May 23 '17 at 12:23

2

Cast email/webinar to separate variables.

dc <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate=length)

Then apply the answer below to get your rolling sum of each one by the 21 day window:

https://stackoverflow.com/a/24400600/2573061

edited May 23 '17 at 12:23

Community

1
1

answered Dec 14 '15 at 20:01

C8H10N4O2

18,312
8
98
134

Thanks a lot for your answer! – gibbz00 Dec 14 '15 at 21:33

WaltS · Accepted Answer · 2015-12-16T15:29:56.800

A method which uses only base R and seems relatively straightforward could look like:

# ensure that ActivityType is a factor
  dt$ActivityType <- factor(dt$ActivityType)   
# convert character dates to Date types
  dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")  
# ensure that dt is ordered by Name and ActivityDate
  dt <- dt[order(dt$Name, dt$ActivityDate),]
# for each ActivityDate and Name, count the number of instances within a 21 day window for each ActivityType
  dt <- cbind(dt, t(sapply(split(dt, list(dt$ActivityDate, dt$Name)), 
                         FUN=function(y) { z= (dt$Name == y$Name & dt$ActivityDate <= y$ActivityDate & dt$ActivityDate > (y$ActivityDate -21));
                                      sapply(levels(dt$ActivityType), FUN=function(x) sum(z & dt$ActivityType == x )) } )) )

This version is somewhat faster and a bit simpler than the previous version.

EXPLANATION

I'll try to answer your question by going through the steps in the last statement although this makes the answer a bit long. I thought of dt in your problem as providing two sets of overlappoing data. First the Name and ActivityDates columns provide the set of names and dates for which results are required. Second the Name, ActivityDates, and ActivityTypes columns provide the data for computing the results. So, the code does the following

sapply with split sets up the loop over all Name-ActvityDates pairs. In ths case, split breaks dt into a list where each element of the list is a data frame containing a row of dt. The code uses sapply-split since it passes the row as a data frame with all columns to its FUN argument. Rather than use a named function from an R package, the code defines a function using the usual function definition format i.e. function(argument) { statements using arguments }. In this case, the function has no name and is called an anonymous function.
Now that we have set up the loop over Name-AcivityDates pairs, we will think of dt in the second way described above, as the source of data used to compute the results required for each row of dt. The function function(y) where y is the name of the argument for this function does this. The first statement in function(y) identifies all rows of dt which match the Name and the ActivityDate range of the input row y (a single-row data frame) where y$Name is the Name column and y$ActivityDate is the ActivityDate for the ActivityDate range we are trying to match. The results of this matching process are stored in z as a logical vector where a value of TRUE in the i-th element of z indicates that the i-th row of dt matches y while a value of FALSE means no match. z may contain matches for multiple values of ActivityType but it's important for the performance of the code that finding the rows of dt which match range defined by ActivityDates only be done once per row of dt.
The second statement of function(y) counts the number of matches in z for each value of ActivityType. sapply loops over the unique values of ActivityType (e.g. Email and webinar), passes each value to the anonymous function function(x), finds the rows in dt which match x, and forms a logical vector with z containing only the matches for that Name, AcivityType, and range of dates. z is available to function(x) since it was defined in function(y), the parent environment of function(x). sum acting on a logical vector counts the number of TRUE elements which gives the counts reported in the results.
sapply returns a matrix with ActivityType as rows and Name-ActivityDates as columns so t is used to form the transpose which is then bound to dt using cbind.

I tried to make the code have good performance and be comprehensible. If futher performance improvements are needed, the base R code can be made a bit faster or this approach could be moved to use either dplyr or data.table which might also help. Let me know if you have any more questions.

Hi WaltS, your solution is very elegant. I was wondering if you could kindly break down the last step. When we are doing dt$Name == y$Name, what is "y" here? Is it each element of the list "split(dt, list(dt$ActivityDate, dt$Name)"? In that case how does y$Name return anything? I had the same question about "dt$ActivityType == x". Kindly let me know. — gibbz00, Dec 15 '15 at 23:48
Thanks so much for such a detailed answer! It would be awesome if you can guide me regarding moving this approach to data.table. — gibbz00, Dec 22 '15 at 16:00
Just curious but did you compare the execution times of this base R approach versus the `data.table` version in your later question? — WaltS, Dec 30 '15 at 13:50
data.table took about 1.87 hours. I don't remember exactly but base R was over 3 hours. — gibbz00, Dec 30 '15 at 16:12

Counting number of times Activity Types appear in the last 21 day Window

2 Answers2