0

This is what my dataframe looks like. The two rightmost columns are my desired columns.

dt <- read.table(text='

    Name      ActivityType     ActivityDate  EmailLast21(Desired)     WebinarLast21(Desired)             
    John       Email            1/1/2014           1                  0  
    John       Webinar          1/5/2014           1                  1
    John       Webinar          1/20/2014          1                  2
    John       Webinar          3/25/2014          0                  1
    John       Email            4/1/2014           1                  1
    John       Email            4/20/2014          2                  0
    Tom        Email            1/1/2014           1                  0  
    Tom       Webinar           1/5/2014           1                  1
    Tom       Webinar           1/20/2014          1                  2
    Tom       Webinar           3/25/2014          0                  1
    Tom       Email             4/1/2014           1                  1
    Tom       Email             4/20/2014          2                  0

    ', header=T, row.names = NULL)

Based on a previous solution here:Events in last 21 days for every row by Name I am trying the following

require(data.table)
    setDT(dt)
    dt[, ActivityDate := as.Date(ActivityDate, format="%m/%d/%Y")]
    setkey(dt, Name, ActivityDate)

    roll_index <- function(x, types, roll=21) {
      lapply(types, function(type) {
        idx = x[ActivityType == type][x, roll=roll, which=TRUE]
        as.numeric(idx)
      })
    }
    dt[, c("Email_21", "Webinar_21") := roll_index(dt, c("Email", "Webinar"))]

But the output is not what I want as I used as.numeric function instead of something that will count the number of times webinar or email appears in the last 21 days window. Any help is much appreciated!

Community
  • 1
  • 1
gibbz00
  • 1,947
  • 1
  • 19
  • 31

2 Answers2

2

Cast email/webinar to separate variables.

dc <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate=length)

Then apply the answer below to get your rolling sum of each one by the 21 day window:

https://stackoverflow.com/a/24400600/2573061

Community
  • 1
  • 1
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
1

A method which uses only base R and seems relatively straightforward could look like:

# ensure that ActivityType is a factor
  dt$ActivityType <- factor(dt$ActivityType)   
# convert character dates to Date types
  dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")  
# ensure that dt is ordered by Name and ActivityDate
  dt <- dt[order(dt$Name, dt$ActivityDate),]
# for each ActivityDate and Name, count the number of instances within a 21 day window for each ActivityType
  dt <- cbind(dt, t(sapply(split(dt, list(dt$ActivityDate, dt$Name)), 
                         FUN=function(y) { z= (dt$Name == y$Name & dt$ActivityDate <= y$ActivityDate & dt$ActivityDate > (y$ActivityDate -21));
                                      sapply(levels(dt$ActivityType), FUN=function(x) sum(z & dt$ActivityType == x )) } )) )

This version is somewhat faster and a bit simpler than the previous version.

EXPLANATION

I'll try to answer your question by going through the steps in the last statement although this makes the answer a bit long. I thought of dt in your problem as providing two sets of overlappoing data. First the Name and ActivityDates columns provide the set of names and dates for which results are required. Second the Name, ActivityDates, and ActivityTypes columns provide the data for computing the results. So, the code does the following

  1. sapply with split sets up the loop over all Name-ActvityDates pairs. In ths case, split breaks dt into a list where each element of the list is a data frame containing a row of dt. The code uses sapply-split since it passes the row as a data frame with all columns to its FUN argument. Rather than use a named function from an R package, the code defines a function using the usual function definition format i.e. function(argument) { statements using arguments }. In this case, the function has no name and is called an anonymous function.
  2. Now that we have set up the loop over Name-AcivityDates pairs, we will think of dt in the second way described above, as the source of data used to compute the results required for each row of dt. The function function(y) where y is the name of the argument for this function does this. The first statement in function(y) identifies all rows of dt which match the Name and the ActivityDate range of the input row y (a single-row data frame) where y$Name is the Name column and y$ActivityDate is the ActivityDate for the ActivityDate range we are trying to match. The results of this matching process are stored in z as a logical vector where a value of TRUE in the i-th element of z indicates that the i-th row of dt matches y while a value of FALSE means no match. z may contain matches for multiple values of ActivityType but it's important for the performance of the code that finding the rows of dt which match range defined by ActivityDates only be done once per row of dt.
  3. The second statement of function(y) counts the number of matches in z for each value of ActivityType. sapply loops over the unique values of ActivityType (e.g. Email and webinar), passes each value to the anonymous function function(x), finds the rows in dt which match x, and forms a logical vector with z containing only the matches for that Name, AcivityType, and range of dates. z is available to function(x) since it was defined in function(y), the parent environment of function(x). sum acting on a logical vector counts the number of TRUE elements which gives the counts reported in the results.
  4. sapply returns a matrix with ActivityType as rows and Name-ActivityDates as columns so t is used to form the transpose which is then bound to dt using cbind.

I tried to make the code have good performance and be comprehensible. If futher performance improvements are needed, the base R code can be made a bit faster or this approach could be moved to use either dplyr or data.table which might also help. Let me know if you have any more questions.

WaltS
  • 5,410
  • 2
  • 18
  • 24
  • Thanks a lot for your answer! – gibbz00 Dec 15 '15 at 19:31
  • Hi WaltS, your solution is very elegant. I was wondering if you could kindly break down the last step. When we are doing dt$Name == y$Name, what is "y" here? Is it each element of the list "split(dt, list(dt$ActivityDate, dt$Name)"? In that case how does y$Name return anything? I had the same question about "dt$ActivityType == x". Kindly let me know. – gibbz00 Dec 15 '15 at 23:48
  • See explanation above. – WaltS Dec 16 '15 at 15:30
  • Thanks so much for such a detailed answer! It would be awesome if you can guide me regarding moving this approach to data.table. – gibbz00 Dec 22 '15 at 16:00
  • Just curious but did you compare the execution times of this base R approach versus the `data.table` version in your later question? – WaltS Dec 30 '15 at 13:50
  • data.table took about 1.87 hours. I don't remember exactly but base R was over 3 hours. – gibbz00 Dec 30 '15 at 16:12