A method which uses only base R and seems relatively straightforward could look like:
# ensure that ActivityType is a factor
dt$ActivityType <- factor(dt$ActivityType)
# convert character dates to Date types
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
# ensure that dt is ordered by Name and ActivityDate
dt <- dt[order(dt$Name, dt$ActivityDate),]
# for each ActivityDate and Name, count the number of instances within a 21 day window for each ActivityType
dt <- cbind(dt, t(sapply(split(dt, list(dt$ActivityDate, dt$Name)),
FUN=function(y) { z= (dt$Name == y$Name & dt$ActivityDate <= y$ActivityDate & dt$ActivityDate > (y$ActivityDate -21));
sapply(levels(dt$ActivityType), FUN=function(x) sum(z & dt$ActivityType == x )) } )) )
This version is somewhat faster and a bit simpler than the previous version.
EXPLANATION
I'll try to answer your question by going through the steps in the last statement although this makes the answer a bit long. I thought of dt
in your problem as providing two sets of overlappoing data. First the Name
and ActivityDates
columns provide the set of names and dates for which results are required. Second the Name
, ActivityDates
, and ActivityTypes
columns provide the data for computing the results. So, the code does the following
sapply
with split
sets up the loop over all Name-ActvityDates
pairs. In ths case, split
breaks dt
into a list where each element of the list is a data frame containing a row of dt
. The code uses sapply-split
since it passes the row as a data frame with all columns to its FUN
argument. Rather than use a named function from an R package, the code defines a function using the usual function definition format i.e. function(argument) { statements using arguments }. In this case, the function has no name and is called an anonymous function.
- Now that we have set up the loop over
Name-AcivityDates
pairs, we will think of dt
in the second way described above, as the source of data used to compute the results required for each row of dt
. The function function(y)
where y
is the name of the argument for this function does this. The first statement in function(y)
identifies all rows of dt
which match the Name
and the ActivityDate
range of the input row y
(a single-row data frame) where y$Name
is the Name
column and y$ActivityDate
is the ActivityDate
for the ActivityDate
range we are trying to match. The results of this matching process are stored in z
as a logical vector where a value of TRUE
in the i-th
element of z
indicates that the i-th
row of dt
matches y
while a value of FALSE
means no match. z
may contain matches for multiple values of ActivityType
but it's important for the performance of the code that finding the rows of dt
which match range defined by ActivityDates
only be done once per row of dt
.
- The second statement of
function(y)
counts the number of matches in z
for each value of ActivityType
. sapply
loops over the unique values of ActivityType
(e.g. Email and webinar), passes each value to the anonymous function function(x)
, finds the rows in dt
which match x
, and forms a logical vector with z
containing only the matches for that Name
, AcivityType
, and range of dates. z
is available to function(x)
since it was defined in function(y)
, the parent environment of function(x)
. sum
acting on a logical vector counts the number of TRUE
elements which gives the counts reported in the results.
sapply
returns a matrix with ActivityType
as rows and Name-ActivityDates
as columns so t
is used to form the transpose which is then bound to dt
using cbind
.
I tried to make the code have good performance and be comprehensible. If futher performance improvements are needed, the base R code can be made a bit faster or this approach could be moved to use either dplyr
or data.table
which might also help. Let me know if you have any more questions.