3

My dataframe looks like this. The two rightmost columns are my desired columns.

**Name      ActivityType     ActivityDate   Email(last 21 says)  Webinar(last21)**             
John       Email            1/1/2014        NA                   NA   
John       Webinar          1/5/2014        NA                   NA
John       Sale             1/20/2014       Yes                  Yes
John       Webinar          3/25/2014       NA                   NA
John       Sale             4/1/2014        No                   Yes
John       Sale             7/1/2014        No                   No
Tom        Email            1/1/2015        NA                   NA   
Tom        Webinar          1/5/2015        NA                   NA
Tom        Sale             1/20/2015       Yes                  Yes
Tom        Webinar          3/25/2015       NA                   NA
Tom        Sale              4/1/2015        No                   Yes
Tom        Sale              7/1/2015        No                   No                

I am just trying to create a yes/no variable that denotes whether there was an email or a webinar in the last 21 days for each "Sale" transaction. I was thinking(mock code) along the lines of using dplyr this way:

custlife %>% 
group_by(Name) %>% 
 mutate(Email(last21days)=lag(ifelse(ActivityType = "Email" & ActivityDate of email within (activity date of sale - 21),Yes,No)).

I am not sure of the way to implement this. Kindly help. Your help is sincerely appreciated!

gibbz00
  • 1,947
  • 1
  • 19
  • 31
  • if you had all days, for example 1/1/2014, 2/1/2014, 3/1/2014 ... then you could work on `cumsum(Email)` and `lag(x, k = 21)`. By the way, in context of above question, NA Email rows can be skipped. Other inefficient way would be to generate a set of days that had a preceding Email no more than 21 days earlier or the same day. Then do a join with original set. – Pafnucy May 25 '15 at 18:38
  • Why `Webinar` rows also have "`yes`" on them, aren't you checking just the `Sale` rows? – David Arenburg May 25 '15 at 18:49
  • @DavidArenburg You are right. I made an error while I was making the example. Thank you for pointing that out. – gibbz00 May 25 '15 at 21:38
  • where are the data?? – jangorecki May 25 '15 at 21:52
  • @jangorecki what do you mean? The data is in the question – David Arenburg May 25 '15 at 21:57
  • @DavidArenburg in OP question there is not data but the way data looks like ;) – jangorecki May 25 '15 at 22:02
  • @jangorecki I am very new to stackoverflow. Kindly let me know of an easy way to post the data in stackoverflow. – gibbz00 May 25 '15 at 22:43
  • @gibbz00 `?dput`, on your data: `dput(custlife[1:20])` – jangorecki May 25 '15 at 23:22

2 Answers2

5

Here's a possible data.table solution. Here I'm creating 2 temporary data sets- one for Sale and one for the rest of activity types and then joining between them by a rolling window of 21 while using by = .EACHI in order to check conditions in each join. Then, I'm joining the result to the original data set.

Convert the date column to Date class and key the data by Name and Date (for the final/rolling join)

library(data.table)
setkey(setDT(df)[, ActivityDate := as.IDate(ActivityDate, "%m/%d/%Y")], Name, ActivityDate)

Create 2 temporary data sets per each activity

Saletemp <- df[ActivityType == "Sale", .(Name, ActivityDate)]
Elsetemp <- df[ActivityType != "Sale", .(Name, ActivityDate, ActivityType)]

Join by a rolling window of 21 to the sales temporary data set while checking conditions

Saletemp[Elsetemp, `:=`(Email21 = as.logical(which(i.ActivityType == "Email")), 
                        Webinar21 = as.logical(which(i.ActivityType == "Webinar"))), 
         roll = -21, by = .EACHI]

Join everything back

df[Saletemp, `:=`(Email21 = i.Email21, Webinar21 = i.Webinar21)]
df
#     Name ActivityType ActivityDate Email21 Webinar21
#  1: John        Email   2014-01-01      NA        NA
#  2: John      Webinar   2014-01-05      NA        NA
#  3: John         Sale   2014-01-20    TRUE      TRUE
#  4: John      Webinar   2014-03-25      NA        NA
#  5: John         Sale   2014-04-01      NA      TRUE
#  6: John         Sale   2014-07-01      NA        NA
#  7:  Tom        Email   2015-01-01      NA        NA
#  8:  Tom      Webinar   2015-01-05      NA        NA
#  9:  Tom         Sale   2015-01-20    TRUE      TRUE
# 10:  Tom      Webinar   2015-03-25      NA        NA
# 11:  Tom         Sale   2015-04-01      NA      TRUE
# 12:  Tom         Sale   2015-07-01      NA        NA
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
2

Here is another option with base R:

df is first split according to Name and then, among each subset, for each Sale, it looks if there is an Email (Webinar) within 21 days from the Sale. Finally, the list is unsplit according to Name.
You just have to replace FALSE by no and TRUE by yes afterwards.

df_split <- split(df, df$Name)

df_split <- lapply(df_split, function(tab){
                                i_s <- which(tab[,2]=="Sale")
                                tab$Email21[i_s] <- sapply(tab[i_s, 3], function(d_s){any(tab[tab$ActivityType=="Email", 3] >= d_s-21)})
                                tab$Webinar21[i_s] <- sapply(tab[i_s, 3], function(d_s){any(tab[tab$ActivityType=="Webinar", 3] >= d_s-21)})
                                tab
                              })
df_res <- unsplit(df_split, df$Name)

df_res
#   Name ActivityType ActivityDate Email21 Webinar21
#1  John        Email   2014-01-01      NA        NA
#2  John      Webinar   2014-01-05      NA        NA
#3  John         Sale   2014-01-20    TRUE      TRUE
#4  John      Webinar   2014-03-25      NA        NA
#5  John         Sale   2014-04-01   FALSE      TRUE
#6  John         Sale   2014-07-01   FALSE     FALSE
#7   Tom        Email   2015-01-01      NA        NA
#8   Tom      Webinar   2015-01-05      NA        NA
#9   Tom         Sale   2015-01-20    TRUE      TRUE
#10  Tom      Webinar   2015-03-25      NA        NA
#11  Tom         Sale   2015-04-01   FALSE      TRUE
#12  Tom         Sale   2015-07-01   FALSE     FALSE

data

df <- structure(list(Name = c("John", "John", "John", "John", "John", 
"John", "Tom", "Tom", "Tom", "Tom", "Tom", "Tom"), ActivityType = c("Email", 
"Webinar", "Sale", "Webinar", "Sale", "Sale", "Email", "Webinar", 
"Sale", "Webinar", "Sale", "Sale"), ActivityDate = structure(c(16071, 
16075, 16090, 16154, 16161, 16252, 16436, 16440, 16455, 16519, 
16526, 16617), class = "Date")), .Names = c("Name", "ActivityType", 
"ActivityDate"), row.names = c(NA, -12L), index = structure(integer(0), ActivityType = c(1L, 
7L, 3L, 5L, 6L, 9L, 11L, 12L, 2L, 4L, 8L, 10L)), class = "data.frame")
Cath
  • 23,906
  • 5
  • 52
  • 86