1

This is a problem I've been trying to solve for a few days now. I think I may have to delete some data or something, but I'm honestly not sure. I have some data that looks like this:

email   Action  ActionType    TD    cnt     Date_Time
aaaa    Company trial          TD   1   10/12/14 19:17
aaaa    Task    Call           0    NA  10/13/14 17:00
bbbb    Task    Call           0    NA  12/9/14 16:17
bbbb    Task    Call           0    NA  12/9/14 16:17
bbbb    Task    Call           0    NA  12/10/14 16:31
bbbb    Task    Call           0    NA  12/12/14 16:45
bbbb    Company demo           TD   1   12/12/14 17:17
bbbb    Event   Demo           TD   2   2/9/15 15:09
cccc    Company trial          TD   1   8/18/14 14:28
cccc    Company demo           TD   2   8/20/14 13:21
cccc    Event   Demo           TD   3   2/9/15 15:08
dddd    Company trial          TD   1   12/14/14 0:09
eeee    Company demo           TD   1   8/27/14 21:57
eeee    Event   Demo           TD   2   2/9/15 15:08
eeee    Event   Demo           TD   3   2/9/15 15:08
ffff    Company trial          TD   1   3/19/14 21:15
gggg    Company trial          TD   1   7/30/14 18:06
hhhh    Company trial          TD   1   4/3/14 0:26
iiiii   Company trial          TD   1   5/29/14 20:10
iiiii   Task    Call           0    NA  5/29/14 22:01
jjjjj   Task    Call           0    NA  10/15/14 19:46
jjjjj   Company trial          TD   1   11/12/14 19:05
jjjjj   Task    Call           0    NA  11/12/14 19:16
jjjjj   Task    Call           0    NA  11/12/14 19:16
jjjjj   Task    Call           0    NA  11/12/14 19:31
jjjjj   Task    Call           0    NA  11/12/14 22:10
jjjjj   Task    Call           0    NA  11/13/14 19:46
jjjjj   Task    Call           0    NA  11/26/14 17:31
jjjjj   Task    Call           0    NA  11/26/14 17:31
jjjjj   Task    Call           0    NA  11/26/14 17:31
jjjjj   Task    Call           0    NA  11/26/14 17:31
kkkk    Company trial          TD   1   1/10/14 3:37
kkkk    Task    Call           0    NA  10/24/14 0:06
kkkk    Task    Call           0    NA  10/24/14 0:06
kkkk    Task    Call           0    NA  10/24/14 13:30
kkkk    Company trial          TD   2   10/27/14 12:45
kkkk    Task    Call           0    NA  1/23/15 14:31
kkkk    Task    Call           0    NA  1/26/15 21:15
kkkk    Company Trial          TD   3   1/27/15 21:15

The goal is to calculate the time difference between a demo or trial and previous calls. For example, I need to find the first demo/trial by email address, then look back calculate the difference between that demo/trial and the call before and then the difference between that call and the call before it and so forth.

I don't care about any calls after that first demo/trial unless after a few calls there is another demo/trial then the process should start again at that second demo/trial and calculate the difference between that second demo/trial and previous calls. I have the column "TD" to denote that row has a demo/trial. The "cnt" column is the number of that TD that occurs within that email address. For example if there are two trials back to back for the same email there will be a 1 and then 2 in the "cnt" column for that email address.

So basically I want the data to look like this:

email   Action  ActionType  TD  cnt     Date_Time   Time_Diff
aaaa    Company trial       TD  1   10/12/14 19:17  
aaaa    Task    Call        0   NA  10/13/14 17:00  
bbbb    Task    Call        0   NA  12/9/14 16:17   
bbbb    Task    Call        0   NA  12/9/14 16:17   0
bbbb    Task    Call        0   NA  12/10/14 16:31  1 d 14 m
bbbb    Task    Call        0   NA  12/12/14 16:45  2 d 14 m
bbbb    Company demo        TD  1   12/12/14 17:17  32 m
bbbb    Event   Demo        TD  2   2/9/15 15:09    
cccc    Company trial       TD  1   8/18/14 14:28   
cccc    Company demo        TD  2   8/20/14 13:21   
cccc    Event   Demo        TD  3   2/9/15 15:08    
dddd    Company trial       TD  1   12/14/14 0:09   
eeee    Company demo        TD  1   8/27/14 21:57   
eeee    Event   Demo        TD  2   2/9/15 15:08    
eeee    Event   Demo        TD  3   2/9/15 15:08    
ffff    Company trial       TD  1   3/19/14 21:15   
gggg    Company trial       TD  1   7/30/14 18:06   
hhhh    Company trial       TD  1   4/3/14 0:26 
iiiii   Company trial       TD  1   5/29/14 20:10   
iiiii   Task    Call        0   NA  5/29/14 22:01   
jjjjj   Task    Call        0   NA  10/15/14 19:46  
jjjjj   Company trial       TD  1   11/12/14 19:05  27 d, 23 h, 19 m
jjjjj   Task    Call        0   NA  11/12/14 19:16  
jjjjj   Task    Call        0   NA  11/12/14 19:16  
jjjjj   Task    Call        0   NA  11/12/14 19:31  
jjjjj   Task    Call        0   NA  11/12/14 22:10  
jjjjj   Task    Call        0   NA  11/13/14 19:46  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
jjjjj   Task    Call        0   NA  11/26/14 17:31  
kkkk    Company trial       TD  1   1/10/14 3:37    
kkkk    Task    Call        0   NA  10/24/14 0:06   
kkkk    Task    Call        0   NA  10/24/14 0:06   0
kkkk    Task    Call        0   NA  10/24/14 13:30  13 h, 24 m
kkkk    Company trial       TD  2   10/27/14 12:45  2 d, 23 h, 15 m
kkkk    Task    Call        0   NA  1/23/15 14:31   
kkkk    Task    Call        0   NA  1/26/15 21:15   3 d, 6 h, 44 m
kkkk    Company trial       TD  3   1/27/15 21:15   1 d

It doesn't really matter to me how the time difference is formatted.

Hillary
  • 785
  • 1
  • 6
  • 17
  • I have some code that will calculate this difference between all rows by email, but I'm not sure how to delete the differences for when there is a trial/demo and only calls afterwards. Or delete the differences between trials and demos within the same email group. – Hillary Mar 27 '15 at 23:05

1 Answers1

2

This kind of data manipulation might be easier to do in SQL. To do it in R, you'll want to use data.table over dataframe.

The following solution isn't the most elegant, but it should scale. Maybe it will give you an idea of how to do it without creating a bunch of new columns like I did. Worst case, you can put this in a loop until all TD's are covered.

Basically I just did a series of conditional statements across rows.

setkey(dt,email)

dt[ActionType=="Call",call_times:=Date_Time] #Field with call times only for taking mins
dt[TD=="TD",TDtime:=Date_Time] # same thing with TD
dt[,first_call:=min(call_times,na.rm=TRUE),by=email] # date time of first call for all records from an email
legit<-unique(dt[TDtime>first_call,email]) # only keeping records for emails where there was a TD after the first call
dt<-dt[.(legit)] 
dt<-dt[Date_Time>first_call|ActionType=="Call"] # also removing TDs that happened before first call
dt[,first_TD:=min(TDtime,na.rm=TRUE),by=email] # same with TD
dt[call_times>first_TD,call_times_2:=Date_Time] #find all calls after the first TD
dt[,second_call:=min(call_times_2,na.rm = TRUE),by=email] #find the time of the first call after the first TD
dt[TDtime>second_call,TDtimes_2:=Date_Time] #find all TDs after the second group of calls
dt[,second_TD:=min(TDtimes_2,na.rm=TRUE),by=email] #find the first TD after second group of calls starts

dt[Date_Time<=first_TD,call_group:=1] # group calls
dt[Date_Time>first_TD&Date_Time<=second_TD&second_TD!=Inf,call_group:=2] 

dt[!is.na(call_group),time_diff:=c(0,(diff(as.numeric(Date_Time))/3600)),by=.(email,call_group)] #find lagged differences between the call times within each call group. (in hours)
dt[!is.na(time_diff),.(email,ActionType,Date_Time,time_diff)] 

At the end you can calculate time differences however you want. I just did hours for simplicity.

  email ActionType           Date_Time   time_diff
1:  bbbb       Call 2014-12-09 16:17:00   0.0000000
2:  bbbb       Call 2014-12-09 16:17:00   0.0000000
3:  bbbb       Call 2014-12-10 16:31:00  24.2333333
4:  bbbb       Call 2014-12-12 16:45:00  48.2333333
5:  bbbb       demo 2014-12-12 17:17:00   0.5333333
6: jjjjj       Call 2014-10-15 19:46:00   0.0000000
7: jjjjj      trial 2014-11-12 19:05:00 672.3166667
8:  kkkk       Call 2014-10-24 00:06:00   0.0000000
9:  kkkk       Call 2014-10-24 00:06:00   0.0000000
10:  kkkk       Call 2014-10-24 13:30:00  13.4000000
11:  kkkk      trial 2014-10-27 12:45:00  71.2500000
12:  kkkk       Call 2015-01-23 14:31:00   0.0000000
13:  kkkk       Call 2015-01-26 21:15:00  78.7333333
14:  kkkk      Trial 2015-01-27 21:15:00  24.0000000
Z. Kimble
  • 118
  • 5