-1

I have a bit of a unique question. I've tried a few different things which I'll detail after the problem itself.

The problem: For each user ID, I need to iterate through event dates and check if each date is within 30 days of the next date. I have 260,000 records, and a not-insignificant number of IDs only have a single entry. The data look like:

id |    date1   |   date2
1  | 2016-01-01 | 2016-02-12

and so on

I have tried:

  • foreach (split out each ID's set of events, calculate, recombine; ran into memory issues).
  • data.table, but I don't know enough to know if I exhausted this option.
  • briefly dplyr (namely:

    mutate(time_btwn=abs(as.numeric(difftime(data$date,lag(data$date2,1),"days")))))
    
  • and I'm currently running a straight for loop that iterates through all rows. It is extremely slow and I wish I didn't have to do it. The code:

    for ( i in 2:nrow(data) ){
        if ( data$id[ i ] != data$id[ i - 1 ] ){
            next
        } else {
            data$timebtwn[i] <- abs( as.numeric( difftime( data$date1[i], data$date2[ i - 1 ], "days" ) ) )
        }
    }
    

I've looked into apply and lapply, but can't quite work out the function to plug into apply or lapply that will do what I need (i.e. for each entry in column1, check one row back in column2 and return the difference between the dates IF both rows have the same id).

Is there a faster way than a straight for loop (or a way using foreach) that is fast and not memory intensive?

user3933614
  • 115
  • 1
  • 5
  • 2
    Can you `dput()` an example of your data? – effel Jun 10 '16 at 18:26
  • It sounds like you have a lot of columns you need to operate on, and therefore need to melt to long so you can operate on a single column. Long in this case may be _very_ long, though, so workarounds may be appropriate. – alistaire Jun 10 '16 at 18:30
  • Not without interrupting the for loop I mentioned. The data is literally what I said, though. ID, date1, date2. Sorry :( If and when it finishes I'll update the post. – user3933614 Jun 10 '16 at 18:32
  • Do you need a boolean vector that says whether each row has a difference less than or equal to 30? You could easily create a new date_diff column and then just check whether it's less than or equal to 30. – lmkirvan Jun 10 '16 at 18:47
  • You should make the code in question reproducible. – jangorecki Jun 10 '16 at 18:48

1 Answers1

1

Since I do not have a sample dataset to work with, I had to make one up, and thus it is difficult to know what exactly you are after, but:

library(data.table)
library(lubridate)

# generate random date samples
latemail <- function(N, st="2012/01/01", et="2015/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et,st,unit="sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- as_date(st + ev)
}

set.seed(42)
mydat<-data.table(id = as.character(sample.int(1000, 10000, replace =T)),
                  date1 = as_date(latemail(10000)),
                  date2 = as_date(latemail(10000)))
setkey(mydat, id)

mydat[, .(timebtw = abs( as.numeric(difftime(date1, date2), "days" )),
          date1 = date1,
          date2 = date2), by = id]

#     id timebtw      date1      date2
#1:   1       4 2012-01-15 2012-01-11
#2:   1       2 2012-03-21 2012-03-19
#3:   1       9 2012-10-01 2012-10-10
#4:   1       1 2013-08-08 2013-08-09
#5:   1       9 2014-02-11 2014-02-02
#---                                  
#9996: 999       7 2014-10-28 2014-11-04
#9997: 999       9 2015-03-28 2015-04-06
#9998: 999       0 2015-07-22 2015-07-22
#9999: 999      10 2015-09-06 2015-09-16
#10000: 999       8 2015-10-03 2015-10-11

I got the date generating function from this nice post. Let me know if this is what you are trying to do. This example has 10,000 rows and 999 unique ids. To illustrate the speed:

system.time(
    mydat[, .(timebtw = abs( as.numeric(difftime(date1, date2), "days")),
              date1 = date1,
              date2 = date2), by = id])
#user  system elapsed 
#0.26    0.00    0.26 
Community
  • 1
  • 1
Bryan Goggin
  • 2,449
  • 15
  • 17
  • It isn't, I'm sorry :( I need to find the difference between date1 from row i and date2 from row i+1, and insert that into row i of the new column. – user3933614 Jun 13 '16 at 13:48
  • I worked it out! data.table actually isn't the way to go. All I needed was data <- group_by(data, id) %>% mutate(time_btwn = abs( as.numeric( difftime( date2, lead(date1,1), units = "days" ) ) ) ) And that's all. Wild! – user3933614 Jun 13 '16 at 14:29
  • Glad you got it sorted out. – Bryan Goggin Jun 13 '16 at 15:03