I have a bit of a unique question. I've tried a few different things which I'll detail after the problem itself.
The problem: For each user ID, I need to iterate through event dates and check if each date is within 30 days of the next date. I have 260,000 records, and a not-insignificant number of IDs only have a single entry. The data look like:
id | date1 | date2
1 | 2016-01-01 | 2016-02-12
and so on
I have tried:
- foreach (split out each ID's set of events, calculate, recombine; ran into memory issues).
- data.table, but I don't know enough to know if I exhausted this option.
briefly dplyr (namely:
mutate(time_btwn=abs(as.numeric(difftime(data$date,lag(data$date2,1),"days")))))
and I'm currently running a straight for loop that iterates through all rows. It is extremely slow and I wish I didn't have to do it. The code:
for ( i in 2:nrow(data) ){ if ( data$id[ i ] != data$id[ i - 1 ] ){ next } else { data$timebtwn[i] <- abs( as.numeric( difftime( data$date1[i], data$date2[ i - 1 ], "days" ) ) ) } }
I've looked into apply and lapply, but can't quite work out the function to plug into apply or lapply that will do what I need (i.e. for each entry in column1, check one row back in column2 and return the difference between the dates IF both rows have the same id).
Is there a faster way than a straight for loop (or a way using foreach) that is fast and not memory intensive?