I have a very large data set (7 million rows) with approximately a hundred thousand distinct IDs. I would like to create groups based on the difference between consecutive dates within each ID. I managed to write my own function (thanks to this post: Cumulative sum until maximum reached, then repeat from zero in the next row), and using the great data.table package I apply it to each ID. My function works fine but it is a bit slow (11 minutes for 3.6 million rows).
I would like to know if there are other ways to tackle this problem (avoiding the loop, for instance) or to speed up my function (with data.table, etc.).
The approach I used is to sum the variable representing the difference between consecutive events (Difftime_secs in the code) until the running sum reaches some value (either exactly that value or anything greater), at which point another variable (named group in the code) is set to 1 and the summing restarts from zero. At the end, I take a cumulative sum of the variable group to get my groups (variable group_less_6s).
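To make the logic concrete before the full code, here is a minimal sketch of that reset-and-flag idea on a plain numeric vector (the gap values are the first rows of ID 1 in the example below):

gaps <- c(0, 1, 3, 1, 65)          # differences between consecutive events
flag <- integer(length(gaps))
current <- 0
for (i in seq_along(gaps)) {
  current <- current + gaps[i]
  if (current >= 6) {              # threshold reached or exceeded
    flag[i] <- 1                   # mark the start of a new group
    current <- 0                   # restart the running sum
  }
}
cumsum(flag) + 1                   # group labels: 1 1 1 1 2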
This is the code I have used so far with a small example.
library(data.table)
set.seed(7)

# Creating a vector of dates
datetime <- as.POSIXct(c("2/5/15 16:28:38", "2/5/15 16:28:39", "2/5/15 16:28:42",
                         "2/5/15 16:28:43", "2/5/15 16:29:48", "2/5/15 16:29:52",
                         "2/5/15 16:29:53", "2/5/15 16:29:59", "2/5/15 16:30:58",
                         "2/5/15 16:30:59", "2/5/15 16:28:33", "2/5/15 16:28:35",
                         "2/5/15 16:28:48", "2/5/15 16:28:49", "2/5/15 16:29:53",
                         "2/5/15 16:29:58", "2/5/15 16:30:53", "2/5/15 16:40:59",
                         "2/5/15 16:41:00", "2/5/15 16:41:01"),
                       format = "%m/%d/%y %H:%M:%S")

# Creating IDs
ID <- c(rep(1, 10), rep(2, 10))

# Creating the data frame
df <- data.frame(ID, datetime)

# Order by ID and datetime first, so the lagged differences are taken
# between truly consecutive events
setDT(df)
df <- df[order(ID, datetime), ]

# Lagged difference in seconds within each ID (0 for the first row of an ID)
df[, Difftime_secs := c(0, round(difftime(datetime[-1], datetime[-.N],
                                          units = "secs"), digits = 5)), by = ID]

# Function to flag a new group each time the accumulated difference
# reaches 6 seconds within an ID
sequence <- function(dt) {
  dt <- as.data.frame(dt)
  # Temporary variable holding the running sum
  current.sum <- 0
  # Binary variable, set to 1 on the row where the threshold is reached
  dt$group <- 0
  for (i in 1:nrow(dt)) {
    current.sum <- current.sum + dt[i, "Difftime_secs"]
    # Each time the condition is met, flag the row and restart the sum
    if (current.sum >= 6) {
      dt[i, "group"] <- 1
      current.sum <- 0
    }
  }
  # Cumulative sum of the flags gives the group label (+1 just to avoid a group zero)
  dt$group_less_6s <- cumsum(dt$group) + 1
  return(dt)
}

# Apply the function to every ID
df_final <- df[, sequence(.SD), by = ID]
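The 11-minute figure quoted above can be measured with a plain system.time() call; this is only an illustration, with df standing in for the full 3.6-million-row table:

# Illustrative timing only; here df would be the full data set
system.time(df_final <- df[, sequence(.SD), by = ID])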
This is the output that I get
ID            datetime Difftime_secs group group_less_6s
 1 2015-02-05 16:28:38             0     0             1
 1 2015-02-05 16:28:39             1     0             1
 1 2015-02-05 16:28:42             3     0             1
 1 2015-02-05 16:28:43             1     0             1  <- end 1st grp, ID 1
 1 2015-02-05 16:29:48            65     1             2
 1 2015-02-05 16:29:52             4     0             2
 1 2015-02-05 16:29:53             1     0             2  <- end 2nd grp, ID 1
 1 2015-02-05 16:29:59             6     1             3
 1 2015-02-05 16:30:58            59     1             4
 1 2015-02-05 16:30:59             1     0             4
 2 2015-02-05 16:28:33             0     0             1
 2 2015-02-05 16:28:35             2     0             1  <- end 1st grp, ID 2
 2 2015-02-05 16:28:48            13     1             2
 2 2015-02-05 16:28:49             1     0             2
 2 2015-02-05 16:29:53            64     1             3
 2 2015-02-05 16:29:58             5     0             3
 2 2015-02-05 16:30:53            55     1             4
 2 2015-02-05 16:40:59           606     1             5
 2 2015-02-05 16:41:00             1     0             5
 2 2015-02-05 16:41:01             1     0             5  <- end 5th grp, ID 2
How can I make it more efficient? Thanks a lot for your help.