
I have a very large data set (7 million rows) with approximately a hundred thousand different IDs (each ID is unique). I would like to create groups based on the difference between consecutive dates within each ID. I managed to write my own function (thanks to this post: Cumulative sum until maximum reached, then repeat from zero in the next row) and, using the great data.table package, I apply it to each ID. My function works fine, but it is a bit slow (11 minutes for 3.6 million rows).

I would like to know if there are other ways to tackle this problem (avoiding the loop, for instance) or ways to speed up my function (with data.table, etc.).

The approach I used is to sum the variable representing the difference between consecutive events (Difftime_secs in the code) until the running total reaches some value (either exactly that value or anything greater). At that point another variable (named group in the code) takes the value 1 and the summing restarts from zero. At the end, I take a cumulative sum of group to get my groups (variable group_less_6s). A small standalone illustration of this logic follows.
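
Here is that logic on a made-up vector of gaps (the numbers are invented purely for illustration, not taken from my data):

    # Made-up gap values in seconds, chosen only to show the reset behaviour
    diffs <- c(0, 1, 3, 1, 65, 4, 1, 6)
    current.sum <- 0
    group <- integer(length(diffs))
    for (i in seq_along(diffs)) {
      current.sum <- current.sum + diffs[i]
      if (current.sum >= 6) {  # threshold reached or exceeded: flag and restart
        group[i] <- 1L
        current.sum <- 0
      }
    }
    group              # 0 0 0 0 1 0 0 1
    cumsum(group) + 1  # 1 1 1 1 2 2 2 3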

This is the code I have used so far with a small example.

    library(data.table)
    set.seed(7)
    # Creating a vector of dates
    datetime <- as.POSIXct(c("2/5/15 16:28:38", "2/5/15 16:28:39", "2/5/15 16:28:42",
                             "2/5/15 16:28:43", "2/5/15 16:29:48", "2/5/15 16:29:52",
                             "2/5/15 16:29:53", "2/5/15 16:29:59", "2/5/15 16:30:58",
                             "2/5/15 16:30:59", "2/5/15 16:28:33", "2/5/15 16:28:35",
                             "2/5/15 16:28:48", "2/5/15 16:28:49", "2/5/15 16:29:53",
                             "2/5/15 16:29:58", "2/5/15 16:30:53", "2/5/15 16:40:59",
                             "2/5/15 16:41:00", "2/5/15 16:41:01"),
                           format = "%m/%d/%y %H:%M:%S")
    # Creating IDs
    ID <- c(rep(1, 10), rep(2, 10))
    # Creating the data frame
    df <- data.frame(ID, datetime)
    setDT(df)
    # Order by ID and datetime first, so the lagged differences follow time order
    df <- df[order(ID, datetime), ]
    # Creating a lagged difference in seconds between consecutive events, per ID
    df[, Difftime_secs := c(0, round(difftime(datetime[-1], datetime[-.N],
                                              units = "secs"), digits = 5)),
       by = ID]

    # Function to create groups when the accumulated gaps within an ID reach 6 seconds

    sequence <- function(dt) {
      dt <- as.data.frame(dt)
      # Temporary accumulator for the running sum of gaps
      current.sum <- 0
      # Binary flag variable named group, initialised to zero
      dt$group <- 0
      for (i in 1:nrow(dt)) {
        current.sum <- current.sum + dt[i, "Difftime_secs"]
        # Each time the running sum reaches 6, flag the row and restart the sum
        if (current.sum >= 6) {
          dt[i, "group"] <- 1
          current.sum <- 0
        }
      }
      # A cumulative sum of the flags gives the group number (+ 1 avoids a group zero)
      dt$group_less_6s <- cumsum(dt$group) + 1
      return(dt)
    }

    # Apply the function to every ID
    df_final <- df[, sequence(.SD), by = ID]

This is the output that I get

ID            datetime Difftime_secs group group_less_6s
 1 2015-02-05 16:28:38             0     0             1
 1 2015-02-05 16:28:39             1     0             1
 1 2015-02-05 16:28:42             3     0             1
 1 2015-02-05 16:28:43             1     0             1     <- End 1st grp ID 1
 1 2015-02-05 16:29:48            65     1             2
 1 2015-02-05 16:29:52             4     0             2
 1 2015-02-05 16:29:53             1     0             2     <- End 2nd grp ID 1
 1 2015-02-05 16:29:59             6     1             3
 1 2015-02-05 16:30:58            59     1             4  
 1 2015-02-05 16:30:59             1     0             4
 2 2015-02-05 16:28:33             0     0             1
 2 2015-02-05 16:28:35             2     0             1     <- End 1st grp ID 2
 2 2015-02-05 16:28:48            13     1             2
 2 2015-02-05 16:28:49             1     0             2
 2 2015-02-05 16:29:53            64     1             3
 2 2015-02-05 16:29:58             5     0             3
 2 2015-02-05 16:30:53            55     1             4
 2 2015-02-05 16:40:59           606     1             5
 2 2015-02-05 16:41:00             1     0             5
 2 2015-02-05 16:41:01             1     0             5     <- End 5th grp ID 2

How can I make it more efficient? Thanks a lot for your help.
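
One direction I have started to consider (only a sketch, not benchmarked on the full 7 million rows) is to run the same loop over a plain numeric vector instead of a data frame, since I suspect the per-row dt[i, ...] subsetting inside sequence() dominates the runtime; group_reset below is just an illustrative name:

    # Same reset logic as sequence(), but on a plain vector -- this avoids
    # the costly per-row data frame indexing inside the loop
    group_reset <- function(x, threshold = 6) {
      out <- integer(length(x))
      current.sum <- 0
      grp <- 1L
      for (i in seq_along(x)) {
        current.sum <- current.sum + x[i]
        if (current.sum >= threshold) {
          grp <- grp + 1L  # the row reaching the threshold starts a new group
          current.sum <- 0
        }
        out[i] <- grp
      }
      out
    }

    # Applied per ID; this should reproduce group_less_6s directly
    df[, group_less_6s := group_reset(Difftime_secs), by = ID]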

  • Did you look at the other solutions from the post you linked? Especially the final one seems to be more vectorized, thus having more speed. – Heroka Jan 20 '16 at 10:36
  • Were they not fast enough? Or could you not get them to work with your data? – Heroka Jan 20 '16 at 10:45
  • They didn't work on my data, and even in the comments there they mention that it is not an exact solution. – M.Barry Jan 20 '16 at 10:48
