R - Efficiency for calculating a column in a data frame

Question

I have a couple of columns in a data frame with times. I'm trying to calculate the difference in times in a new column, but I need to reset it to zero every time I encounter the start of a new pattern. Please see some sample data below.

Seq  ATime      RT
0    18:33:00   0
20   18:48:00   15
43   19:01:00   13
56   19:47:00   24
0    21:33:00   0
9    21:45:00   12
22   21:55:00   10
45   22:13:00   18
0    06:33:00   0
22   06:47:00   14
45   06:59:00   12
62   07:22:00   23
85   07:48:00   26

I'm using the following script to estimate the delta column. The Seq column is always increasing for each 'pattern'. In this sample each pattern's Seq starts from 0, but it may not be the case always.

dat_4$RT <- 0
for (i in 1:(NROW(dat_4$Seq)-1)) {
  if (dat_4$Seq[i+1] > dat_4$Seq[i]) {
    dat_4$RT[i+1] = (chron(times=dat_4$ATime[i+1]) - chron(times=dat_4$ATime[i]))*1440
  } else {
    dat_4$RT[i+1] = 0
  }
}

Although it works, it's not at all efficient. Sometimes the 'dat_4' dataframe will have roughly 4 million records and it takes almost 2.5 minutes just to process this step.

   user  system elapsed 
  96.86   54.07  150.99

Any suggestion on how I can make it more efficient?

Make sure your example is [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You code uses a `Seq` column but that doesn't seem to appear anywhere in the sample data. — MrFlick, Aug 21 '17 at 20:14
Your code also uses `ATime` and `DTime` columns which are not in sample data. — Eric Watt, Aug 21 '17 at 20:29
`ATime` and `DTime` are basically times, just like the `Time` column. But, I've updated the code and data. — Sujith, Aug 21 '17 at 20:34
You may use the [`cumsum()` idiom](https://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences) to create a grouping variable, and then use `ave` to calculate time `diff`erences within group: `ave(chron(times = d$Time), cumsum(c(1, diff(d$Seq) < 0)), FUN = function(x) c(0, diff(x)))` — Henrik, Aug 21 '17 at 21:09

score 0 · Answer 1 · answered Aug 21 '17 at 20:44

You can do this using the (built-in) difftime function, as well as lag and mutate from dplyr:

library(dplyr)
dat_4 %>%
  mutate(RT = as.numeric(difftime(ATime, lag(ATime, default = ATime[1]), units = "min")),
         RT = RT * (Seq > lag(Seq, default = 0)))

The first step computes the difference (in minutes) between each pair, with the first starting at zero. The second turns all the cases where Seq isn't increasing to 0.

Eric Watt · Accepted Answer · 2017-08-21T21:23:24.723

0

You can do this by first calculating RT for all rows, and then finding all of the rows that should be set to 0. This avoids the for loop, and likely could be faster.

dat_4$RT <- c(0, diff(chron(times=dat_4$ATime)) * 1440)
dat_4$RT[which(sign(diff(dat_4$Seq)) == -1) + 1] <- 0

The first line will do the diff using chron similar to how you did, but avoids doing this in a loop. The second line detects when Seq has decreased, and sets those rows to have a RT of zero.

edited Aug 21 '17 at 21:23

answered Aug 21 '17 at 21:03

Eric Watt

3,180
9
21

The run time improved significantly to 0.14 seconds. – Sujith Aug 22 '17 at 14:39

R - Efficiency for calculating a column in a data frame

2 Answers2