I have a data frame containing a sorted POSIXct variable. Each row belongs to a category, and for each category level I want to compute the time difference between consecutive rows and store the result in a new variable. A reproducible example follows. The function below only generates sample data with random times for the purpose of this question.

random.time <- function(N, start, end) {
  st <- as.POSIXct(start)
  en <- as.POSIXct(end)
  dt <- as.numeric(difftime(en, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
  return(rt)
}

The code for simulating the problem is as below:

set.seed(123)
category <- sample(LETTERS[1:5], 20, replace=TRUE)
randtime <- random.time(20, '2015/06/01 08:00:00', '2015/06/01 18:00:00')
df <- data.frame(category, randtime)

The expected resulting data frame is as below:

>category randtime timediff (secs)
>A  2015-06-01 09:05:00 0
>A  2015-06-01 09:06:30 90
>A  2015-06-01 09:10:00 210
>B  2015-06-01 10:18:58 0
>B  2015-06-01 10:19:58 60
>C  2015-06-01 08:14:00 0
>C  2015-06-01 08:16:30 150

The first row of each subgroup has a timediff of 0 because it has no previous row. I was able to group by category and call the following function to calculate the differences, but could not get it to collate the final output for all category groups.

getTimeDiff <- function(x) {
  no_rows <- nrow(x)
  if(no_rows > 1) {
    for(i in 2:no_rows) {
      t <- x[i, "randtime"] - x[i-1, "randtime"]
    }
  }
}

I have been at this for two days now without luck so would greatly appreciate any help. Thanks.
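
For reference, the loop in `getTimeDiff()` overwrites `t` on every iteration and the function never returns a value, so there is nothing to collate afterwards. A minimal base R sketch of a repaired variant, assuming the goal is a data frame with an added `timediff` column; the `toy` data frame and the split/rbind step are illustrative, not part of the original attempt:

```r
# Repaired variant: returns its input with a new 'timediff' column (seconds)
getTimeDiff <- function(x) {
  x <- x[order(x$randtime), , drop = FALSE]          # sort within the group
  x$timediff <- c(0, diff(as.numeric(x$randtime)))   # first row gets 0
  x
}

# Hypothetical toy data (not the question's df)
toy <- data.frame(
  category = c("A", "B", "A", "B", "A"),
  randtime = as.POSIXct("2015-06-01 09:05:00", tz = "UTC") + c(0, 30, 90, 90, 300)
)

# Apply per category, then stack the pieces back into one data frame
pieces <- lapply(split(toy, toy$category, drop = TRUE), getTimeDiff)
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

The key difference from the original attempt is that each call returns a complete piece, which `do.call(rbind, ...)` can then combine.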

Jaap
Mntester

2 Answers

Try this:

library(dplyr)
df %>%
  arrange(category, randtime) %>%
  group_by(category) %>%
  mutate(diff = randtime - lag(randtime),
         diff_secs = as.numeric(diff, units = 'secs'))

#   category            randtime             diff   diff_secs
#     (fctr)              (time)           (dfft)       (dbl)
# 1        A 2015-06-01 11:10:54         NA hours          NA
# 2        A 2015-06-01 15:35:04   4.402785 hours   15850.027
# 3        A 2015-06-01 17:01:22   1.438395 hours    5178.222
# 4        B 2015-06-01 08:14:46         NA hours          NA
# 5        B 2015-06-01 16:53:43 518.955379 hours 1868239.364
# 6        B 2015-06-01 17:37:48  44.090950 hours  158727.420

You may also want to add replace(is.na(.), 0) to the chain.
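
One caveat about the output above: the B rows report 518.955379 hours and 1868239.364 seconds, whereas the base R answer below gives 31137.3227 seconds for the same row. The figures match minutes labeled as hours (518.955379 × 60 ≈ 31137.32). A likely cause, which is an assumption and not confirmed in this thread, is that subtracting POSIXct values lets R choose the unit separately for each group. A minimal sketch that sidesteps this by requesting seconds explicitly through `difftime()`; the `toy` data frame is hypothetical, and `dplyr::lag` is spelled out in case another `lag()` masks it:

```r
library(dplyr)

# Hypothetical toy data (not the question's df) with known gaps
toy <- data.frame(
  category = c("A", "A", "B", "B", "B"),
  randtime = as.POSIXct("2015-06-01 08:00:00", tz = "UTC") +
    c(0, 90, 30, 7230, 7290)
)

res <- toy %>%
  arrange(category, randtime) %>%
  group_by(category) %>%
  mutate(
    # fixed unit, so every group is measured in seconds
    diff_secs = as.numeric(difftime(randtime, dplyr::lag(randtime),
                                    units = "secs")),
    # first row of each group has no predecessor: use 0 instead of NA
    diff_secs = coalesce(diff_secs, 0)
  ) %>%
  ungroup()

print(res)
```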

JasonAizkalns
  • thanks for the quick response. This solves it partly in arranging the output as required. However the lag() returns the previous row's time as the timediff value and not the actual difference. – Mntester Oct 07 '15 at 18:28

In base R you can use:

# creating an ordered data.frame
df <- data.frame(category, randtime)
df <- df[order(df$category, df$randtime),]

# calculating the timedifference
# option 1:
df$tdiff <- unlist(tapply(df$randtime, INDEX = df$category,
                          FUN = function(x) c(0, `units<-`(diff(x), "secs"))))
# option 2:
df$tdiff <- unlist(tapply(df$randtime, INDEX = df$category,
                          FUN = function(x) c(0, diff(as.numeric(x)))))

which gives:

> df
   category            randtime      tdiff
6         A 2015-06-01 11:10:54     0.0000
15        A 2015-06-01 15:35:04 15850.0271
18        A 2015-06-01 17:01:22  5178.2223
1         B 2015-06-01 08:14:46     0.0000
17        B 2015-06-01 16:53:43 31137.3227
19        B 2015-06-01 17:37:48  2645.4570
3         C 2015-06-01 10:09:50     0.0000
7         C 2015-06-01 12:46:40  9409.9693
9         C 2015-06-01 13:56:29  4188.4578
10        C 2015-06-01 14:24:18  1669.1326
12        C 2015-06-01 14:54:25  1807.1447
14        C 2015-06-01 15:05:07   641.7068
2         D 2015-06-01 09:28:16     0.0000
13        D 2015-06-01 14:55:40 19644.8313
4         E 2015-06-01 10:18:58     0.0000
5         E 2015-06-01 10:53:29  2071.2223
8         E 2015-06-01 13:26:26  9176.6263
11        E 2015-06-01 14:33:25  4019.0319
16        E 2015-06-01 15:57:16  5031.4183
20        E 2015-06-01 17:56:33  7156.8849

If you want minutes or hours instead, use "mins" or "hours" in place of "secs" in option 1; with option 2, divide the result by 60 or 3600.
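
Both options above rely on `df` already being sorted by category, because `unlist(tapply(...))` returns the values in level order. A hedged alternative sketch using `ave()`, which writes each group's result back to the original row positions, so only the times within each group need to be in order; the `toy` data frame is hypothetical:

```r
# Hypothetical toy data: times ascend overall, categories are interleaved
toy <- data.frame(
  category = c("B", "A", "B", "A", "A"),
  randtime = as.POSIXct("2015-06-01 09:05:00", tz = "UTC") +
    c(0, 10, 60, 100, 310)
)

# ave() applies FUN per category and keeps the original row order
toy$tdiff <- ave(as.numeric(toy$randtime), toy$category,
                 FUN = function(x) c(0, diff(x)))

# Plain numeric seconds, so other units are a simple division
toy$tdiff_mins <- toy$tdiff / 60
print(toy)
```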


An alternative with the data.table package:

library(data.table)
# creating an ordered/keyed data.table
dt <- data.table(category, randtime, key = c("category", "randtime"))

# calculating the timedifference
# option 1:
dt[, tdiff := difftime(randtime, shift(randtime, fill=randtime[1L]), units="secs"), by=category]
# option 2:
dt[, tdiff := c(0, `units<-`(diff(randtime), "secs")), by = category]
# option 3:
dt[, tdiff := c(0, diff(as.numeric(randtime))), by = category]

which results in:

> dt
    category            randtime           tdiff
 1:        A 2015-06-01 11:10:54     0.0000 secs
 2:        A 2015-06-01 15:35:04 15850.0271 secs
 3:        A 2015-06-01 17:01:22  5178.2223 secs
 4:        B 2015-06-01 08:14:46     0.0000 secs
 5:        B 2015-06-01 16:53:43 31137.3227 secs
 6:        B 2015-06-01 17:37:48  2645.4570 secs
 7:        C 2015-06-01 10:09:50     0.0000 secs
 8:        C 2015-06-01 12:46:40  9409.9693 secs
 9:        C 2015-06-01 13:56:29  4188.4578 secs
10:        C 2015-06-01 14:24:18  1669.1326 secs
11:        C 2015-06-01 14:54:25  1807.1447 secs
12:        C 2015-06-01 15:05:07   641.7068 secs
13:        D 2015-06-01 09:28:16     0.0000 secs
14:        D 2015-06-01 14:55:40 19644.8313 secs
15:        E 2015-06-01 10:18:58     0.0000 secs
16:        E 2015-06-01 10:53:29  2071.2223 secs
17:        E 2015-06-01 13:26:26  9176.6263 secs
18:        E 2015-06-01 14:33:25  4019.0319 secs
19:        E 2015-06-01 15:57:16  5031.4183 secs
20:        E 2015-06-01 17:56:33  7156.8849 secs
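
As a compact check of the approach on data with known gaps, here is a minimal sketch that uses option 3 and then drops the first row per category, a variation on the `.I` idiom mentioned in the comments below. The `toy` table and the name `no_first` are hypothetical:

```r
library(data.table)

# Hypothetical toy data (not the question's data), deliberately unsorted
toy <- data.table(
  category = c("B", "A", "B", "A", "A"),
  randtime = as.POSIXct("2015-06-01 09:05:00", tz = "UTC") +
    c(90, 0, 30, 300, 90)
)
setkey(toy, category, randtime)   # sorts by category, then by time

# Difference to the previous row within each category, in plain seconds
toy[, tdiff := c(0, diff(as.numeric(randtime))), by = category]

# Keep everything except the first row of each category
no_first <- toy[toy[, .I[-1L], by = category]$V1]
print(toy)
print(no_first)
```

Note that option 3 stores plain numbers, while option 1 keeps a `difftime` column that prints its unit.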
Jaap
  • Yes, both these solutions work great. Thanks a lot, all of you. – Mntester Oct 07 '15 at 20:02
  • @Mntester extended the answer with a base R solution – Jaap Jan 14 '16 at 10:59
  • Wish I could have more than one upvote, such a nice, clean `data.table` solution! Note use `first_removed <- dt[dt[, -.I[1], by = category]$V1]` to remove the first row per category from the resulting data.table if needed (in my case I'm using the difference as a feature, so I don't want the zeros). – Bar Jul 22 '16 at 23:58
  • @Bar Another option to remove the first observation by group is `dt[, tail(.SD, -1), by = category]`, but on very large datasets the `.I` method wins on performance ([see here for a benchmark](http://stackoverflow.com/a/34753260/2204410)). – Jaap Jul 23 '16 at 07:15
  • To me this is easier to understand `dt[ , test:=c(0, diff(as.numeric(randtime))), category]`. I don't know how anyone could think of that `<-units()` trick. I'm impressed, but it's sad that you had to resort to that. – geneorama Jun 15 '18 at 23:59
  • @geneorama Agreed on that. I've updated my answer accordingly. – Jaap Jun 16 '18 at 05:46
  • @Jaap Thanks a lot Jaap, how could we use this if we have another binomial variable for `by` to process(along with category) - which is a toggle status for the time column and we'd need to sum the total time using this toggle column only - by differencing the successive toggles first and then adding cumulatively. – Keyshov Borate Jun 29 '18 at 13:47
  • @KeyshovBorate I'm not completely sure what you mean. Maybe post as a new question? You could link back to this one. Also make sure to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – Jaap Jun 30 '18 at 15:03
  • @Jaap Sure, here it is, Thanks! https://stackoverflow.com/questions/51118608/calculating-conditional-cumulative-time-in-r – Keyshov Borate Jun 30 '18 at 21:30