Say I have a data.table that looks as follows:

dt = data.table(group = c(1,1,1,2,2,2,3,3,3),time = c("2016-03-09T08:31:00-05:00","2016-03-08T11:31:00-05:00","2016-03-06T08:31:00-05:00",
                                                "2016-04-04T23:28:00-04:00","2016-04-10T23:28:00-04:00","2016-04-09T23:28:00-04:00",
                                                "2016-05-11T19:52:00-04:00","2016-05-10T20:52:00-04:00","2016-04-11T19:52:00-04:00"))

dt
   group                      time
1:     1 2016-03-09T08:31:00-05:00
2:     1 2016-03-08T11:31:00-05:00
3:     1 2016-03-06T08:31:00-05:00
4:     2 2016-04-04T23:28:00-04:00
5:     2 2016-04-10T23:28:00-04:00
6:     2 2016-04-09T23:28:00-04:00
7:     3 2016-05-11T19:52:00-04:00
8:     3 2016-05-10T20:52:00-04:00
9:     3 2016-04-11T19:52:00-04:00

For each group in this data.table, I want to retain only the observations that fall within 24 hours of the most recent timestamp in that group. I cooked up a nasty solution for this, but it's not nearly as fast as I need it to be on large datasets.

library(lubridate)
set(dt,j = "time",value = ymd_hms(dt[["time"]]))
dt[,.(mostRecent = max(time),time),by = group][
  time > (mostRecent - days(1)),.(group,time)]

   group                time
1:     1 2016-03-09 13:31:00
2:     1 2016-03-08 16:31:00
3:     2 2016-04-11 03:28:00
4:     3 2016-05-11 23:52:00
5:     3 2016-05-11 00:52:00

Does anyone have tips on how to accomplish this more elegantly and/or faster?

Boudewijn Aasman

3 Answers

First, put the thresholds in a table:

thresh_dt = dt[, .(time = max(time)), by=group][, time := time - 24*60*60][]

The max is taken separately from subtracting the day's worth of seconds to take advantage of the "GForce" optimized max. Also see ?datatable.optimize.
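
A quick way to confirm that GForce actually kicks in (a sketch, using the verbose argument of data.table's [) is to look for the "GForce optimized j" message:

# With verbose = TRUE, data.table reports whether it rewrote j to use the
# internally optimized gmax().
dt[, .(time = max(time)), by = group, verbose = TRUE]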

Next, do a rolling or non-equi join:

thresh_dt[dt, on=c("group", "time"), roll=TRUE, nomatch=0]

# or, on data.table 1.9.7+
thresh_dt[dt, on=.(group, time <= time), nomatch=0]

   group                time
1:     1 2016-03-09 13:31:00
2:     1 2016-03-08 16:31:00
3:     2 2016-04-11 03:28:00
4:     2 2016-04-10 03:28:00
5:     3 2016-05-11 23:52:00
6:     3 2016-05-11 00:52:00

Benchmark. The benefits of the GForce max and the rolling join only show up when you have enough groups. My example data extends @sbstn's so that the number of groups is a parameter:

N = 5e6
ng = 1e5

all_times = seq(from = as.POSIXct('2016-01-01 10:00:00'),
                to = as.POSIXct('2016-06-30 10:00:00'),
                by = 60)
all_times_int = as.integer(all_times)    
idx = sample(seq.int(length(all_times)), N, replace = TRUE)
dt = data.table(group = sample(ng, N, replace = TRUE),
                time = all_times[idx],
                time_int = all_times_int[idx])

# sbstn, no gmax
system.time({
  dt[, cutoff_time := max(time) - 24*60*60, by = group]
  dt[time >= cutoff_time]
})
#    user  system elapsed 
#    8.50    0.01    8.47 

# sbstn, with gmax
system.time({
  dt[, maxtime := max(time), by = group][, cutoff_time := maxtime - 24*60*60]
  dt[time >= cutoff_time]
})
#    user  system elapsed 
#    4.98    0.01    4.99

# gmax and roll
system.time({
  thresh_dt = dt[, .(time = max(time)), by=group][, time := time - 24*60*60]
  thresh_dt[dt, on=c("group", "time"), roll=TRUE, nomatch=0][, list(group, time)]
})
#    user  system elapsed 
#    1.29    0.06    1.36 
# (Caveat: I didn't verify that these results match.)
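
One way to check (just a sketch, reusing the objects created in the timings above; fsetequal() compares the rows of two data.tables while ignoring row order):

res_filter = dt[time >= cutoff_time, .(group, time)]
res_roll = thresh_dt[dt, on = c("group", "time"), roll = TRUE, nomatch = 0][, .(group, time)]
fsetequal(res_filter, res_roll)   # TRUE if the two approaches return the same rows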

My answer involves grouping rows twice (once to calculate the max, and again to join back with the original data). By cutting this down to one grouping operation, Clayton Stanley's answer also manages to be fast (at least I think that's what's going on):

system.time(dt[order(group, -time)
    ][, groupP := shift(group, type='lag')
    ][, head := is.na(groupP) | group != groupP
    ][, copy(.SD)[.SD[head == T], rTime := i.time, on=c(group='group')]
    ][time > (rTime - 24*60*60)
    ][, .(group, time)
    ][order(group, -time)
    ])
#    user  system elapsed 
#    1.32    0.25    1.14 
Frank

Simple solution creating a cutoff time for each group (assuming time has already been converted to POSIXct; a conversion sketch follows the code):

dt[, cutoff_time := max(time) - 24*60*60, by = group]
dt[time > cutoff_time]
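
If time is still a character column, as in the question, it needs converting first; a minimal sketch using lubridate (as the question does):

library(lubridate)
# ymd_hms() parses the ISO 8601 strings with UTC offsets into POSIXct (UTC).
dt[, time := ymd_hms(time)]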

EDIT:

The comment on the "GForce optimized max" made me curious, so I created some bigger fake data to compare speed. Note that integer times play nicely with both max and >=:

require(data.table)

require(microbenchmark)

N = 100000
N_g = 100

all_times = seq(from = as.POSIXct('2016-01-01 10:00:00'),
                to = as.POSIXct('2016-06-30 10:00:00'),
                by = 60)

all_times_int = as.integer(all_times)

idx = sample(seq.int(length(all_times)), N, replace = TRUE)

dt = data.table(group = sample(seq.int(N_g), N, replace = TRUE),
                time = all_times[idx],
                time_int = all_times_int[idx])

f1a = function (x) {
  x[, cutoff_time := max(time) - 24*60*60, by = group]
  x[time >= cutoff_time, list(group, time)]
}

f1b = function (x) {
  x[, cutoff_time := max(time_int) - 24*60*60, by = group]
  x[time_int >= cutoff_time, list(group, time)]
}

f2 = function (x) {
  thresh_dt = x[, .(time = max(time)), by=group][, time := time - 24*60*60]
  thresh_dt[x, on=c("group", "time"), roll=TRUE, nomatch=0][, list(group, time)]
}

microbenchmark(f1a(dt),
               f1b(dt),
               f2(dt))

Unit: milliseconds
   expr       min        lq      mean   median        uq       max neval
 f1a(dt)  9.842106 10.593243 11.593148 11.62311 12.478853 14.335338   100
 f1b(dt)  3.391178  3.763598  4.403264  4.00142  5.018182  8.335717   100
  f2(dt) 14.422669 15.701397 17.090674 16.56990 17.695653 52.926897   100

identical(f1a(dt), f1b(dt)) # TRUE
identical(f1a(dt), f2(dt)) # TRUE

EDIT 2: And one more with N = 1,000,000 and N_g = 10,000 groups:

> microbenchmark(f1a(dt),
+                f1b(dt),
+                f2(dt),
+                times = 10)
Unit: milliseconds
    expr       min       lq      mean    median        uq      max neval
 f1a(dt) 634.91473 647.5662 670.74597 663.28238 694.29595 728.2481    10
 f1b(dt)  64.61488  67.3692  76.68925  68.42335  72.36862 113.1407    10
  f2(dt) 205.67688 208.6491 229.65610 213.59476 249.16703 278.7713    10

> microbenchmark(f1a(dt),
+                f1b(dt),
+                f2(dt),
+                times = 10)
Unit: milliseconds
    expr       min        lq     mean    median        uq       max neval
 f1a(dt) 620.11090 624.33587 645.0220 642.13648 657.74347 697.27674    10
 f1b(dt)  64.80214  67.43851  67.9140  67.99647  68.63552  69.74466    10
  f2(dt) 198.39200 199.56088 209.6908 204.60183 216.23255 241.76792    10

> microbenchmark(f1a(dt),
+                f1b(dt),
+                f2(dt),
+                times = 10)
Unit: milliseconds
    expr      min        lq      mean    median        uq      max neval
 f1a(dt) 619.2903 645.22617 656.58883 660.99508 664.82678 682.7618    10
 f1b(dt)  63.2454  67.31781  72.10255  68.19679  71.91441 106.7493    10
  f2(dt) 195.9335 210.06171 222.19868 215.75979 241.74100 245.9022    10
sbstn
  • I'm confused by all_times_int. Are those still times in some sense? – Frank Jul 29 '16 at 20:54
  • AFAIK you can convert a time to `integer` just by counting the seconds from a certain starting point. If you only care about time differences you can simply work with `integer`s, because you don't care about the actual `YYYY-mm-dd HH:MM:SS` representation... Working with `integer` can be faster than using `POSIXct`. Maybe someone with more experience could say something about `data.table::IDateTime`... – sbstn Jul 29 '16 at 21:00
  • Ok, thanks, sounds like you just lose milliseconds and a nice representation. – Frank Jul 29 '16 at 21:00
  • I looked into it some and found that the number of groups is important to what you see in the benchmark, not only the number of rows. I've added it below. This is common with data.table. – Frank Jul 29 '16 at 21:16
  • Oh, I just noticed that you added N_g; however, it's still not used in your code (which still has 1:100 hard-coded). – Frank Jul 29 '16 at 21:17
  • Okay. But really your test does not get at the performance of gmax. It's not like my solution uses it and yours cannot. I show a variation on your solution that uses it too, which offers a truer examination, I reckon. – Frank Jul 29 '16 at 21:19
  • Sorry, actually there was a hiccup in the `microbenchmark`, profuse apologies... Now the numbers are correct! Ran it three times just to be sure :) It's getting late here – sbstn Jul 29 '16 at 21:24
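
A minimal sketch of the integer representation discussed in the comments above (the tz argument just makes the printed values stable):

# POSIXct stores seconds since 1970-01-01 UTC, so as.integer() keeps ordering
# and differences; only sub-second precision and the readable form are lost.
x = as.POSIXct("2016-03-09 08:31:00", tz = "UTC")
xi = as.integer(x)                                  # 1457512260
as.POSIXct(xi, origin = "1970-01-01", tz = "UTC")   # "2016-03-09 08:31:00 UTC"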

The bottleneck is probably in the grouped max() computation. If so:

dt[order(group, -time)
   ][, groupP := shift(group, type='lag')
   ][, head := is.na(groupP) | group != groupP
   ][, copy(.SD)[.SD[head == T], rTime := i.time, on=c(group='group')]
   ][time > (rTime - 24*60*60)
   ][, .(group, time)
   ][order(group, -time)
   ]

   group                time
1:     1 2016-03-09 13:31:00
2:     1 2016-03-08 16:31:00
3:     2 2016-04-11 03:28:00
4:     3 2016-05-11 23:52:00
5:     3 2016-05-11 00:52:00
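
For reference, a terser idiom that also filters with a single grouping pass (just a sketch; here max() sits inside a larger j expression, so it does not get the GForce optimization mentioned in the comments below and will likely be slower with many groups):

# .I collects the qualifying row numbers per group; dt is then subset once.
dt[dt[, .I[time > max(time) - 24*60*60], by = group]$V1, .(group, time)]
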
Clayton Stanley
  • FYI, when `max` is the only thing in `j`, some extra optimization is done, as seen with `dt[, max(time), by=group, verbose=TRUE]` – Frank Jul 29 '16 at 20:20
  • @Frank I used your test code and found that the subtraction using `days(*)` was the bottleneck in this code. Curious how this one fares on your machine compared to the others you tested, as it may be close. – Clayton Stanley Jul 29 '16 at 21:35
  • 1.14 seconds, so a little faster than rolling, yup. (The non-equi join, not shown in the benchmark above, is similar to rolling.) – Frank Jul 29 '16 at 21:38