12

Say I have a data table:

dt <- data.table(
        datetime = seq(as.POSIXct("2016-01-01 00:00:00"),as.POSIXct("2016-01-01 10:00:00"), by = "1 hour"),
        ObType = c("A","A","B","B","B","B","A","A","B","A","A")
)

dt
                   datetime ObType
     1: 2016-01-01 00:00:00      A
     2: 2016-01-01 01:00:00      A
     3: 2016-01-01 02:00:00      B
     4: 2016-01-01 03:00:00      B
     5: 2016-01-01 04:00:00      B
     6: 2016-01-01 05:00:00      B
     7: 2016-01-01 06:00:00      A
     8: 2016-01-01 07:00:00      A
     9: 2016-01-01 08:00:00      B
    10: 2016-01-01 09:00:00      A
    11: 2016-01-01 10:00:00      A

What I need to do is wherever the ObType is "B", I need to find the time to the nearest ObType "A" on either side. So the result should look like (in hours):

               datetime ObType timeLag timeLead
 1: 2016-01-01 00:00:00      A      NA       NA
 2: 2016-01-01 01:00:00      A      NA       NA
 3: 2016-01-01 02:00:00      B       1        4
 4: 2016-01-01 03:00:00      B       2        3
 5: 2016-01-01 04:00:00      B       3        2
 6: 2016-01-01 05:00:00      B       4        1
 7: 2016-01-01 06:00:00      A      NA       NA
 8: 2016-01-01 07:00:00      A      NA       NA
 9: 2016-01-01 08:00:00      B       1        1
10: 2016-01-01 09:00:00      A      NA       NA
11: 2016-01-01 10:00:00      A      NA       NA

I usually use data.table, but non data.table solutions are also fine.

Thanks!

Lyss

Bucket
  • 527
  • 2
  • 16
  • 2
    Sorry I don't have time for full answer, but `roll="nearest"` likely needed. Perhaps something like `lookup = setkey(DT[ObType=="A"], datetime)` and then join just the B's to `lookup` using `roll="nearest"`. More info on the bigger picture would be useful please in case it can be done a different way. – Matt Dowle Feb 21 '17 at 23:36
  • Are there only these two "ObTypes"? You could, simply, try `findInterval` -- `a = dt$datetime[dt$ObType == "A"]; ib = dt$ObType == "B"; b = dt$datetime[ib]; i = findInterval(b, a); cbind(replace(NA, ib, b - a[i]), replace(NA, ib, a[i + 1] - b))` – alexis_laz Feb 22 '17 at 08:17

3 Answers3

10

The approach I hinted at using roll= :

X = dt[ObType=="A"]
X
              datetime ObType
1: 2016-01-01 00:00:00      A
2: 2016-01-01 01:00:00      A
3: 2016-01-01 06:00:00      A
4: 2016-01-01 07:00:00      A
5: 2016-01-01 09:00:00      A
6: 2016-01-01 10:00:00      A

dt[ObType=="B", Lag:=X[.SD,on="datetime",roll=Inf,i.datetime-x.datetime]]
dt[ObType=="B", Lead:=X[.SD,on="datetime",roll=-Inf,x.datetime-i.datetime]]
dt[ObType=="B", Nearest:=X[.SD,on="datetime",roll="nearest",x.datetime-i.datetime]]
dt
               datetime ObType      Lag     Lead     Nearest
 1: 2016-01-01 00:00:00      A NA hours NA hours    NA hours
 2: 2016-01-01 01:00:00      A NA hours NA hours    NA hours
 3: 2016-01-01 02:00:00      B  1 hours  4 hours    -1 hours
 4: 2016-01-01 03:00:00      B  2 hours  3 hours    -2 hours
 5: 2016-01-01 04:00:00      B  3 hours  2 hours     2 hours
 6: 2016-01-01 05:00:00      B  4 hours  1 hours     1 hours
 7: 2016-01-01 06:00:00      A NA hours NA hours    NA hours
 8: 2016-01-01 07:00:00      A NA hours NA hours    NA hours
 9: 2016-01-01 08:00:00      B  1 hours  1 hours    -1 hours
10: 2016-01-01 09:00:00      A NA hours NA hours    NA hours
11: 2016-01-01 10:00:00      A NA hours NA hours    NA hours

One advantage of roll= is that you can apply a staleness limit just by changing the Inf to the limit of time you wish to join within. It's the time difference that the limit applies to, not the number of rows. Inf just means don't limit. The roll= sign indicates whether to look forwards or backwards (lead or lag).

Another advantage is that roll= is fast.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 1
    why not `setkey(dt, ObType, datetime)` first, and then use `dt["B", Lag:=...`? – setempler Feb 22 '17 at 04:47
  • 1
    @setempler Yes that `setkey()` would work too and would be faster but I got the impression the OP didn't want to change the order of their data: it's ordered by time and has A's then B's then A's then B's. – Matt Dowle Feb 22 '17 at 19:36
3
dt$timelag = NA
dt$timelead = NA

A = split(dt, dt$ObType)$A
B = split(dt, dt$ObType)$B

A_time_up = sort(A$datetime)
A_time_dn = sort(A$datetime, decreasing = TRUE)

B$timelag = apply(B, 1, function(x) 
    A_time_up[which(x[1] < A_time_up)[1]]
)

B$timelead = apply(B, 1, function(x) 
    A_time_dn[which(x[1] > A_time_dn)[1]]
)

B$timelag = (B$timelag - as.numeric(B$datetime))/(3600)
B$timelead = (as.numeric(B$datetime) - B$timelead)/(3600)

rbind(A,B)
d.b
  • 32,245
  • 6
  • 36
  • 77
3

Two approaches, one using joins, the other using reshaping

Joins

There is probably a better approach that uses rolling joins / non-equi joins, but here's a brute-force approach

dt2 <- dt[, key := 1][ 
    dt, 
    on = "key", 
    allow.cartesian = T
    ][
        ObType != i.ObType
        ][
            , `:=`(lag_min = datetime - i.datetime,
                         lag_max = i.datetime - datetime)
            ]


dt_min <- dt2[ObType == "B" & lag_min > 0, .(timeLag = min(lag_min)), by = .(datetime, ObType)]
dt_max <- dt2[ObType == "B" & lag_max > 0, .(timeLead = min(lag_max)), by = .(datetime, ObType)]


dt_max[ dt_min[ dt, on = c("datetime", "ObType"), nomatch = NA], on = c("datetime", "ObType"), nomatch = NA]

#                datetime ObType  lag_max  lag_min key
#  1: 2016-01-01 00:00:00      A NA hours NA hours   1
#  2: 2016-01-01 01:00:00      A NA hours NA hours   1
#  3: 2016-01-01 02:00:00      B  4 hours  1 hours   1
#  4: 2016-01-01 03:00:00      B  3 hours  2 hours   1
#  5: 2016-01-01 04:00:00      B  2 hours  3 hours   1
#  6: 2016-01-01 05:00:00      B  1 hours  4 hours   1
#  7: 2016-01-01 06:00:00      A NA hours NA hours   1
#  8: 2016-01-01 07:00:00      A NA hours NA hours   1
#  9: 2016-01-01 08:00:00      B  1 hours  1 hours   1
# 10: 2016-01-01 09:00:00      A NA hours NA hours   1
# 11: 2016-01-01 10:00:00      A NA hours NA hours   1

Reshaping

It's quite involved, and some of the steps can obviously be simplified, but I'm throwing it all in here anyway so you can see the process

dt[, group := rleid(ObType)]
dt_cast <- dcast(dt, formula = datetime + group ~ ObType, value.var = "ObType")

dt_cast[, `:=`(group_before = group - 1,
                             group_after = group + 1)]


dt_min <- dt_cast[ !is.na(B) ][dt_cast[!is.na(A), .(datetime, group)] , on = c(group_before = "group")  , allow.cartesian = T][, max(i.datetime), by = group]
dt_max <- dt_cast[ !is.na(B) ][dt_cast[!is.na(A), .(datetime, group)] , on = c(group_after = "group")  , allow.cartesian = T][, min(i.datetime), by = group]


dt_cast <- rbindlist(list(
    dt_cast[ dt_min, on = c("group"), nomatch = 0],
    dt_cast[ dt_max, on = c("group"), nomatch = 0]
))

dt <- dt_cast[ dt, on = c("datetime", "group"), nomatch = NA][, .(datetime, ObType, lag = V1)]

dt[ObType == "B" , lag_type := c("lag", "lead"), by = .(datetime, ObType)]
dt <- dcast(dt, formula = datetime + ObType ~ lag_type, value.var = "lag")

dt[, `:=`(timeLag = difftime(datetime, lag),
                    timeLead = difftime(lead, datetime),
                    `NA` = NULL)]

dt
#                datetime ObType                 lag                lead  timeLag timeLead
#  1: 2016-01-01 00:00:00      A                <NA>                <NA> NA hours NA hours
#  2: 2016-01-01 01:00:00      A                <NA>                <NA> NA hours NA hours
#  3: 2016-01-01 02:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  1 hours  4 hours
#  4: 2016-01-01 03:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  2 hours  3 hours
#  5: 2016-01-01 04:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  3 hours  2 hours
#  6: 2016-01-01 05:00:00      B 2016-01-01 01:00:00 2016-01-01 06:00:00  4 hours  1 hours
#  7: 2016-01-01 06:00:00      A                <NA>                <NA> NA hours NA hours
#  8: 2016-01-01 07:00:00      A                <NA>                <NA> NA hours NA hours
#  9: 2016-01-01 08:00:00      B 2016-01-01 07:00:00 2016-01-01 09:00:00  1 hours  1 hours
# 10: 2016-01-01 09:00:00      A                <NA>                <NA> NA hours NA hours
# 11: 2016-01-01 10:00:00      A                <NA>                <NA> NA hours NA hours
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • Thanks! I like the solution with joins. I have a set of ID's i need to repeat this with, so I'm combining it with ddply... Would it be possible if there was an ID column to also just join on 'ID' as well as 'key' ? – Bucket Feb 22 '17 at 03:14
  • @LyssBucks - of course. You might be able to replace `key` with your `ID`. I just created `key` so it all joined together. – SymbolixAU Feb 22 '17 at 03:23
  • Perfect! Works like a dream :) – Bucket Feb 22 '17 at 03:52