3

I'm trying to do a non-equi join in data.table and extract the min/max of the joined values in that join.

library(data.table)

set.seed(42)
dtA <- data.table(id=rep(c("A","B"), each=3), start=rep(1:3, times=2), end=rep(2:4, times=2))
dtB <- data.table(id=rep(c("A","B"), times=20), time=sort(runif(40, 1, 4)))

I'd like to preserve the min/max values of time when it is between start and end (and on id). Nominally this is just a non-equi join, but I can't find a combination of by=.EACHI or mult="..." that gets me what I want. Instead, the min/max typically are not aligned with the range I need. Unfortunately roll= does not support non-equi ranges.

dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
    on = .(id, start <= time, end > time), mult = "first"]
#        id start   end      Min      Max
#    <char> <int> <int>    <num>    <num>
# 1:      A     1     2 1.011845 3.966675
# 2:      A     2     3 1.011845 3.966675
# 3:      A     3     4 1.011845 3.966675
# 4:      B     1     2 1.011845 3.966675
# 5:      B     2     3 1.011845 3.966675
# 6:      B     3     4 1.011845 3.966675
dtA[dtB, c("Min", "Max") := .(min(time), max(time)),
    on = .(id, start <= time, end > time), by = .EACHI]
#        id start   end      Min      Max
#    <char> <int> <int>    <num>    <num>
# 1:      A     1     2 1.858419 1.858419
# 2:      A     2     3 2.970977 2.970977
# 3:      A     3     4 3.934679 3.934679
# 4:      B     1     2 1.766286 1.766286
# 5:      B     2     3 2.925237 2.925237
# 6:      B     3     4 3.966675 3.966675

The second is the closest ("Max" is correct), but what I'd like to be able to get is:

       id start   end      Min      Max
   <char> <int> <int>    <num>    <num>
1:      A     1     2 1.011845 1.858419
2:      A     2     3 2.170610 2.970977
3:      A     3     4 3.115194 3.934679
4:      B     1     2 1.022002 1.766286
5:      B     2     3 2.164325 2.925237
6:      B     3     4 3.055509 3.966675

The real problem has around 400K rows of ranges joining against roughly 2 million rows of values, so I'd prefer not to do a full expansion of both frames and then manually cut it back down to the size of dtA.
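
For reference, the kind of full expansion I'm trying to avoid looks roughly like this (a sketch only: join out to one row per range/observation match, then aggregate back down to the rows of dtA; `full` is just an illustrative intermediate name):

# expand to one row per match (can get very large on the real data), then collapse back
full <- dtA[dtB,
            .(id, start = x.start, end = x.end, time = i.time),
            on = .(id, start <= time, end > time), nomatch = NULL]
full[, .(Min = min(time), Max = max(time)), by = .(id, start, end)]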

(I'm open to collapse suggestions.)

r2evans
  • Does it have to be all in one line, or can I suggest doing it in two sweeps - `dtA[dtB[order(id,-time)], on = .(id, start <= time, end > time), min := time];dtA[dtB[order(id,time)], on = .(id, start <= time, end > time), max := time]` (sketched out more readably after these comments)? – thelatemail Mar 14 '23 at 22:24
  • The `tidyverse` equivalent is `left_join(as.data.frame(dtA), as.data.frame(dtB), by=join_by(id, start < time, end > time)) %>% summarize(Min = min(time), Max = max(time), .by=c(id, start, end))`. Definitely slower because there's no threading, but still quite efficient. – Andre Wildberg Mar 14 '23 at 23:32
  • Thanks Andre. While simple, this does what I'm trying to not do: a full expansion. I also don't have dplyr-1.1.0 yet for other reasons, but I appreciate your suggestion! – r2evans Mar 15 '23 at 02:20
  • Are the by-group intervals in your real data mutually exclusive as in the example? – jblood94 Mar 15 '23 at 13:34
  • @jblood94 in the real data, `start` and `end` for each `id` abut, meaning `start == shift(end)`. It's time-ranges (`POSIXt`) in `dtA` and observations in `dtB`, and I need to find various stats (min, max, first, last, etc) of a dozen or so numeric columns. The end-result is going into a user-interactive app, so I want to get the runtime down as low as I can; part of that is reduction of the data going in, but the rest is just raw computation. (To make it worse, some of the fields are lat/lon, requiring either UTM-conversion and back, or `geosphere`-math, not cheap.) – r2evans Mar 15 '23 at 13:45
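
Expanding the two-sweep suggestion from thelatemail's comment above into a more readable sketch (a reformatting of that comment's code with explicit `i.` prefixes, not code taken from elsewhere in the thread): each call is an update join, and when several dtB rows match the same range, later rows of i overwrite earlier ones, so the sort order of dtB decides which time value sticks.

# dtB sorted descending by time: the last matching row per range carries the smallest time
dtA[dtB[order(id, -time)], on = .(id, start <= time, end > time), min := i.time]
# dtB sorted ascending by time: the last matching row per range carries the largest time
dtA[dtB[order(id, time)], on = .(id, start <= time, end > time), max := i.time]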

3 Answers

3

Switch the join around so it's dtB[dtA], and then assign the result back into dtA:

dtA[,
    c("min","max") := dtB[
        dtA,
        on=.(id, time >= start, time < end), 
        .(min=min(x.time), max=max(x.time)),
        by=.EACHI][, c("min","max")]
    ]
dtA

#       id start   end      min      max
#   <char> <int> <int>    <num>    <num>
#1:      A     1     2 1.011845 1.858419
#2:      A     2     3 2.170610 2.970977
#3:      A     3     4 3.115194 3.934679
#4:      B     1     2 1.022002 1.766286
#5:      B     2     3 2.164325 2.925237
#6:      B     3     4 3.055509 3.966675

You can see why it needs to be flipped around: otherwise each .EACHI group is an individual row of dtB, instead of the set of dtB rows matching each range in dtA:

dtB[dtA, on=.(id, time >= start, time < end), .N, by=.EACHI]
#       id  time  time     N
#   <char> <int> <int> <int>
#1:      A     1     2     5
#2:      A     2     3     6
#3:      A     3     4     9
#4:      B     1     2     4
#5:      B     2     3     6
#6:      B     3     4    10

dtA[dtB, on=.(id, start <= time, end > time), .N, by=.EACHI][, .(freq=.N), by=N]
#       N  freq
#   <int> <int>
#1:     1    40

This makes sense in the context of the description in help("special-symbols", package = "data.table"):

Its usage is 'by=.EACHI' (or 'keyby=.EACHI') which invokes grouping-by-each-row-of-i

In the DT[i, j, by] logic, dtA therefore needs to be in the i position, so that its rows define the grouping.
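
To make "grouping-by-each-row-of-i" concrete, here is a minimal equi-join sketch on made-up toy data (x and i below are hypothetical, not from the question): j runs once per row of i, over the x rows that row matches.

x <- data.table(g = c("a", "a", "b"), v = 1:3)
i <- data.table(g = c("a", "b"))
# one result row per row of i; .N and max(v) are computed over the matching x rows
x[i, on = "g", .(n = .N, vmax = max(v)), by = .EACHI]
#         g     n  vmax
# 1:      a     2     2
# 2:      b     1     3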

thelatemail
2

Update based on comments

Joins can be avoided altogether by using nafill after row-binding dtA to dtB and sorting. The solution below can run orders of magnitude faster than a .EACHI non-equi join, at the cost of considerably more complex code.

Unlike the original version of this answer, this approach makes no assumption about whether the by-group intervals overlap.

i <- 1:nrow(dtA)
dtA[
    , c("Min", "Max") := setorder(
      rbindlist(
        list( # list order is important
          dtA[, .(id, time = end, rw = .I, isStart = -1L)],
          dtA[, .(id, time = start, rw = .I, isStart = 1L)],
          dtB[, `:=`(rw = NA_integer_, isStart = 0L)]
        )
      ), id, time
    )[ # set the time at start/end points to NA
      !is.na(rw), time := NA_real_
    ][ # fill forward to get the max in each interval
      isStart <= 0L, time := nafill(time, "locf")
    ][ # fill backward to get the min in each interval
      isStart >= 0L, time := nafill(time, "nocb")
      # get the max and min, ordered by dtA row
    ][!is.na(rw), setorder(.SD, -isStart, rw)[, .(time[i], time[-i])]]
  # set Min/Max to NA for ranges that matched nothing
  ][Min >= end | Min < start, c("Min", "Max") := NA_real_]

dtA
#>    id start end      Min      Max
#> 1:  A     1   2 1.011845 1.858419
#> 2:  A     2   3 2.170610 2.970977
#> 3:  A     3   4 3.115194 3.934679
#> 4:  B     1   2 1.022002 1.766286
#> 5:  B     2   3 2.164325 2.925237
#> 6:  B     3   4 3.055509 3.966675

As a function, along with a function for the .EACHI approach from this answer:

fsort <- function(dtA, dtB) {
  i <- 1:nrow(dtA)
  dtA[
    , c("Min", "Max") := setorder(
      rbindlist(
        list(
          dtA[, .(id, time = end, rw = .I, isStart = -1L)],
          dtA[, .(id, time = start, rw = .I, isStart = 1L)],
          dtB[, `:=`(rw = NA_integer_, isStart = 0L)]
        )
      ), id, time
    )[
      !is.na(rw), time := NA_real_
    ][
      isStart <= 0L, time := nafill(time, "locf")
    ][
      isStart >= 0L, time := nafill(time, "nocb")
    ][!is.na(rw), setorder(.SD, -isStart, rw)[, .(time[i], time[-i])]]
  ][Min >= end | Min < start, c("Min", "Max") := NA_real_]
}

fEACHI <- function(dtA, dtB) {
  dtA[
    , c("Min","Max") := dtB[
      dtA,
      on = .(id, time >= start, time < end), 
      .(Min=min(x.time), Max=max(x.time)),
      by=.EACHI][, c("Min","Max")]
  ]
}

Benchmarking on a much larger dataset:

dtA <- data.table(id = rep(c("A", "B"), 2e5), start = sort(runif(4e5, 0, 2e5)), end = sort(runif(4e5, 0, 2e5)))
dtA[start > end, c("end", "start") := .(start, end)]
dtB <- data.table(id = rep(c("A", "B"), 1e6), time = sort(runif(2e6, 0, 2e5)))
setorder(dtA, id, start)

microbenchmark::microbenchmark(
  sort = fsort(dta, dtB),
  .EACHI = fEACHI(dta, dtB),
  setup = dta <- copy(dtA),
  check = "equal",
  times = 10
)
#> Unit: milliseconds
#>    expr       min       lq      mean    median        uq       max neval
#>    sort  258.6646  270.124  295.0626  304.3474  308.0906  322.8272    10
#>  .EACHI 2065.2458 2094.445 2115.2688 2104.5641 2152.5112 2157.1872    10

fEACHI gets increasingly slow as the interval widths increase, while fsort scales very nicely. For the example dataset below, fsort still takes a fraction of a second, while fEACHI explodes to 400 seconds.

dtA <- data.table(id = rep(c("A", "B"), 2e5), start = runif(4e5, 0, 2e5), end = runif(4e5, 0, 2e5))
dtA[start > end, c("end", "start") := .(start, end)]
dtB <- data.table(id = rep(c("A", "B"), 1e6), time = sort(runif(2e6, 0, 2e5)))
setorder(dtA, id, start)

system.time(dt1 <- fsort(copy(dtA), dtB))
#>    user  system elapsed 
#>    0.38    0.08    0.26
system.time(dt2 <- fEACHI(copy(dtA), dtB))
#>    user  system elapsed 
#>  520.44   23.03  399.72
identical(dt1, dt2)
#> [1] TRUE
jblood94
  • As I dive into this, I think the caveat of non-overlapping ranges might need to be **bolded** for future readers – r2evans Mar 15 '23 at 20:44
  • This is a really interesting approach @jblood94, thank you for providing it! It's definitely a different way to approach a large-ish problem. Unfortunately for me, the sample data I provided does suggest non-overlapping per ID, but that is not guaranteed in the real data, there are indeed overlapping regions of time. I tried for a bit to digest your code and adapt to this problem, but I think it's unlikely to work given how I _think_ you are doing it. Thank you again for the contribution! Clearly a case of over-simplifying sample data (on my part). – r2evans Mar 15 '23 at 21:06
  • It's still quite doable without the assumption, just simpler with it. I'll maybe take another crack at it tomorrow. – jblood94 Mar 16 '23 at 00:52
  • @r2evans, see the updated answer. Removing the assumption wasn't too bad. – jblood94 Mar 16 '23 at 17:42
2

Ultimately I think it's not possible to resolve this without doing a full expansion somewhere, so the next concern is runtime. With this 6-row sample, thelatemail's answer was faster, but as the data grows the collapse approach reaches parity, and on my real data it takes half the time (which is still on the order of seconds).

Also, I've changed the code below to use first/last instead of min/max, which matches another aspect of the data I'm working with (and is analogous in almost every way as far as performance and data management go).

thelatemail's:

dtA[,
    c("min","max") := dtB[
      dtA,
      on=.(id, time >= start, time < end),
      .(min=first(x.time), max=last(x.time)),
      by=.EACHI][, c("min","max")]
    ]

The equivalent in collapse:

joined <- dtB[, tm := time][dtA, on = .(id, tm >= start, tm < end)]  # full expansion: one row per match (adds tm to dtB by reference)
setnames(joined, c("tm", "tm.1"), c("start", "end"))                 # the join columns carry dtA's range values
g <- collapse::GRP(joined, c("id", "start", "end"))                  # one group per dtA range
dtA[
  cbind(collapse::ffirst(joined[, .(time)], g = g),  # first time in each range
        collapse::flast(joined[, .(time)], g = g),   # last time in each range
        g$groups),
  on = c("id", "start", "end")]

Andre Wildberg's recommendation of a pure dplyr (>= 1.1.0) version is also notable, though I can't benchmark it at the moment (I'm still on dplyr < 1.1.0).

r2evans