Since version v1.9.8 (on CRAN 25 Nov 2016), non-equi joins are possible with data.table
which can be utilized here.
In addition, OP's approach creates "symmetric duplicates" (a, b) and (b, a). Avoiding duplicates would halfen the size of the result set without loss of information (compare ?combn
)
If this is the intention of the OP we can use non-equi joins to avoid those symmetric duplicates:
library(data.table)
dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]
which gives
date contract value rn i.contract i.value
1: 2013-09-24 a 1 1501 b 2
2: 2013-09-25 a 1 1502 b 2
3: 2013-09-26 a 1 1503 b 2
4: 2013-09-27 a 1 1504 b 2
5: 2013-09-28 a 1 1505 b 2
---
1496: 2017-10-28 a 1 2996 b 2
1497: 2017-10-29 a 1 2997 b 2
1498: 2017-10-30 a 1 2998 b 2
1499: 2017-10-31 a 1 2999 b 2
1500: 2017-11-01 a 1 3000 b 2
as opposed to the result using OP's code
date contract value i.contract i.value
1: 2013-09-24 b 2 a 1
2: 2013-09-24 a 1 b 2
3: 2013-09-25 b 2 a 1
4: 2013-09-25 a 1 b 2
5: 2013-09-26 b 2 a 1
---
2996: 2017-10-30 a 1 b 2
2997: 2017-10-31 b 2 a 1
2998: 2017-10-31 a 1 b 2
2999: 2017-11-01 b 2 a 1
3000: 2017-11-01 a 1 b 2
The next step is to further reduce the number of pairs created which are need to be filtered out afterwards:
dt[, val4 := 4 - value][dt, on = .(date, rn < rn, val4 > value), nomatch = 0L]
which returns the same result as above.
Note that filter condition value + i.value < 4
is replaced by another join condition val4 > value
where val4
is an especially created helper column.
Benchmark
For a benchmark case of n <- 150000L
resulting in 450 k rows in dt
the timings are:
n <- 150000L
dt <- data.table(date = rep(seq(Sys.Date() - n+1, Sys.Date(), by = "1 day"), 3),
contract = c(rep("a", n), rep("b", n), rep("c", n)),
value = c(rep(1, n), rep(2, n), rep(3, n)))
dt0 <- copy(dt)
microbenchmark::microbenchmark(
OP = {
dt <- copy(dt0)
dt[dt, on = .(date), allow.cartesian = TRUE][
(contract != i.contract) & (value + i.value < 4)]
},
nej1 = {
dt <- copy(dt0)
dt[, rn := .I][dt, on = .(date, rn < rn), nomatch = 0L][value + i.value < 4]
},
nej2 = {
dt <- copy(dt0)
dt[, rn := .I][, val4 := 4 - value][dt, on = .(date, rn < rn, val4 > value), nomatch = 0L]
},
times = 20L
)
Unit: milliseconds
expr min lq mean median uq max neval cld
OP 136.3091 143.1656 246.7349 298.8648 304.8166 311.1141 20 b
nej1 127.9487 133.1772 160.8096 136.0825 146.0947 298.3348 20 a
nej2 180.4189 183.9264 219.5171 185.9385 198.7846 351.3038 20 b
So, doing the check value + i.value < 4
after the join seems to be faster than including it in the non-equi join.