For the sake of completeness, this can be achieved as well using base R by
init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
id2 = unlist(lapply(-seq_along(id), tail, x = id)))]
Benchmark
As the OP has mentioned
My real data is huge, so I'm looking for an efficient solution.
here is a benchmark comparing
- cj: CJ() with subsequent filtering
- sj: langtang's non-equi self-join
- rp: above base R approach using
rep()
and tail()
for problem sizes (length of input vector) of 10, 100, 1000, and 10k rows. Note that for the 10k case the result contains nearly 50 million combinations. So available memory is a limiting factor.
Note that bench::mark()
verifies that all approaches return the same result.
library(bench)
library(data.table)
bm <- press(
n_id = 10^(4:1),
{
init0 <- data.table(id = sprintf("%09i", seq(n_id)))
cat(sprintf("Number of combinations in result: %g\n", (n_id-1)*n_id/2))
mark(
rp = {
init <- copy(init0)
init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
id2 = unlist(lapply(-seq_along(id), tail, x = id)))]
},
sj = {
init <- copy(init0)
init[, id2 := as.factor(id)][
init, on = .(id2 > id2), nomatch = 0, .(id1 = as.character(id2), id2 = id)]
},
cj = {
init <- copy(init0)
init[, CJ(id1 = id, id2 = id)][id1 < id2]
},
check = function(x, y) {
tmp <- all.equal(x, y, check.attributes = FALSE)
if (!isTRUE(tmp)) {
cat(tmp, "\n")
str(x)
str(y)
}
tmp
},
min_iterations = 1L
)
}
)
bm
expression n_id min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 rp 10000 1.8s 1.8s 0.556 2.05GB 0.556 1 1 1.8s <data.t… <Rprof… <ben…
2 sj 10000 2.87s 2.87s 0.348 2.24GB 0.696 1 2 2.87s <data.t… <Rprof… <ben…
3 cj 10000 49.29s 49.29s 0.0203 3.17GB 0.0203 1 1 49.29s <data.t… <Rprof… <ben…
4 rp 1000 22.11ms 22.53ms 44.2 21.15MB 2.10 21 1 475.58ms <data.t… <Rprof… <ben…
5 sj 1000 18.28ms 18.88ms 52.4 23.19MB 2.02 26 1 496.27ms <data.t… <Rprof… <ben…
6 cj 1000 542.97ms 542.97ms 1.84 32.53MB 0 1 0 542.97ms <data.t… <Rprof… <ben…
7 rp 100 1.46ms 1.52ms 651. 285.94KB 6.56 298 3 457.53ms <data.t… <Rprof… <ben…
8 sj 100 2.14ms 2.21ms 450. 406.48KB 4.24 212 2 471.37ms <data.t… <Rprof… <ben…
9 cj 100 6.92ms 6.97ms 143. 431.23KB 0 72 0 504.44ms <data.t… <Rprof… <ben…
10 rp 10 679.8µs 738.38µs 1343. 66.34KB 6.60 610 3 454.25ms <data.t… <Rprof… <ben…
11 sj 10 1.71ms 1.81ms 552. 156.17KB 4.22 262 2 474.47ms <data.t… <Rprof… <ben…
12 cj 10 915.48µs 967.99µs 1029. 100.83KB 4.21 489 2 475.13ms <data.t… <Rprof… <ben…
library(ggplot2)
autoplot(bm)

Conclusion
cj
is magnitudes slower than the other 2 approaches for larger use cases while rp
is on a par with sj
but allocates slightly less memory.