Thanks to @mt1022, who pointed out that the implementation of combn in base R is very slow (it is implemented in R). We can take approaches from this Q&A about speeding up combn to make this approach more efficient. I couldn't get gRbase to install on my machine, so I took the code from comb2.int and dropped it into my approach:
dt[ , {
  # enumerate all .N*(.N-1)/2 within-group pairs of row indices:
  # edge1: index j is repeated .N - j times (1 appears .N-1 times, ..., .N-1 once)
  edge1 = rep(1:.N, (.N:1) - 1L)
  # running pair counter, shifted by a per-edge1 offset to recover edge2
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  o = cumsum(c(0L, (.N - 2L):1))
  edge2 = i - o[edge1]
  .(edge1 = edge1, edge2 = edge2)
}, by = group]
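As a quick sanity check (a minimal sketch, not part of the original answer), here is the same index arithmetic for a single hypothetical group of size N = 5, compared against what combn would enumerate:
N = 5L
edge1 = rep(1:N, (N:1) - 1L)
i = 2L:(N * (N - 1L) / 2L + 1L)
o = cumsum(c(0L, (N - 2L):1))
edge2 = i - o[edge1]
# same pairs, in the same order, as base R's enumeration
all(rbind(edge1, edge2) == combn(N, 2L))
# [1] TRUE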
This speeds up the process substantially on a beefed-up version of the OP's data set:
max_g = 1000L
dt = data.table(
  group = rep(LETTERS, sample(max_g, 26L, TRUE))
)
dt[ , individual := as.character(.I)]
library(microbenchmark)
microbenchmark(
  times = 10L,
  combn = dt[ , transpose(combn(individual, 2L, simplify = FALSE)), by = group],
  cj = dt[ , CJ(edge1 = individual, edge2 = individual), by = group
           ][edge1 < edge2],
  fast_combn = dt[ , {
    edge1 = rep(1:.N, (.N:1) - 1L)
    i = 2L:(.N * (.N - 1L) / 2L + 1L)
    o = cumsum(c(0L, (.N - 2L):1))
    edge2 = i - o[edge1]
    .(edge1 = edge1, edge2 = edge2)
  }, by = group]
)
# Unit: milliseconds
#        expr       min        lq     mean    median        uq       max neval
#       combn 3075.8078 3247.8300 3905.831 3482.9950 4289.8168 6180.1138    10
#          cj 2495.1798 2549.1552 3830.492 4014.6591 4959.2004 5239.7905    10
#  fast_combn  180.1348  217.9098  294.235  284.8854  329.5982  493.4744    10
That is, while the original combn approach and the suggested CJ approach are roughly neck-and-neck depending on data characteristics, this approach is far and away faster on large data.
Original approach with just combn
We can use combn like so:
dt2 = dt[ , transpose(combn(individual, 2L, simplify = FALSE)), by = group]
By default, combn returns a 2 x n matrix, where n = choose(.N, 2) and .N is each group's size. simplify = FALSE instead returns a length-n list of 2-tuples; transpose converts this (efficiently) to a length-2 list of n-length vectors.
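To make that concrete, here is a small illustration on a hypothetical 3-element vector (transpose here is data.table's list transposer):
x = c('a', 'b', 'c')
combn(x, 2L)
#      [,1] [,2] [,3]
# [1,] "a"  "a"  "b"
# [2,] "b"  "c"  "c"
combn(x, 2L, simplify = FALSE)
# list of 3 pairs: c("a","b"), c("a","c"), c("b","c")
transpose(combn(x, 2L, simplify = FALSE))
# list of 2 vectors: c("a","a","b") and c("b","c","c")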
Then fix the names:
setnames(dt2, c('V1', 'V2'), c('edge1', 'edge2'))
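Alternatively (an equivalent variant, not shown in the original approach), the renaming step can be folded into j by naming the transposed list directly, e.g. with setNames():
dt2 = dt[ , setNames(transpose(combn(individual, 2L, simplify = FALSE)),
                     c('edge1', 'edge2')), by = group]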