
I have a data.table like this:

library(data.table)

dt <- data.table(
  group = c(rep("A", 4), rep("B", 3), rep("C", 2)),
  individual = c("Ava", "Bill", "Claire", "Daniel", "Evelyn",
                 "Francis", "Grant", "Helen", "Ig")
)

I would like to transform it into something like this:

dt2 <- data.table(
  group = c(rep("A", 6), rep("B", 3), rep("C", 1)),
  edge1 = c("Ava", "Ava", "Ava", "Bill", "Bill", "Claire",
            "Evelyn", "Evelyn", "Francis", "Helen"),
  edge2 = c("Bill", "Claire", "Daniel", "Claire", "Daniel", "Daniel",
            "Francis", "Grant", "Grant", "Ig")
)

Basically, each row of the second table is one combination of two individuals within a group of the first table. The whole idea is to feed the data into igraph for network analysis, so if there are any better solutions for that purpose they are more than welcome.
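For reference, once dt2 exists, the igraph step can look like the following minimal sketch (assuming an undirected network; graph_from_data_frame treats the first two columns as the edge list and keeps any further columns as edge attributes):

library(igraph)

# build an undirected graph; `group` becomes an edge attribute
g <- graph_from_data_frame(dt2[, .(edge1, edge2, group)], directed = FALSE)
g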

wyatt

2 Answers


Thanks to @mt1022, who helped highlight that the implementation of combn in base R is very slow (it's implemented in R). So, we can take approaches from this Q&A about speeding up combn to make this approach more efficient. I couldn't get gRbase to install on my machine, so I took the code from comb2.int and dropped it into my approach:

dt[ , {
  # row k pairs with the .N - k rows after it, so repeat each
  # index k exactly .N - k times
  edge1 = rep(1:.N, (.N:1) - 1L)
  # running positions 2, 3, ..., choose(.N, 2) + 1
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  # cumulative offsets so edge2 restarts at edge1 + 1 for each new edge1
  o = cumsum(c(0L, (.N - 2L):1))
  edge2 = i - o[edge1]
  .(edge1 = edge1, edge2 = edge2)
}, by = group]
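Note this returns integer indices within each group rather than names (igraph is happy with either). If names are needed, a small tweak works; here is a sketch with only the final line changed:

dt[ , {
  edge1 = rep(1:.N, (.N:1) - 1L)
  i = 2L:(.N * (.N - 1L) / 2L + 1L)
  o = cumsum(c(0L, (.N - 2L):1))
  edge2 = i - o[edge1]
  # look the integer indices back up to recover the names
  .(edge1 = individual[edge1], edge2 = individual[edge2])
}, by = group]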

This speeds up the process substantially on a beefed-up version of the OP's data set:

max_g = 1000L
dt = data.table(
  group = rep(LETTERS, sample(max_g, 26L, TRUE))
)
dt[ , individual := as.character(.I)]

library(microbenchmark)
microbenchmark(
  times = 10L,
  combn = dt[ , transpose(combn(individual, 2L, simplify = FALSE)), by = group],
  cj = dt[ , CJ(edge1 = individual, edge2 = individual), by = group
           ][edge1 < edge2],
  fast_combn = dt[ , {
    edge1 = rep(1:.N, (.N:1) - 1L)
    i = 2L:(.N * (.N - 1L) / 2L + 1L)
    o = cumsum(c(0L, (.N-2L):1))
    edge2 = i - o[edge1]
    .(edge1 = edge1, edge2 = edge2)
  }, by = group]
)
# Unit: milliseconds
#        expr       min        lq     mean    median        uq       max neval
#       combn 3075.8078 3247.8300 3905.831 3482.9950 4289.8168 6180.1138    10
#          cj 2495.1798 2549.1552 3830.492 4014.6591 4959.2004 5239.7905    10
#  fast_combn  180.1348  217.9098  294.235  284.8854  329.5982  493.4744    10

That is, while the original combn approach and the one suggested with CJ are about neck-and-neck depending on data characteristics, this approach is far and away the fastest on large data.


Original approach with just combn

We can use combn like so:

dt2 = dt[ , transpose(combn(individual, 2L, simplify = FALSE)), by = group]

By default, combn will return a 2 x n matrix, where n = choose(.N, 2) and .N is each group's size.

simplify = FALSE will instead return a length-n list of tuples; transpose converts this to a length-2 list of n-tuples (efficiently).
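For concreteness, here is what those two steps do on one small group (results paraphrased in the comments):

x = c("Ava", "Bill", "Claire")
combn(x, 2L, simplify = FALSE)
# a list of 3 pairs: ("Ava", "Bill"), ("Ava", "Claire"), ("Bill", "Claire")
transpose(combn(x, 2L, simplify = FALSE))
# a list of 2 vectors: ("Ava", "Ava", "Bill") and ("Bill", "Claire", "Claire"),
# which j then turns into the columns V1 and V2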

Then fix the names:

setnames(dt2, c('V1', 'V2'), c('edge1', 'edge2'))
MichaelChirico
  • So neat. I did try `combn` and was getting errors; the trick was in `transpose`. Thanks a lot! – wyatt Jun 06 '18 at 03:33
  • @yongwookLee just a matter of playing around with the output -- the key thing to remember about `data.table` is that if `j` returns a list with `k` elements, there will be `k` columns. So we just have to manipulate the output of `combn` to fit that. – MichaelChirico Jun 06 '18 at 03:36
  • @yongwookLee that said, I seem to recall there being a more automatic way of constructing such edges in `igraph`, but i'm not facile enough with the package to know off-hand. – MichaelChirico Jun 06 '18 at 03:39
  • The `fast_combn` is amazing! I edited my answer to recommend this approach now. – mt1022 Jun 06 '18 at 07:31
  • Added a naive `Rcpp` implementation of `combn` :) see the edited answer! – mt1022 Jun 06 '18 at 13:23

You can achieve it with CJ:

dt[, CJ(edge1 = individual, edge2 = individual), by = group][edge1 < edge2]
#     group   edge1   edge2
#  1:     A     Ava    Bill
#  2:     A     Ava  Claire
#  3:     A     Ava  Daniel
#  4:     A    Bill  Claire
#  5:     A    Bill  Daniel
#  6:     A  Claire  Daniel
#  7:     B  Evelyn Francis
#  8:     B  Evelyn   Grant
#  9:     B Francis   Grant
# 10:     C   Helen      Ig

Discussion

As noted by MichaelChirico, this will require more memory. For a group of size n, CJ will create n^2 rows, while combn will create n(n-1)/2 rows. The ratio is n^2 / (n(n-1)/2) = 2n/(n-1) ~ 2.
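To put rough numbers on that (an illustrative group size, not taken from the benchmark):

n = 1000L
n^2           # rows CJ materializes before filtering: 1,000,000
choose(n, 2)  # rows combn produces directly: 499,500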

For an approach that is more efficient in both memory and speed, see fast_combn in MichaelChirico's answer.


Edit

Added an Rcpp implementation of combn by enumeration:

library(Rcpp)
cppFunction(
    'List combnCpp(CharacterVector x) {
    const int n = x.size();
    x.sort();
    // pre-allocate both edge columns at their final size, choose(n, 2)
    CharacterVector combn1 = CharacterVector(n*(n-1)/2);
    CharacterVector combn2 = CharacterVector(n*(n-1)/2);
    int idx = 0;
    // enumerate all pairs (i, j) with i < j
    for(int i = 0; i < n - 1; i++) {
        for(int j = i + 1; j < n; j++){
            combn1[idx] = x[i];
            combn2[idx] = x[j];
            idx++;
        }
    }
    return List::create(_["V1"] = combn1, _["V2"] = combn2);
}')

dt2 = dt[ , combnCpp(individual), by = group]  # a new name, so the combnCpp function isn't overwritten

Here is a benchmark using @MichaelChirico's code:

library(data.table)
max_g = 1e3
set.seed(123)
dt = data.table(
    group = rep(LETTERS, sample(max_g, 26, TRUE))
)
dt[ , individual := as.character(.I)]

library(gRbase)
library(microbenchmark)
microbenchmark(
    times = 10L,
    cpp_combn = dt[ , combnCpp(individual), by = group],
    gRbase = dt[ , transpose(combnPrim(individual, 2, simplify = FALSE)), by = group],
    CJ = dt[ , CJ(edge1 = individual, edge2 = individual), by = group][edge1 < edge2],
    fast_combn = dt[ , {
        edge1 = rep(1:.N, (.N:1) - 1L)
        i = 2L:(.N * (.N - 1L) / 2L + 1L)
        o = cumsum(c(0L, (.N-2L):1))
        edge2 = i - o[edge1]
        .(edge1 = edge1, edge2 = edge2)
    }, by = group]
)
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#   cpp_combn  247.6795  284.3614  324.2149  305.1760  347.1372  499.9442    10
#      gRbase 1115.0338 1299.2865 1341.3890 1339.3950 1378.6571 1517.2534    10
#          CJ 1455.2715 1481.8725 1630.0190 1616.7780 1754.3922 1879.5768    10
#  fast_combn  128.5774  153.4234  215.5325  166.7491  319.1567  363.3657    10

combnCpp is still roughly 2 times slower than fast_combn, which might be due to the fact that combnCpp does enumeration while fast_combn does computation. A possible improvement for combnCpp would be to calculate indices as fast_combn does rather than enumerating.
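One unbenchmarked sketch of a related cheap win (a hypothetical variant, names are mine): have the C++ side emit only integer indices, so the inner loop never copies character data, and map back to names in R when needed:

# hypothetical variant: integer indices only, no string copying in the loop
cppFunction(
    'List combnIdxCpp(const int n) {
    const int m = n * (n - 1) / 2;
    IntegerVector e1(m), e2(m);
    int idx = 0;
    for(int i = 1; i < n; i++) {
        for(int j = i + 1; j <= n; j++) {
            e1[idx] = i;
            e2[idx] = j;
            idx++;
        }
    }
    return List::create(_["edge1"] = e1, _["edge2"] = e2);
}')

dt[ , combnIdxCpp(.N), by = group]  # 1-based indices within each group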

mt1022
  • awesome! have you tried gRbase function mentioned in my linked answer? I'm pretty sure it's done in rcpp & have to assume it's pretty efficient – MichaelChirico Jun 06 '18 at 13:26
  • @MichaelChirico, see the updated benchmark. It seems that the `combnPrim` approach is much slower. I am not sure whether this is due to the `transpose` step. – mt1022 Jun 06 '18 at 13:41
  • when investigating my own original answer, I used the code profiler built into RStudio, it's quite convenient; I suspected `transpose` was at fault but in my case `combn` was the clear culprit – MichaelChirico Jun 06 '18 at 16:48