
I have a df in the following format and am trying to get a data frame with all the pairwise combinations per group:

df<-structure(list(id = c(209044052, 209044061, 209044061, 209044061,209044062, 209044062, 209044062, 209044182, 209044183, 209044295), group = c(2365686, 387969, 388978, 2365686, 387969, 388978, 2365686, 2278460, 2278460, 654238)), .Names = c("id", "group"), row.names = c(NA, -10L), class = "data.frame")

While do.call(rbind, lapply(split(df, df$group), function(i) expand.grid(i$id, i$id))) works for a small data frame, I run into time problems on my large data (~12 million obs. and ~1.5 million groups).

After some testing I noticed that the split command seems to be the bottleneck, and expand.grid might also not be the fastest solution.

I found some improvements for expand.grid (Use outer instead of expand.grid) and some faster split alternatives (Improving performance of split() function in R?), but I struggle to put it all together with the grouping.

Output should be something like

  Var1      Var2
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044295 209044295
209044182 209044182
209044183 209044182
....

As an extra I would like to exclude self-references (e.g. 209044061 209044061 above) and keep only one combination if the same pair occurs in both orders (e.g. 209044061 209044062 and 209044062 209044061 above), i.e. combinations without repetitions. I tried gtools with `combinations()` but could not figure out whether this slows down the calculation even more.
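What I tried with gtools looks roughly like this (just a sketch of my attempt; the guard skips groups with a single id because combinations() needs at least two elements):

library(gtools)
do.call(rbind, lapply(split(df$id, df$group), function(i) {
  u <- unique(i)
  if (length(u) >= 2) combinations(length(u), 2, u) else NULL
}))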


1 Answer


One possible solution which avoids repetitions of the same pair as well as different orders is using the data.table and combinat packages:

library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

order(id) is used here just for convenience to better check the results but can be skipped in production code.
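If you prefer to avoid the combinat dependency, a base R combn() call should yield the same pairs; a sketch (the guard is needed because combn() interprets a single numeric value as seq_len() and some groups contain only one id):

setDT(df)[order(id), {
  u <- unique(id)
  if (length(u) >= 2L) as.data.table(t(combn(u, 2L))) else NULL
}, by = group]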

Replace combn2() with a non-equi join

There is another approach where the call to combn2() is replaced by a non-equi join:

mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
    allow.cartesian = TRUE]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

Note that the non-equi join requires the data to be ordered.

Benchmark

The second method seems to be much faster:

# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
  id = sample.int(rg, nr, TRUE),
  group = sample.int(ng, nr, TRUE)
)

#benchmark code
microbenchmark::microbenchmark(
  combn2 = df2[order(group, id), data.table((combinat::combn2(unique(id)))), by = group],
  nej = {
    mdf <- df2[order(group, id), unique(id), by = group]
    mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
        allow.cartesian = TRUE]},
  times = 1L)

For 120000 rows and 14994 groups the timings are:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1

Caveat

As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations is O(n^2), exactly n * (n - 1) / 2 or choose(n, 2L), where n is the number of unique ids in a group. A single group with 200,000 distinct ids, for instance, already yields choose(200000, 2) = 19,999,900,000 pairs, i.e., about 2 * 10^10 rows.

The size of the largest group can be found by

df2[, uniqueN(id), by = group][, max(V1)]

The total number of rows in the final result can be computed in advance by

df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]
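If this estimate is too large to handle (the OP hit a group of about 200,000 ids), one way out is to exclude or treat the oversized groups separately before the join. A rough sketch, where max_n is a hypothetical threshold to be chosen according to the available memory:

max_n <- 10000L   # hypothetical threshold, adjust to available memory
grp_size <- df2[, .(n = uniqueN(id)), by = group]
big_groups <- grp_size[n > max_n, group]   # groups whose pair count would explode
mdf <- df2[!group %in% big_groups][order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
    allow.cartesian = TRUE]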
  • I run into `negative length vectors are not allowed`, which is due to memory? @Uwe – CER Oct 25 '17 at 17:24
  • Have you tried the non-equi join version? Using the artificial benchmark data with 12 M rows and 1.5 M groups it took 14 seconds without memory issues. What is the maximum number of `id` per `group`, e.g., `df2[, uniqueN(id), by = group][, max(V1)]`? – Uwe Oct 25 '17 at 18:10
  • Wow, it worked out now. The hint to check the ids per group helped. The biggest group (200,000 ids) was too much to handle, I assume, because once I took it out your code ran through. Maybe that could be an addition, as future OPs might run into this large-group problem. – CER Oct 26 '17 at 01:13
  • I have added a caveat to check the size of the largest group and to compute the number of rows in the final result *in advance*. Thanks for suggesting. – Uwe Oct 26 '17 at 07:32