I have a data frame (df) in the following format and am trying to build a data frame with all pairwise combinations of id within each group:
df<-structure(list(id = c(209044052, 209044061, 209044061, 209044061,209044062, 209044062, 209044062, 209044182, 209044183, 209044295), group = c(2365686, 387969, 388978, 2365686, 387969, 388978, 2365686, 2278460, 2278460, 654238)), .Names = c("id", "group"), row.names = c(NA, -10L), class = "data.frame")
While do.call(rbind, lapply(split(df, df$group), function(i) expand.grid(i$id, i$id)))
works for a small data frame, it takes far too long on my real data (~12 million observations and ~1.5 million groups).
After some testing, I found that the split call seems to be the bottleneck, and expand.grid might not be the fastest solution either.
I found some improvements for expand.grid ("Use outer instead of expand.grid") and some faster split alternatives ("Improving performance of split() function in R?"), but I am struggling to put it all together with grouping.
The output should look something like this:
Var1 Var2
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044295 209044295
209044182 209044182
209044183 209044182
....
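A minimal sketch of one way to produce this kind of output without split or expand.grid, assuming a base R self-join on group via merge is acceptable. The name pairs is a hypothetical placeholder, the row order may differ from the listing above, and this has not been benchmarked at the scale described:

```r
# Example data from the question, rebuilt with data.frame()
df <- data.frame(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969,
            388978, 2365686, 2278460, 2278460, 654238)
)

# Self-join on group: each row is paired with every row of the same group,
# which yields all ordered pairs (including self-pairs) per group.
pairs <- merge(df, df, by = "group", suffixes = c("_1", "_2"))

# Rename to match the Var1/Var2 layout produced by expand.grid
names(pairs) <- c("group", "Var1", "Var2")
```

Because the join is done in a single call, there is no per-group function call and no rbind over 1.5 million pieces. The result still grows with the sum of the squared group sizes, so memory can become the limit when some groups are large.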
As an extra, I would like to exclude repetitions of the same pair: drop self-references (e.g. 209044061 209044061 above) and keep only one combination when the same two ids appear in different orders (e.g. 209044061 209044062 and 209044062 209044061 above), i.e. combinations without repetition. I tried combinations() from library(gtools), but could not figure out whether this slows down the calculation even more.
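A hedged sketch of how the "combinations without repetition" requirement could be expressed on top of a self-join, without gtools. It assumes ids are numeric, so that keeping only rows where the first id is smaller than the second both removes self-pairs and keeps one order of each pair. The names pairs, per_group and across_groups are hypothetical, and whether a pair that occurs in several groups should be kept once per group or once overall is an assumption, so both variants are shown:

```r
# Example data from the question, rebuilt with data.frame()
df <- data.frame(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969,
            388978, 2365686, 2278460, 2278460, 654238)
)

# All ordered pairs per group via a self-join on group
pairs <- merge(df, df, by = "group", suffixes = c("_1", "_2"))

# Keep id_1 < id_2: drops self-pairs and one of the two orderings
per_group <- pairs[pairs$id_1 < pairs$id_2, ]

# Assumption: a pair should appear only once even if it occurs in several groups
across_groups <- unique(per_group[, c("id_1", "id_2")])
```

On the example data, per_group has 6 rows (one pair each from the two-member groups and three pairs from the three-member group), and across_groups has 4 rows, because the pair 209044061/209044062 occurs in three groups.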