expand.grid for data.frame using dplyr full_join

Question

I am trying to get a function similar to expand.grid and works on data.frame.

I found a solution in Alternative to expand.grid for data.frames that uses merge function to implement this.

Since merge is quite slow compared to dplyr alternative full_join, so I try to use full_join to implement this function, but I couldn't get it done correctly. Here is an example I failed:

df <- data.frame(attribute = paste0('attr', rep(1:5, each=2)),
                 value = paste0(rep(1:5, each=2), rep(c('A','B'), 2)),
                 score = runif(10))
df
   attribute value      score
1      attr1    1A 0.75600171
2      attr1    1B 0.07086242
3      attr2    2A 0.92403325
4      attr2    2B 0.63414169
5      attr3    3A 0.78763834
6      attr3    3B 0.88576568
7      attr4    4A 0.75998967
8      attr4    4B 0.25205845
9      attr5    5A 0.99304728
10     attr5    5B 0.70389605

I tried to split df by attribute and join the list of score together:

dfList <- df %>%
  mutate(attribute=1) %>%
  split(df$attribute)

And I "expand.grid" all these 5 tables together by:

Reduce(function(x, y) {full_join(x, y, by=c('attribute'='attribute'))}, dfList)

However, the result is weird:

   attribute value.x    score.x value.y   score.y value.x    score.x value.y   score.y value     score
1          1      1A 0.75600171      2A 0.9240333      1A 0.75600171      2A 0.9240333    5A 0.9930473
2          1      1A 0.75600171      2A 0.9240333      1A 0.75600171      2A 0.9240333    5B 0.7038961
3          1      1A 0.75600171      2A 0.9240333      1A 0.75600171      2A 0.9240333    5A 0.9930473
4          1      1A 0.75600171      2A 0.9240333      1A 0.75600171      2A 0.9240333    5B 0.7038961
...

The first 2 tables are shown twice, which is not desired. But when I try this on the first 4 tables it works perfectly:

Reduce(function(x, y) {full_join(x, y, by=c('attribute'='attribute'))}, dfList[1:4])

   attribute value.x    score.x value.y   score.y value.x   score.x value.y   score.y
1          1      1A 0.75600171      2A 0.9240333      3A 0.7876383      4A 0.7599897
2          1      1A 0.75600171      2A 0.9240333      3A 0.7876383      4B 0.2520584
3          1      1A 0.75600171      2A 0.9240333      3B 0.8857657      4A 0.7599897
4          1      1A 0.75600171      2A 0.9240333      3B 0.8857657      4B 0.2520584 
...

Where I did wrong?

I'm using dplyr 0.4.3 with R version 3.2.4 on Ubuntu 14.04

It works fine for me, straight 1-5. Also, you can just say `by = 'attribute'` if the names are the same. — alistaire, Mar 24 '16 at 06:03
What version is your `dplyr`? I'm using `0.4.3` with R version `3.2.4` — Bamqf, Mar 24 '16 at 07:51

inscaven · Accepted Answer · 2016-03-28T08:06:11.990

I can reproduce the corrupted result for your dfList on my machine. It seems to me I've found out why it's happening.

require(dplyr)

adf <- data.frame(c1 = 7, c1 = 8, jv = 1, check.names = F)
bdf <- data.frame(d1 = 1:3, d2 = letters[1:3], jv = 1)
cdf <- data.frame(v1.x = 1:3, v2 = letters[1:3], jv = 1)
ddf <- data.frame(v2 = 4:5, v2.x = letters[4:5], jv = 1)

full_join(adf, bdf, by = "jv") 

  c1 c1 jv d1 d2
1  7  7  1  1  a
2  7  7  1  2  b
3  7  7  1  3  c

We can notice that having duplicated column names in adf leads to the wrong result of joining. And when we apply a chain of several joins with the help of Reduce, the automatic renaming of duplicated column names happens (with adding .x and .y by default). This may lead to producing another duplicated names (the opposite to the thing it's intended to avoid).

full_join(cdf, ddf, by = "jv")

  v1.x v2.x jv v2.y v2.x
1    1    a  1    4    d
2    1    a  1    5    e
3    2    b  1    4    d
4    2    b  1    5    e
5    3    c  1    4    d
6    3    c  1    5    e

Here we had one duplication of names in different data.frames - column v2, which was replaced with another duplication after applying suffixes - v2.x.

So, to make things work well, we should care about unique names of columns in data.frames we're joining.

I've tried several approaches to get the desired result and want to present what they are.

base R solution using merge, it's made for speed comparison.
an approach using full_join from dplyr package
a data.table solution using sequential merge of dts
a function based on tidyr's unnest
another data.table solution which first generates the key-table of target resulting length (with the help of CJ) and then makes several left joins
the same as previous but using on parameter for joining instead of setting keys

require(data.table)
require(dplyr)
require(tidyr)
require(stringi)
require(microbenchmark)

expand.grid.df_base <- function(...) {
  dfList <- list(...)
  if (length(dfList) == 1) dfList <- dfList[[1]]

  if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))

  lapply(1:length(dfList), function(i) 
    data.frame(dfN = i, colN = 1:length(dfList[[i]]), 
               dfname = names(dfList)[i], colname = names(dfList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>% 
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]

  Reduce(function(x, y) merge(x, y, by = NULL), dfList) %>% setNames(manage_names$out_names)
}

expand.grid.df_dplyr <- function(...) {
  dfList <- list(...)
  if (length(dfList) == 1) dfList <- dfList[[1]]

  if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))

  lapply(1:length(dfList), function(i) 
    data.frame(dfN = i, colN = 1:length(dfList[[i]]), 
               dfname = names(dfList)[i], colname = names(dfList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>% 
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]

  joinvar <- stri_rand_strings(1, 12)

  Reduce(function(x, y) {
    mutate_def <- list(1L)
    names(mutate_def) <- joinvar

    full_join(x %>% mutate_(.dots = mutate_def), y %>% mutate_(.dots = mutate_def), by = joinvar)
  }, dfList) %>% select(-contains(joinvar)) %>% setNames(manage_names$out_names) %>% tbl_df
}

expand.grid.dt <- function(...) {
  dtList <- list(...)
  if (length(dtList) == 1) dtList <- dtList[[1]]

  if (!all(sapply(dtList, is.data.table))) dtList <- lapply(dtList, as.data.table)

  if (is.null(names(dtList))) setnames(dtList, paste0("dt", 1:length(dtList)))

  lapply(1:length(dtList), function(i) 
    data.frame(dfN = i, colN = 1:length(dtList[[i]]), 
               dfname = names(dtList)[i], colname = names(dtList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>% 
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) setnames(dtList[[manage_names$dfN[i]]], old = manage_names$colN[i], new = manage_names$dum_names[i])

  joinvar <- stri_rand_strings(1, 12)

  setnames(Reduce(function(x, y) merge(copy(x)[,(joinvar) := 1], copy(y)[,(joinvar) := 1], 
                                       by = joinvar, all = T, allow.cartesian = T), dtList)[,(joinvar) := NULL],
           manage_names$out_names)[]
}

expand.grid.df_tidyr <- function(...) {
  dfList <- list(...)
  if (length(dfList) == 1) dfList <- dfList[[1]]

  if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))

  lapply(1:length(dfList), function(i) 
    data.frame(dfN = i, colN = 1:length(dfList[[i]]), 
               dfname = names(dfList)[i], colname = names(dfList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>% 
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]

  Reduce(function(x, y) x %>% rowwise %>% mutate(dfcol = list(y)) %>% ungroup %>% unnest(dfcol), dfList) %>% 
    setNames(manage_names$out_names) %>% tbl_df
}

expand.grid.dt2 <- function(...) {
  dtList <- list(...)
  if (length(dtList) == 1) dtList <- dtList[[1]]

  dum_names <- stri_rand_strings(length(dtList), 12)

  dtList <- lapply(1:length(dtList), function(i) 
    setkeyv(as.data.table(dtList[[i]])[, (dum_names[i]) := .I], dum_names[i]))

  Reduce(function(result, dt) setkeyv(result, names(result)[1])[dt][, (names(result)[1]) := NULL],
         dtList,
         setnames(do.call(CJ, c(sapply(dtList, function(df) seq_len(nrow(df))), list(sorted = F))), dum_names))[]
}

expand.grid.dt3 <- function(...) {

  dtList <- list(...)
  if (length(dtList) == 1) dtList <- dtList[[1]]

  dum_names <- stri_rand_strings(length(dtList), 12)

  dtList <- lapply(1:length(dtList), function(i) as.data.table(dtList[[i]])[, (dum_names[i]) := .I])

  Reduce(function(result, dt) result[dt, on = names(result)[1]][, (names(result)[1]) := NULL],
         dtList,
         setnames(do.call(CJ, c(sapply(dtList, function(df) seq_len(nrow(df))), list(sorted = F))), dum_names))[]
}

Now lets create lists of data.frames for testing this functions.

set.seed(1)
bigdfList <- data.frame(type = sample(letters[1:10], 50, T),
                        categ = sample(LETTERS[1:10], 50, T),
                        num = sample(100L:500L, 50, T),
                        val = rnorm(50)) %>% split(., .$type)

smalldfList <- data.frame(type = sample(letters[1:5], 50, T),
                          categ = sample(LETTERS[1:5], 50, T),
                          num = sample(100L:500L, 50, T),
                          val = rnorm(50)) %>% split(., .$type)

The expand joinig of smalldfList produces a table of dimension [60,480 x 20] and of bigdfList - [6,451,200 x 40] which occupies 1230.5 MB of RAM.

Start with smalldfList.

microbenchmark(expand.grid.df_base(smalldfList), expand.grid.df_dplyr(smalldfList), 
               expand.grid.dt(smalldfList), expand.grid.df_tidyr(smalldfList),
               expand.grid.dt2(smalldfList), expand.grid.dt3(smalldfList), times = 10)

Unit: milliseconds
                              expr       min        lq      mean    median        uq       max neval cld
  expand.grid.df_base(smalldfList) 178.36192 188.54955 201.28729 198.79644 209.86934 229.85360    10  b 
 expand.grid.df_dplyr(smalldfList)  16.04555  16.91327  18.91094  17.64907  18.45307  29.58192    10 a  
       expand.grid.dt(smalldfList)  20.33188  21.42275  26.30034  23.22873  31.66666  39.37922    10 a  
 expand.grid.df_tidyr(smalldfList) 722.06572 738.02188 801.41820 792.23725 859.96186 905.99190    10   c
      expand.grid.dt2(smalldfList)  32.22650  33.68353  36.89386  36.39713  37.39182  48.93550    10 a  
      expand.grid.dt3(smalldfList)  29.13399  30.69299  34.51265  34.03198  37.48651  41.73543    10 a

So, tidyr solution is not an option here at all, base merge is also quite slow. Other 4 functions on the bigdfList show following efficiency.

microbenchmark(expand.grid.df_dplyr(bigdfList), expand.grid.dt(bigdfList), 
               expand.grid.dt2(bigdfList), expand.grid.dt3(bigdfList), times = 10)

Unit: seconds
                            expr       min        lq      mean    median        uq       max neval  cld
 expand.grid.df_dplyr(bigdfList)  1.326336  1.354706  1.456805  1.449781  1.481836  1.703158    10 a   
       expand.grid.dt(bigdfList)  1.763174  1.820004  1.894813  1.893910  1.939879  2.127097    10  b  
      expand.grid.dt2(bigdfList) 14.164731 14.332872 14.452933 14.452221 14.551982 14.740852    10    d
      expand.grid.dt3(bigdfList) 10.589517 10.828548 11.104010 11.021519 11.368172 11.976976    10   c

And the dplyr::full_join solution has the best result!

Maybe, it's one of the options where dplyr is really better than data.table, maybe it's my lack of data.table knowledge, which has prevented me from making a really fast function :-)

My first suggestion would be to replace `as.data.table()` calls with `setDT()`. `as.data.table()` copies the entire data.table, which is totally unnecessary. The new `on` argument allows for updating/joining without having to set keys. That could be another improvement. But I'm writing vignettes for that at the moment.. Perhaps it'd be easier to follow and use that once it's done. Will link back. — Arun, Mar 26 '16 at 09:23
thanks for your comment, @Arun, I've edited my `dt2` function using syntax with `on`, it seems to help a lot - on my `smalldfList` data it is the fastest now. I'm on PC with not enough RAM to provide all speed tests, I will edit my answer and timings on Monday — inscaven, Mar 26 '16 at 10:47
Hmm... On another machine with more RAM but slower CPU I face quite different results, function `dt3` (based on joining with `on`) is better than `dt2` but remains underdog to `dplyr::full_join` and `merge.data.table` solutions. BTW, I use `as.data.table` instead of `setDT` not to modify initial data, because as I understand, `dplyr` will operate over `data.tables` using `data.table` package functions, so comparison between `df_dplyr` and `dt` will loose it's sense. Besides, copying happens only once, and the size of initial data is quite small, so it's not a bottleneck here. — inscaven, Mar 28 '16 at 08:24
Thanks for checking. Comparisons should be about the method (implementations), not about data conversions (overhead). `data.table` package is designed to operate on `data.tables`, and provides methods to directly start with one. The join method itself, I think, is quite quick here. I'm afraid therefore it's the `as.data.table()` overhead that you're measuring. This is also why it's better to benchmark on meaningful data sizes + timings (slightly larger). Any how, good luck :-). — Arun, Mar 28 '16 at 08:32

expand.grid for data.frame using dplyr full_join

1 Answers1