replicate `expand.grid()` behavior with data.frames using tidyr/data.table

Question

I am trying to speed up base::expand.grid(). I came across the excellent answer to "How to speed up `expand.grid()` in R?". However, I need the behavior that base::expand.grid() has when it is passed a data.frame, and the suggested (faster) functions behave differently in that case. For instance, this is the behavior I need:

x  <- c(.3,.6)
df <- as.data.frame(rbind(x, 1 - x))
df
##   V1  V2
## x 0.3 0.6
##   0.7 0.4
 
(base::expand.grid(df))
##   V1  V2
## 1 0.3 0.6
## 2 0.7 0.6
## 3 0.3 0.4
## 4 0.7 0.4

However, this is what I am getting out of faster functions:

library(tidyr)
library(data.table)
(tidyr::expand_grid(df))
## # A tibble: 2 × 2
##      V1    V2
##   <dbl> <dbl>
## 1   0.3   0.6
## 2   0.7   0.4

(tidyr::crossing(df))
## # A tibble: 2 × 2
##      V1    V2
##   <dbl> <dbl>
## 1   0.3   0.6
## 2   0.7   0.4

(as_tibble(data.table::CJ(df,sorted = FALSE)))
## # A tibble: 2 × 1
##   df$`` $``
##   <dbl> <dbl>
## 1   0.3   0.6
## 2   0.7   0.4

Do you know how I could tweak these functions so that they behave like base::expand.grid() when it receives a data.frame, without losing the performance gains?

Thank you in advance!

BTW: I am already aware of the existence of:

(1) Probably reduced for the sake of the question, but expanding your sample `df` is ridiculously trivial, and corner-cases notwithstanding, I suspect all benchmarks comparing `expand.grid` to anything else will likely be unusable. (2) Comparing the performance of `expand.grid(df$V1, df$V2)` with `tidyr::expand_grid(df$V1, df$V2)` and `tidyr::expand(df, V1, V2)` show clear dominance with `expand.grid`, again likely influenced by the sample size. Ultimately, (3) why are you trying to squeeze something faster than `expand.grid`? What's the problem-set that justifies the endeavor? — r2evans, Jun 03 '22 at 13:33
You can `do.call(CJ, df)`, but as colleague @r2evans indicates, it is not clear why this would be faster or preferable, especially with the example above – langtang, Jun 03 '22 at 13:36
Case in point: starting with `x <- seq(0, 1, len=101); df <- data.frame(V1=x, V2=1-x)`, a comparison of the three expressions in my previous comment shows that `expand.grid` is over twice as fast as `tidyr::expand_grid` and over 15x faster than `tidyr::expand`. — r2evans, Jun 03 '22 at 13:36
But to be fair, when we start with `x <- seq(0, 1, len=1001)` (producing an expansion with 1Mi rows), that's when the ratio is reversed, as `tidyr::expand` is now 1.7x faster than `expand.grid`. With this added context, I suggest you update your question to either (a) identify your expected dimensionality/size, and/or (b) perhaps give some context why you really need to squeeze something from this stone. (Even in this example, langtang's suggestion is still 7x faster than `expand.grid`, so it shows clear dominance in all samples thus far.) — r2evans, Jun 03 '22 at 13:39
Thank you, @r2evans, for your comments. If I am not mistaken, in the comparison made [here](https://stackoverflow.com/questions/68880025/how-to-speed-up-expand-grid-in-r) the `data.table::CJ` function is around twice as fast as `base::expand.grid()`. However, the behavior I want to mimic is the one that `base::expand.grid()` has when receiving a data.frame. Additionally, for context, [this is the function](https://stackoverflow.com/a/70667708/10714156) I want to speed up **drastically** because I am using around 10.000 draws. — Álvaro A. Gutiérrez-Vargas, Jun 03 '22 at 13:43
How are `do.call(CJ, df)` and `expand.grid(df)` different? Other than row order, they produce effectively-identical results. — r2evans, Jun 03 '22 at 13:47
@r2evans they are not different. However, I *just* noticed that this was what I was searching for :)! — Álvaro A. Gutiérrez-Vargas, Jun 03 '22 at 13:48
Really, then, the only difference between this and https://stackoverflow.com/questions/68880025/how-to-speed-up-expand-grid-in-r (from your comment-link) is the use of `do.call(..., df)` that ThomasIsCoding has introduced in their answer. Not strictly a dupe because of that fact, but it seems now they are very closely related. Glad you found what you needed! — r2evans, Jun 03 '22 at 13:50
A little experimenting has me thinking the fastest option within your use case function is `do.call(data.table::CJ, list(x, 1 - x))`; i.e., don't make it into a data.frame first. You might look at where the biggest slowdown is, e.g., with `profvis`. — lhs, Jun 03 '22 at 14:12

score 4 · Accepted Answer · answered Jun 03 '22 at 13:46

Try with `do.call`, which passes each column of `df` to the function as a separate argument:

> do.call(tidyr::expand_grid, df)
# A tibble: 4 x 2
     V1    V2
  <dbl> <dbl>
1   0.3   0.6
2   0.3   0.4
3   0.7   0.6
4   0.7   0.4

> do.call(tidyr::crossing, df)
# A tibble: 4 x 2
     V1    V2
  <dbl> <dbl>
1   0.3   0.4
2   0.3   0.6
3   0.7   0.4
4   0.7   0.6

> do.call(data.table::CJ, df)
    V1  V2
1: 0.3 0.4
2: 0.3 0.6
3: 0.7 0.4
4: 0.7 0.6

answered Jun 03 '22 at 13:46

ThomasIsCoding

Thank you a lot @ThomasIsCoding!! Probably a rookie question but, how could I pass the argument `sorted = FALSE` to `data.table::CJ()` when using it inside of the `do.call()` function? – Álvaro A. Gutiérrez-Vargas Jun 03 '22 at 14:12
`do.call(CJ, c(df,sorted=F))` – langtang Jun 03 '22 at 14:17
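
A hedged sketch of the comment above, assuming the data.table package is installed (the block is skipped otherwise). Calling `c()` on a data.frame returns a plain list, so `sorted = FALSE` simply becomes one more named argument. The second part, which reverses the columns to reproduce the row order of `expand.grid()`, is an addition for illustration and not something suggested in the thread; `cj`, `cj_rev`, `cj_like_base` and `base_grid` are illustrative names.

```r
x  <- c(.3, .6)
df <- as.data.frame(rbind(x, 1 - x))

if (requireNamespace("data.table", quietly = TRUE)) {
  # c(df, sorted = FALSE) is a list: the columns V1, V2 plus the extra argument.
  cj <- do.call(data.table::CJ, c(df, sorted = FALSE))

  # CJ() varies the last column fastest, expand.grid() the first.
  # Assumption for illustration: reverse the columns before the call,
  # then restore the original column order afterwards.
  cj_rev       <- do.call(data.table::CJ, c(rev(df), sorted = FALSE))
  cj_like_base <- as.data.frame(cj_rev)[, names(df)]

  base_grid <- expand.grid(df, KEEP.OUT.ATTRS = FALSE)
  print(cj_like_base)
}
```

Writing `FALSE` rather than `F` is slightly safer, since `F` is an ordinary variable that can be reassigned.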

score 0 · Answer 2 · edited Jun 03 '22 at 17:11

Try tidyr::expand()

tidyr::expand(df,df[,1],df[,2])

edited Jun 03 '22 at 17:11

Wai Ha Lee

answered Jun 03 '22 at 13:45

Asitav Sen

If you're going to use `tidyr::expand` (you should be explicit about its package since it is not base R), the use of `df[,1]` defeats the intent of non-standard evaluation, and is an anti-pattern in almost all of the tidyverse. I suggest this should really be `tidyr::expand(df, V1, V2)` to be a little more relevant. – r2evans Jun 03 '22 at 13:49
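
A small sketch of the form suggested in this comment, assuming tidyr is installed (the block is skipped otherwise); `combos` is an illustrative name. The columns are referenced by bare name rather than by `df[, 1]`.

```r
x  <- c(.3, .6)
df <- as.data.frame(rbind(x, 1 - x))

if (requireNamespace("tidyr", quietly = TRUE)) {
  # Columns are referred to by name, in the usual tidyverse style.
  combos <- tidyr::expand(df, V1, V2)
  print(combos)
}
```

As with `tidyr::crossing()` in the accepted answer, the rows are expected to come back in sorted order, so the row order may differ from `expand.grid(df)` even though the set of combinations is the same.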
