I am trying to speed up the base::expand.grid() function. I came across this great answer: How to speed up `expand.grid()` in R?. However, I need the behavior that base::expand.grid() has when it is passed a data.frame, and the suggested (faster) functions behave slightly differently when they receive a data.frame. For instance, this is the behavior I need:

x  <- c(.3,.6)
df <- as.data.frame(rbind(x, 1 - x))
df
##   V1  V2
## x 0.3 0.6
##   0.7 0.4
 
(base::expand.grid(df))
##   V1  V2
## 1 0.3 0.6
## 2 0.7 0.6
## 3 0.3 0.4
## 4 0.7 0.4

However, this is what I get from the faster functions:

library(tidyr)
library(data.table)
(tidyr::expand_grid(df))
## # A tibble: 2 × 2
##      V1    V2
##   <dbl> <dbl>
## 1   0.3   0.6
## 2   0.7   0.4

(tidyr::crossing(df))
## # A tibble: 2 × 2
##      V1    V2
##   <dbl> <dbl>
## 1   0.3   0.6
## 2   0.7   0.4

(as_tibble(data.table::CJ(df,sorted = FALSE)))
## # A tibble: 2 × 1
##   df$``   $``
##   <dbl> <dbl>
## 1   0.3   0.6
## 2   0.7   0.4

Do you know how I could tweak these functions so that they behave like base::expand.grid() when it receives a data.frame, without losing the performance gains?

Thank you in advance!


BTW: I am already aware of the existence of:

  • (1) Probably reduced for the sake of the question, but expanding your sample `df` is ridiculously trivial, and corner-cases notwithstanding, I suspect all benchmarks comparing `expand.grid` to anything else will likely be unusable. (2) Comparing the performance of `expand.grid(df$V1, df$V2)` with `tidyr::expand_grid(df$V1, df$V2)` and `tidyr::expand(df, V1, V2)` shows clear dominance for `expand.grid`, again likely influenced by the sample size. Ultimately, (3) why are you trying to squeeze out something faster than `expand.grid`? What's the problem-set that justifies the endeavor? – r2evans Jun 03 '22 at 13:33
  • You can `do.call(CJ,df)`, but as colleague @r2evans indicates, it is not clear why this would be faster/preferable, especially with the example above – langtang Jun 03 '22 at 13:36
  • Case in point: starting with `x <- seq(0, 1, len=101); df <- data.frame(V1=x, V2=1-x)`, a comparison of the three expressions in my previous comment shows that `expand.grid` is over twice as fast as `tidyr::expand_grid` and over 15x faster than `tidyr::expand`. – r2evans Jun 03 '22 at 13:36
  • But to be fair, when we start with `x <- seq(0, 1, len=1001)` (producing an expansion with 1Mi rows), that's when the ratio is reversed, as `tidyr::expand` is now 1.7x faster than `expand.grid`. With this added context, I suggest you update your question to either (a) identify your expected dimensionality/size, and/or (b) perhaps give some context why you really need to squeeze something from this stone. (Even in this example, langtang's suggestion is still 7x faster than `expand.grid`, so it shows clear dominance in all samples thus far.) – r2evans Jun 03 '22 at 13:39
  • Thank you, @r2evans, for your comments. If I am not mistaken, in the comparison made [here](https://stackoverflow.com/questions/68880025/how-to-speed-up-expand-grid-in-r) the `data.table::CJ` function is around twice as fast as `base::expand.grid()`. However, the behavior I want to mimic is the one that `base::expand.grid()` has when receiving a data.frame. Additionally, for context, [this is the function](https://stackoverflow.com/a/70667708/10714156) I want to speed up **drastically** because I am using around 10.000 draws. – Álvaro A. Gutiérrez-Vargas Jun 03 '22 at 13:43
  • How are `do.call(CJ, df)` and `expand.grid(df)` different? Other than row order, they produce effectively-identical results. – r2evans Jun 03 '22 at 13:47
  • @r2evans they are not different. However, I *just* noticed that this was what I was searching for :)! – Álvaro A. Gutiérrez-Vargas Jun 03 '22 at 13:48
  • Really, then, the only difference between this and https://stackoverflow.com/questions/68880025/how-to-speed-up-expand-grid-in-r (from your comment-link) is the use of `do.call(..., df)` that ThomasIsCoding has introduced in their answer. Not strictly a dupe because of that fact, but it seems now they are very closely related. Glad you found what you needed! – r2evans Jun 03 '22 at 13:50
  • A little experimenting has me thinking the fastest option within your use case function is `do.call(data.table::CJ, list(x, 1 - x))`; i.e., don't make it into a data.frame first. You might look at where the biggest slowdown is, e.g., with `profvis`. – lhs Jun 03 '22 at 14:12
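
The timing comparison discussed in the comments can be sketched roughly as follows. This is only an illustration, not the commenters' actual benchmark code: the input size follows the `len=1001` case mentioned above, the variable names (`t_base`, `t_cj`, `t_list`) are made up for this example, and a single `system.time()` call per approach is a crude measurement whose results depend on the machine.

```r
library(data.table)

# Larger input, following the 1001-values-per-column case from the comments.
x  <- seq(0, 1, length.out = 1001)
df <- data.frame(V1 = x, V2 = 1 - x)

# Time each approach once; absolute numbers vary from machine to machine.
t_base <- system.time(res_base <- expand.grid(df))
t_cj   <- system.time(res_cj   <- do.call(CJ, df))
# Variant suggested by lhs: skip the data.frame and pass a plain list.
t_list <- system.time(res_list <- do.call(CJ, list(x, 1 - x)))

# Elapsed seconds for the three approaches, side by side.
c(expand.grid = t_base[["elapsed"]],
  CJ_df       = t_cj[["elapsed"]],
  CJ_list     = t_list[["elapsed"]])
```

For more stable numbers, a dedicated benchmarking package that repeats each expression many times would likely be a better choice than a single timed run.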

2 Answers

Try with do.call

> do.call(tidyr::expand_grid, df)
# A tibble: 4 x 2
     V1    V2
  <dbl> <dbl>
1   0.3   0.6
2   0.3   0.4
3   0.7   0.6
4   0.7   0.4

> do.call(tidyr::crossing, df)
# A tibble: 4 x 2
     V1    V2
  <dbl> <dbl>
1   0.3   0.4
2   0.3   0.6
3   0.7   0.4
4   0.7   0.6

> do.call(data.table::CJ, df)
    V1  V2
1: 0.3 0.4
2: 0.3 0.6
3: 0.7 0.4
4: 0.7 0.6
ThomasIsCoding
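
One way to see why `do.call` helps: a data.frame is a list of columns, so `do.call(f, df)` passes each column as a separate named argument, whereas `f(df)` passes the whole data.frame as a single argument. The sketch below is an illustration under stated assumptions, not part of the answer above. It rebuilds the question's `df` directly with `data.frame()`, and it assumes that `CJ(..., sorted = FALSE)` keeps the input order and varies the last argument fastest, so that reversing the columns before the call and restoring their order afterwards reproduces the row order of `base::expand.grid(df)`. The names `args`, `res`, and `same_as_base` are made up for this example.

```r
library(data.table)

# Same values as the question's df, built directly for brevity.
df <- data.frame(V1 = c(0.3, 0.7), V2 = c(0.6, 0.4))

# Reverse the columns so that V1 ends up varying fastest, as in expand.grid().
args <- c(rev(as.list(df)), list(sorted = FALSE))
res  <- do.call(data.table::CJ, args)

# Put the columns back in their original order (modifies res by reference).
data.table::setcolorder(res, names(df))

# Compare against base R, ignoring attributes such as "out.attrs".
same_as_base <- isTRUE(all.equal(as.data.frame(res), expand.grid(df),
                                 check.attributes = FALSE))
same_as_base
```

If row order does not matter, the plain `do.call(data.table::CJ, df)` from the answer is simpler and avoids the extra reordering step.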

Try tidyr::expand()

tidyr::expand(df,df[,1],df[,2])
Wai Ha Lee
Asitav Sen
  • If you're going to use `tidyr::expand` (you should be explicit about its package since it is not base R), the use of `df[,1]` defeats the intent of non-standard evaluation, and is an anti-pattern in almost all of the tidyverse. I suggest this should really be `tidyr::expand(df, V1, V2)` to be a little more relevant. – r2evans Jun 03 '22 at 13:49
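
The form suggested in the comment above can be tried with a short sketch like the following. It is a minimal illustration, assuming the question's `df` rebuilt directly with `data.frame()`. The name `res` is made up for this example, and the row order of the result is not asserted here.

```r
library(tidyr)

# Same values as the question's df, built directly for brevity.
df <- data.frame(V1 = c(0.3, 0.7), V2 = c(0.6, 0.4))

# Column names are passed bare (non-standard evaluation), not as df[, 1].
res <- tidyr::expand(df, V1, V2)
res
```

The result should contain one row for each combination of the distinct values in `V1` and `V2`, so four rows for this input.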