I'm trying to speed up the creation of a table with all possible combinations of two vectors. Base R provides this functionality through expand.grid(). However, I was wondering whether the same result can be obtained faster using tools from the {collapse} package.

There is an existing StackOverflow thread about this topic here. But even the fastest solution provided there turns out to be the slowest in the following case. Although tidyr::expand_grid() is faster than base R, I still hope that the collapse package can give faster processing times.

#library(collapse)
#library(tidyr)
library(babynames)

year  <- collapse::funique(babynames$year, sort = TRUE)
names <- collapse::funique(babynames$name)

expand.grid.jc <- function(seq1,seq2) { ## from https://stackoverflow.com/a/10407457/6105259
  as.data.frame(cbind(Var1 = rep.int(seq1, length(seq2)), 
                      Var2 = rep.int(seq2, rep.int(length(seq1),length(seq2)))))
}

my_benchmarking <- 
  bench::mark(base = expand.grid(year, names),
              jc = expand.grid.jc(year, names),
              tidyr = tidyr::expand_grid(year, names), check = FALSE, iterations = 10)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

my_benchmarking
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base        965.3ms    1.06s    0.938      701MB    2.35 
#> 2 jc            13.1s   13.39s    0.0747     820MB    0.120
#> 3 tidyr       541.2ms 656.71ms    1.55       316MB    1.24

Created on 2021-08-22 by the reprex package (v2.0.0)

I would be happy to learn whether this task can be computed faster.

Emman
  • Never, ever do as.data.frame(cbind(...)). Use data.frame(...). – Roland Aug 22 '21 at 10:09
  • Don't know if `collapse` provides a similar function, but you can try `data.table::CJ` which is [fast](https://stackoverflow.com/a/18541620/1851712). Also, use larger data (e.g. `V1 = 1:1e4` and `V2 = 1:1e4`) to find that `base::expand.grid` is faster than `tidyr::expand_grid` (benchmarking on subsecond data is rarely relevant). – Henrik Aug 22 '21 at 10:16
  • @Henrik, thanks. I'm intrigued by your comment about the kind of data to use when benchmarking. The data in my real situation is more similar to `babynames` than to `1:1e4`, as I have character data and not integers. – Emman Aug 22 '21 at 14:45
  • @Emman Thank you for your feedback. I understand if my comment may have come across as unclear/irrelevant. My point (if any) was more about the total number of combinations of the two input vectors used for timing, rather than character vs. integer. It may well be that the sizes I used are less representative of your data. Cheers. – Henrik Aug 22 '21 at 14:53
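
Roland's comment can be illustrated with a small sketch. cbind() on a numeric and a character vector first builds a character matrix, so the year column loses its numeric type before as.data.frame() is ever called. The function name expand.grid.df below is hypothetical (it is not from the question or the linked answer), and its timing has not been measured here; it only shows the data.frame() form of expand.grid.jc on toy inputs.

```r
# Toy inputs standing in for the `year` and `names` vectors in the question
yr <- c(1880, 1881)
nm <- c("Mary", "Anna", "Emma")

# cbind() coerces both columns to a common type, so Var1 is no longer numeric
bad <- as.data.frame(cbind(Var1 = yr, Var2 = nm[1:2]))

# Hypothetical variant of expand.grid.jc that calls data.frame() directly,
# so each column keeps its own type
expand.grid.df <- function(seq1, seq2) {
  data.frame(
    Var1 = rep.int(seq1, length(seq2)),
    Var2 = rep.int(seq2, rep.int(length(seq1), length(seq2))),
    stringsAsFactors = FALSE
  )
}

res <- expand.grid.df(yr, nm)
ref <- expand.grid(Var1 = yr, Var2 = nm, stringsAsFactors = FALSE)
str(res)
```

The row order of res matches expand.grid(), with the first column varying fastest.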

1 Answer

You may try the data.table::CJ function.

bench::mark(base = expand.grid(year, names),
            jc = expand.grid.jc(year, names),
            tidyr1 = tidyr::expand_grid(year, names), 
            tidyr2 = tidyr::crossing(year, names), 
            dt = data.table::CJ(year, names),
            check = FALSE, iterations = 10)

#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory  time   gc   
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>  <list> <lis>
#1 base       635.48ms 715.02ms     1.25      699MB    2.00     10    16      8.02s <NULL> <Rprof… <benc… <tib…
#2 jc            5.66s    5.76s     0.172     820MB    0.275    10    16     58.13s <NULL> <Rprof… <benc… <tib…
#3 tidyr1     195.03ms 268.97ms     4.01      308MB    2.00     10     5       2.5s <NULL> <Rprof… <benc… <tib…
#4 tidyr2     590.91ms 748.35ms     1.31      312MB    0.656    10     5      7.62s <NULL> <Rprof… <benc… <tib…
#5 dt          318.1ms 384.21ms     2.47      206MB    0.986    10     4      4.06s <NULL> <Rprof… <benc… <tib…

PS: I also included tidyr::crossing for comparison, as it does the same thing.
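
One thing to keep in mind when swapping CJ in for expand.grid(): the two do not return rows in the same order. CJ varies the last column fastest, and by default it also sorts its inputs and sets a key. A minimal sketch on toy vectors (the small yr and nm inputs and the explicit column names are illustrative assumptions, not part of the benchmark above):

```r
library(data.table)

# Toy inputs standing in for the `year` and `names` vectors in the question
yr <- c(1880, 1881)
nm <- c("Mary", "Anna")

# Default: inputs are sorted and the result is keyed on all columns
dt_sorted <- CJ(year = yr, name = nm)

# sorted = FALSE keeps the input order and sets no key
dt_unsorted <- CJ(year = yr, name = nm, sorted = FALSE)

# expand.grid() varies the FIRST column fastest instead
eg <- expand.grid(year = yr, name = nm, stringsAsFactors = FALSE)

print(dt_sorted)
print(dt_unsorted)
print(eg)
```

If downstream code depends on the expand.grid() row order, the result would need to be reordered, which may offset part of the speed gain.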

Ronak Shah
  • That's a fantastic comparison, @RonakShah !! I was trying to make some speed improvements in my code, but I noticed some differences among those functions when receiving a data.frame. Could you please look at this question I posted regarding this issue? https://stackoverflow.com/questions/72490291/replicate-expand-grid-behavior-with-data-frames-using-tidyr-data-table Thank you a lot! – Álvaro A. Gutiérrez-Vargas Jun 03 '22 at 13:33