
We released the quickpsy package a few years ago (with an accompanying paper in The R Journal). The package used base R functions, but also made extensive use of functions from what was then called the Hadleyverse. We are now developing a new version that mostly uses functions from the tidyverse and incorporates the new non-standard evaluation approach, and we have found that the package is much slower (more than four times slower). For example, we found that purrr::map is much slower than dplyr::do (which is deprecated):

library(tidyverse)

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    do(head(., 2))
  )

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(temp = map(data, ~head(., 2))) %>% 
    unnest(temp)
)

We also found that functions like pull are very slow.
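For example, a minimal sketch of how this could be measured, assuming the bench package is installed (it is not part of our package):

library(bench)

# compare pull() with base R column extraction; all three calls return
# the same numeric vector, so bench::mark() can check that they agree
bench::mark(
  pull     = pull(mtcars, mpg),
  dollar   = mtcars$mpg,
  brackets = mtcars[["mpg"]]
)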

We are not sure whether the tidyverse is not meant to be used for this type of programming or whether we are not using it properly.

danilinares
  • If you care about speed `data.table` is going to be your friend. – s_baldur Oct 01 '18 at 13:06
  • Not to mention concision: `as.data.table(mtcars)[, .SD[1:2], by = cyl]` – s_baldur Oct 01 '18 at 13:13
  • There may be some people here who can help (though as it stands, this is fairly broad). One place you might ask about speed differences is in the [Rstudio community tidyverse site](https://community.rstudio.com/c/tidyverse). – lmo Oct 01 '18 at 13:20
  • [According to Hadley](https://stackoverflow.com/a/27840349/1286528): *We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data*. With `data.table` you can: `data.table(mtcars)[, .SD[1:2], cyl]` – pogibas Oct 01 '18 at 13:27
  • It is likely not `map()` that is slow, but nest and unnest. Perhaps because of selection? Tidyselect is written in pure R. Your `mtcars` benchmark does not demonstrate that the functions are slow though, just that they are slower. If the slowness increases with data frame size, then there might be a problem. – Lionel Henry Oct 01 '18 at 13:28
  • Your two sequences of code are not equivalent: you cannot both ask for more speed and do more computations. You can get faster answers with the tidyverse using other code, for example `filter(row_number() <= 2)`. – Nicolas2 Oct 01 '18 at 13:35
  • @lionel I used this toy example from the dplyr::do help. In the quickpsy package the data frames are much larger, and this type of computation is then very slow. It might be the nest and unnest, but given that do is deprecated, is there any alternative to using them? – danilinares Oct 01 '18 at 13:56
  • @danilinares I have done some more benchmarks, and I see the nest/unnest code is consistently twice as slow as the data frame grows. Can you confirm? – Lionel Henry Oct 01 '18 at 14:43
  • @snoram This example from the help of dplyr::do is simple, but in our package we use map (and most often map2) over data frames that include list columns (example: https://github.com/danilinares/quickpsy/blob/testingSpeed/R/apply_to_two_elements.R). I am not sure whether the data.table framework could be applied in those cases. – danilinares Oct 02 '18 at 05:44
  • @danilinares most probably it *can*; it may be a bit of work, though. I encourage you to explore the possibilities if you really want to optimise speed. – s_baldur Oct 02 '18 at 07:58
  • The ```mtcars``` dataset is too small to be used for performance benchmarks: 32 rows and 11 columns. Arguably you are measuring just random noise above. Take a dataset with a few hundred megabytes (i.e ncol x nrow >= 100M) to factor out most other effects in your system (data being cached in the CPU (L3) caches, normal runtime variance, concurrent processes, etc.). – Endre Aug 18 '19 at 06:17
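Following up on the comments about benchmarking on larger data, here is a sketch of the same comparison on a bigger data frame (the sizes are arbitrary, the bench package is assumed to be installed, and the filter() variant is the one suggested by Nicolas2):

library(tidyverse)
library(bench)

# a larger toy data set: 100,000 rows split into 1,000 groups of 100
# (arbitrary sizes, only meant to move away from the tiny mtcars)
big <- tibble(
  g = rep(seq_len(1000), each = 100),
  x = rnorm(100000)
)

bench::mark(
  do = big %>% group_by(g) %>% do(head(., 2)),
  nest_unnest = big %>%
    group_by(g) %>%
    nest() %>%
    mutate(temp = map(data, ~ head(., 2))) %>%
    unnest(temp),
  filter = big %>% group_by(g) %>% filter(row_number() <= 2),
  check = FALSE  # the three results differ in column order and grouping
)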

2 Answers


slice() is the proper tool to use if you want the first two rows of each group. Both do() and nest() %>% mutate(map()) %>% unnest() are too heavy and use more memory:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    do(head(., 2))
)
#>    user  system elapsed 
#>   0.065   0.003   0.075

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(temp = map(data, ~head(., 2))) %>% 
    unnest(temp)
)
#>    user  system elapsed 
#>   0.024   0.000   0.024

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    slice(1:2)
)
#>    user  system elapsed 
#>   0.002   0.000   0.002

Created on 2018-10-23 by the reprex package (v0.2.1.9000)

See also the benchmark results in this tidyr issue.
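The memory part of the claim can be checked in the same way, since bench::mark() also reports allocations (a sketch, not part of the original answer; it assumes the bench package is installed):

library(bench)

res <- bench::mark(
  do          = mtcars %>% group_by(cyl) %>% do(head(., 2)),
  nest_unnest = mtcars %>%
    group_by(cyl) %>%
    nest() %>%
    mutate(temp = map(data, ~ head(., 2))) %>%
    unnest(temp),
  slice       = mtcars %>% group_by(cyl) %>% slice(1:2),
  check = FALSE  # the results differ in column order and grouping
)
# the mem_alloc column shows how much memory each approach allocates
res[, c("expression", "median", "mem_alloc")]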

Romain Francois

For this particular example, the slowness caused by the nest and unnest computations can be avoided by using group_modify():

system.time(
  mtcars %>% 
    group_by(cyl) %>% 
    group_modify(~head(., 2))
)
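Note that group_modify() calls the supplied function with the group's data as the first argument and a one-row data frame of the grouping keys as the second, and it expects that function to return a data frame, which is what makes it a natural replacement for do() in this case.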
danilinares