
Like most people, I'm impressed by Hadley Wickham and what he's done for R, so I figured I'd move some functions toward his tidyverse. Having done so, I'm left wondering what the point of it all is.

My new dplyr functions are much slower than their base equivalents; I hope I'm doing something wrong. I'd particularly like some payoff for the effort required to understand non-standard evaluation.

So, what am I doing wrong? Why is dplyr so slow?

An example:

require(microbenchmark)
require(dplyr)

df <- tibble(
             a = 1:10,
             b = c(1:5, 4:0),
             c = 10:1)

addSpread_base <- function() {
    df[['spread']] <- df[['a']] - df[['b']]
    df
}

addSpread_dplyr <- function() df %>% mutate(spread := a - b)

all.equal(addSpread_base(), addSpread_dplyr())

microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)

Timing results:

Unit: microseconds
              expr     min      lq      mean median      uq       max neval
  addSpread_base()  12.058  15.769  22.07805  24.58  26.435  2003.481 10000
 addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000

So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?

I figured that perhaps this was too easy a case, and that dplyr would really shine in a more realistic one where we add a column and subset the data, but this was worse. As you can see from the timings below, it is ~70x slower than the base approach.

# add a spread column and subset, with the column names passed as strings
addSpreadSub_base <- function(df, col1, col2) {
    df[['spread']] <- df[[col1]] - df[[col2]]
    df[, c(col1, col2, 'spread')]
}

addSpreadSub_dplyr <- function(df, col1, col2) {
    var1 <- as.name(col1)        # convert the column-name strings to symbols
    var2 <- as.name(col2)
    qq <- quo(!!var1 - !!var2)   # quote the expression col1 - col2
    df %>% 
        mutate(spread := !!qq) %>% 
        select(!!var1, !!var2, spread)
}

all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'), 
          addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))

microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'), 
               addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'), 
               times = 1e4)

Results:

Unit: microseconds
                                           expr      min       lq      mean   median       uq      max neval
  addSpreadSub_base(df, col1 = "a", col2 = "b")   22.725   30.610   44.3874   45.450   53.798  2024.35 10000
 addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
ricardo
  • Do you use data.table? For me it is much more useful and fast. Best! – LocoGris Jan 23 '19 at 10:07
  • A nice read: https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly. tl;dr: the tidyverse is made for clean code, not necessarily for faster code. – RLave Jan 23 '19 at 10:07
  • @RLave With a particular definition of "clean code". – Roland Jan 23 '19 at 10:12
  • @RLave would you call the NSE jargon clean? I would not. – ricardo Jan 23 '19 at 10:13
  • Yes Roland is right, it's more of a style, I guess it comes down to personal taste. – RLave Jan 23 '19 at 10:13
  • @ricardo Just compare the number of function calls between the two approaches. If you write low-level functions where you care about micro- to milliseconds, you should probably not use the tidyverse. – Roland Jan 23 '19 at 10:15
  • @ricardo I would not, but `data.table` style is not that much more immediate to read either. Again, it comes down to taste. If you care about performance, it's not the best solution. – RLave Jan 23 '19 at 10:17
  • Also, using something like `as.name` just to be able to use NSE in internal code is just ..., weird. – Roland Jan 23 '19 at 10:17
  • @ricardo A side-note: I'm surprised that you used `:=` in `mutate`, and that it worked. Isn't `=` the standard? – Henrik Jan 23 '19 at 10:21
  • @Henrik [have a look at this](https://stackoverflow.com/questions/32077483/colons-equals-operator-in-r-new-syntax) – Sotos Jan 23 '19 at 10:25
  • I recalled a similar post on `plyr`, managed to find it, and to my great pleasure I note that @ricardo posted that as well :), [Why is plyr so slow?](https://stackoverflow.com/questions/11533438/why-is-plyr-so-slow). Of course _a lot_ is different between `plyr` and `dplyr`, but many of the arguments provided for and against `plyr` (and `data.table`), in the answer and not least in the comments, are echoed here. – Henrik Jan 23 '19 at 11:00
  • `dplyr` can be faster, sometimes even *much* faster, but only with large datasets. – Rui Barradas Jan 23 '19 at 11:06
  • `dplyr` is still OK. Try to use `fill` from `tidyr` :-) – arg0naut91 Jan 23 '19 at 12:23

1 Answer


These are microseconds, and your dataset has 10 rows. Unless you plan on looping over millions of 10-row datasets, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).

Let's do it with a bigger dataset, say 1 million times bigger:

df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

# stack 1 million copies of the 10-row tibble to get 10 million rows
df2 <- bind_rows(replicate(1e6, df, simplify = FALSE))

addSpread_base <- function(df) {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr  <- function(df) df %>% mutate(spread = a - b)

microbenchmark::microbenchmark(
  addSpread_base(df2), 
  addSpread_dplyr(df2),
  times = 100)
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval cld
# addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507   100   a
# addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847   100   a

Still quite fast and not much difference.

As for the "why" of the result you got: you're calling a much more complex function, so it carries more overhead per call.
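One way to see that, echoing Roland's comment above about comparing the number of function calls, is to profile both versions and compare how many distinct functions appear in the sampled call stacks. Below is a rough sketch of that idea, reusing the small df and the two functions defined above; the counts are only a proxy and will vary with your machine and dplyr version.

prof_base  <- tempfile()
prof_dplyr <- tempfile()

# sample the call stacks while each function runs in a tight loop
Rprof(prof_base)
for (i in 1:50000) addSpread_base(df)    # each call is tens of microseconds
Rprof(NULL)

Rprof(prof_dplyr)
for (i in 1:10000) addSpread_dplyr(df)   # each call is hundreds of microseconds
Rprof(NULL)

# number of distinct functions seen on each path
nrow(summaryRprof(prof_base)$by.total)   # a handful
nrow(summaryRprof(prof_dplyr)$by.total)  # many more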

Commenters have pointed out that dplyr doesn't try too hard to be fast, and maybe that's true when you compare it to data.table; the interface is the first concern. But the authors have been working hard on speed as well. Hybrid evaluation, for example, allows (if I understand it correctly) C code to run directly on grouped data when aggregating with common functions, which can be much faster than base code. Still, simple code will always run faster with simple functions.
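For a sense of the kind of workload that hybrid-evaluation point is about, here is a minimal sketch of a grouped aggregation with a common function such as mean(), reusing the df2 built above (my own illustration; how dplyr compares to the base tapply() here will depend heavily on your dplyr version):

# grouped mean by b: base tapply() vs dplyr group_by() + summarise()
microbenchmark::microbenchmark(
  base  = tapply(df2$a, df2$b, mean),
  dplyr = df2 %>% group_by(b) %>% summarise(mean_a = mean(a)),
  times = 10)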

moodymudskipper
  • I understood that dplyr had solved most of the speed problems, hence my disappointment. Also, I think that the non-standard-evaluation situation is a bit of a mess. IMO there **has to** be some performance upside to repay the effort of getting up that curve. – ricardo Jan 23 '19 at 13:58
  • But there is. I found this: https://rpubs.com/hadley/dplyr-benchmarks. It's from 5 years ago, and `dplyr` was often much faster than base. There was a much more recent benchmark discussing hybrid evaluation, but I can't put my hands on it. – moodymudskipper Jan 23 '19 at 14:05
  • Well, I suppose I'm doing it wrong. I benchmarked the `addSpreadSub` functions and `base` is ~20% faster than `dplyr` in the 10-million-row case. So it's not just a question of scale. – ricardo Jan 23 '19 at 14:08
  • Interesting. I have no idea why such a big difference would remain, and I can't reproduce it. – moodymudskipper Jan 23 '19 at 14:10