Like most people, I'm impressed by Hadley Wickham and what he's done for R
-- so i figured that i'd move some functions toward his tidyverse
... having done so i'm left wondering what the point of it all is?
My new dplyr
functions are much slower than their base equivalents -- i hope i'm doing something wrong. I'd particularly like some payoff from the effort required to understand non-standard-evaluation
.
So, what am i doing wrong? Why is dplyr
so slow?
An example:
require(microbenchmark)
require(dplyr)
df <- tibble(
a = 1:10,
b = c(1:5, 4:0),
c = 10:1)
addSpread_base <- function() {
df[['spread']] <- df[['a']] - df[['b']]
df
}
addSpread_dplyr <- function() df %>% mutate(spread := a - b)
all.equal(addSpread_base(), addSpread_dplyr())
microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)
Timing results:
Unit: microseconds
expr min lq mean median uq max neval
addSpread_base() 12.058 15.769 22.07805 24.58 26.435 2003.481 10000
addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000
So using dplyr
functions to transform the data takes about 30x longer -- surely this isn't the intention?
I figured that perhaps this is too easy a case -- and that dplyr
would really shine if we had a more realistic case where we are adding a column and sub-setting the data -- but this was worse. As you can see from the timings below, this is ~70x slower than the base approach.
# mutate and substitute
addSpreadSub_base <- function(df, col1, col2) {
df[['spread']] <- df[['a']] - df[['b']]
df[, c(col1, col2, 'spread')]
}
addSpreadSub_dplyr <- function(df, col1, col2) {
var1 <- as.name(col1)
var2 <- as.name(col2)
qq <- quo(!!var1 - !!var2)
df %>%
mutate(spread := !!qq) %>%
select(!!var1, !!var2, spread)
}
all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))
microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'),
times = 1e4)
Results:
Unit: microseconds
expr min lq mean median uq max neval
addSpreadSub_base(df, col1 = "a", col2 = "b") 22.725 30.610 44.3874 45.450 53.798 2024.35 10000
addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000