
I am using dplyr to create three new variables on my data frame. The data frame is 84,253 obs. of 164 variables. Below is my code.

library(dplyr)
library(lubridate)

ptm <- proc.time()
D04_Base2 <- D04_Base %>% 
    mutate(
        birthyr = year(as.Date(BIRTHDT, "%m/%d/%Y")),
        age = (snapshotDt - as.Date(BIRTHDT, "%m/%d/%Y")) / 365.25,
        age = ifelse(age > 100, NA, age)
    )
proc.time() - ptm
#    user  system elapsed 
#   12.34    0.03   12.42 

However, I am wondering whether there is an issue with my code, since it is taking much longer than I expected to run, or whether something else is going on. As shown above, the code takes about 12 seconds to complete.

  • would be faster if you made BIRTHDT into a Date just once – Richard Telford Feb 28 '17 at 22:12
  • Also, I tested creating the variables using Base R and the results were much faster. After all my research it seemed dplyr should be faster than Base and hence the question. – roarkz Feb 28 '17 at 22:12
  • What was the equivalent base R code that you compared this against? You really should include some sort of [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can run it ourselves to compare. – MrFlick Feb 28 '17 at 22:14
  • Yeah, you should just store dates as Date from the beginning. No point doing any analysis with a date-as-string. Also, `system.time({commands})` is a more reliable way to do timings, I've been told. – Frank Feb 28 '17 at 22:17
  • Adding a column is a very simple task, and one base R does quite well at. It is a little dated now, but [this blog post](http://datascience.la/dplyr-and-a-very-basic-benchmark/) suggests that for simply adding a new column `dplyr` is comparable with `base`. `dplyr` will do better than base especially in joins and grouped operations. – Gregor Thomas Feb 28 '17 at 22:18
  • Did you try the equivalent package `data.table`? Check this [tutorial](https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial). Some benchmarks show it performs better than the `dplyr` package (see this [post](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) for a discussion). Also consider using `setDT` from the same package; it changes the input by reference (no copy), which would give you much better performance. Try it and let us know. – David Leal Feb 28 '17 at 22:49 (a sketch of this approach follows below)
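
For reference, a rough sketch of the data.table route suggested in the last comment might look like this (it assumes the D04_Base, BIRTHDT, and snapshotDt names from the question, and it is untested here):

library(data.table)
library(lubridate)

setDT(D04_Base)                        # convert to data.table by reference, no copy
D04_Base[, birthdate := mdy(BIRTHDT)]  # parse the date string once
D04_Base[, `:=`(
    birthyr = year(birthdate),
    age = as.numeric(snapshotDt - birthdate) / 365.25
)]
D04_Base[age > 100, age := NA_real_]   # blank out implausible ages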

1 Answer


Yes, there are some inefficiencies in your code:

  1. You convert the BIRTHDT column to Date twice. (This is by far the biggest issue.)
  2. base::as.Date isn't especially fast at parsing character dates.
  3. You can use dplyr::if_else instead of base::ifelse for a little bit of performance gain.

Let's do some tests:

library(microbenchmark)
library(dplyr)
library(lubridate)

mbm = microbenchmark::microbenchmark

# generate big-ish sample data
n = 1e5
dates = seq.Date(from = Sys.Date(), length.out = n, by = "day")
raw_dates = format(dates, "%m/%d/%Y")
df = data.frame(x = 1:n)

Date conversion

mbm(
    mdy = mdy(raw_dates),
    base = as.Date(raw_dates, format = "%m/%d/%Y")
)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval cld
#   mdy 21.39190 27.97036 37.35768 29.50610 31.44242 197.2258   100  a 
#  base 86.75255 92.30122 99.34004 96.78687 99.90462 262.6260   100   b

Looks like lubridate::mdy is 2-3x faster than as.Date at this particular date conversion.
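
As a further aside (my addition, not benchmarked above), lubridate also has a lower-level parser, fast_strptime, which may be faster still when the format is fixed; it returns POSIXlt, so wrap it in as.Date if you need a Date:

# sketch: parse with lubridate's fixed-format C-level parser
fast_dates = as.Date(lubridate::fast_strptime(raw_dates, format = "%m/%d/%Y"))
all.equal(fast_dates, as.Date(raw_dates, format = "%m/%d/%Y"))  # expect TRUE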

Extracting year

mbm(
    year = year(dates),
    format = format(dates, "%Y")
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
#    year 29.10152 31.71873 44.84572 33.48525 40.17116 478.8377   100  a 
#  format 77.16788 81.14211 96.42225 83.54550 88.11994 242.7808   100   b

Similarly, lubridate::year (which you already seem to be using) is about 2x faster than base::format for extracting the year.

Adding a column:

mbm(
    base_dollar = {dd = df; dd$y = 1},
    base_bracket = {dd = df; dd[["y"]] = 1},
    mutate = {dd = mutate(df, y = 1)},
    mutate_pipe = {dd = df %>% mutate(y = 1)},
    times = 100L
)
# Unit: microseconds
#          expr     min       lq     mean   median       uq      max neval cld
#   base_dollar 114.834 129.1715 372.8024 146.2275 408.4255 3315.964   100 a  
#  base_bracket 118.585 139.6550 332.1661 156.3530 255.2860 3126.967   100 a  
#        mutate 420.515 466.8320 673.9109 554.4960 745.7175 2821.070   100  b 
#   mutate_pipe 522.402 600.6325 852.2037 715.1110 906.4700 3319.950   100   c

Here we see base do very well. But also notice that these times are in microseconds whereas the above times for the date stuff were in milliseconds. Whether you use base or dplyr to add a column, it's about 1% of the time used to do the date conversions.

ifelse

x = rnorm(1e5)
mbm(
    base_na = ifelse(x > 0, NA, x),
    base_na_real = ifelse(x > 0, NA_real_, x),
    base_replace = replace(x, x > 0, NA_real_),
    dplyr = if_else(x > 0, NA_real_, x),
    units = "ms"
)
# Unit: milliseconds
#          expr      min        lq      mean    median        uq       max neval cld
#       base_na 9.399593 13.399255 18.502441 14.734466 15.998573 138.33834   100  bc
#  base_na_real 8.785988 12.638971 22.885304 14.075802 16.980263 132.18165   100   c
#  base_replace 0.748265  1.136756  2.292686  1.384161  1.802833   9.05869   100 a  
#         dplyr 5.141753  6.875031 14.157227 10.095069 11.561044 124.99218   100  b 

Here the timing is still in milliseconds, but the difference between ifelse and dplyr::if_else isn't so extreme. dplyr::if_else requires that the return vectors are the same type, so we have to specify the NA_real_ for it to work with the numeric output. At Frank's suggestion I threw in base::replace with NA_real_ too, and it is about 10x faster. The lesson here, I think, is "use the simplest function that works".
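
To make the type requirement concrete, here is a small illustration (the commented-out line is what fails on older dplyr versions; exact behaviour depends on your dplyr version):

x = rnorm(5)
# dplyr::if_else(x > 0, NA, x)       # older dplyr errors here: NA is logical, x is double
dplyr::if_else(x > 0, NA_real_, x)   # both branches are double, so this works
replace(x, x > 0, NA_real_)          # same result from a simpler base function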


In summary, dplyr is slower than base at adding a column, but both are super fast compared to everything else that's going on. So it doesn't much matter which column-adding method you use. You can speed up your code by not repeating calculations and by using faster versions of bigger operations. Using what we learned, a more efficient version of your code would be:

library(dplyr)
library(lubridate)
D04_Base2 <- D04_Base %>% 
    mutate(
        birthdate = mdy(BIRTHDT),
        birthyr = year(birthdate),
        age = (snapshotDt - birthdate) / 365.25,
        age = replace(age, age > 100, NA_real_)
    )
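
One small caveat (my note, not part of the original answer): snapshotDt - birthdate is a difftime, and replace keeps that class, whereas the original ifelse would have dropped it. If you want age as a plain numeric, wrap the subtraction in as.numeric:

D04_Base2 <- D04_Base %>% 
    mutate(
        birthdate = mdy(BIRTHDT),
        birthyr = year(birthdate),
        age = as.numeric(snapshotDt - birthdate) / 365.25,  # as.numeric drops the difftime class
        age = replace(age, age > 100, NA_real_)
    )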

We can ballpark the speed gain on 1e5 rows at about 180 milliseconds as broken out below.

  • 170 ms (single lubridate::mdy at 30 ms instead of two as.Date calls at 100 ms each)
  • 10 ms (replace rather than ifelse)

The adding a column benchmark suggests that we could save about 0.1 ms by not using the pipe. Since we are adding multiple columns, it's probably more efficient to use dplyr than to add them individually with $<-, but for a single column we could save about 0.5 ms by not using dplyr. Since we've already sped up by 180-ish ms, the potential fraction of a millisecond gained by not using mutate is a rounding error, not an efficiency boost.

In this case, the most complicated thing you're doing is the Date conversion, but even this is likely not your bottleneck if you're doing more processing. To optimize your code you should see which pieces are slow, and work on the slow bits. This is called profiling. In this answer I used microbenchmark to compare competing short methods head-to-head, but other tools (like the lineprof package) are better for identifying the slowest parts of a block of code.
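
For completeness, a minimal sketch of profiling with base R's Rprof (the file name and the profiled block below are placeholders):

Rprof("profile.out")                 # start the sampling profiler
D04_Base2 <- D04_Base %>%            # the code you want to profile goes here
    mutate(birthdate = mdy(BIRTHDT))
Rprof(NULL)                          # stop profiling
summaryRprof("profile.out")$by.self  # time spent in each function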

Gregor Thomas
  • Regarding their ifelse, I was thinking `replace(age, age > 100, NA_real_)` looked like a better choice. Dunno how that fares in benchmarks, though. – Frank Feb 28 '17 at 23:15
  • Good thought - I put it in. Turns out it doesn't matter: median is lower but mean is higher in my 100 trials. – Gregor Thomas Feb 28 '17 at 23:24
  • @Frank better thought - I had not read your comment well. *Now* I went back and used `replace` and it is 10x faster than `if_else`. – Gregor Thomas Feb 28 '17 at 23:57