Yes, there are some inefficiencies in your code:
- You convert the `BIRTHDT` column to Date twice. (This is by far the biggest issue; see the sketch just below this list for the kind of pattern I mean.)
- `base::as.Date` isn't super fast.
- You can use `dplyr::if_else` instead of `base::ifelse` for a little bit of performance gain.
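To make sure we're talking about the same thing, here is roughly the pattern I mean (a guessed reconstruction, not your exact code; I'm assuming `BIRTHDT` arrives as "%m/%d/%Y" strings, `snapshotDt` is already a Date, and dplyr/lubridate are loaded, as your question suggests):

```r
# a guessed reconstruction of the inefficient pattern, for illustration only
D04_Base2 <- D04_Base %>%
  mutate(
    birthyr = year(as.Date(BIRTHDT, format = "%m/%d/%Y")),          # Date conversion #1
    age = as.numeric(snapshotDt -
                       as.Date(BIRTHDT, format = "%m/%d/%Y")) / 365.25,  # Date conversion #2
    age = ifelse(age > 100, NA, age)
  )
```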
Let's do some tests:
```r
library(microbenchmark)
library(dplyr)
library(lubridate)

mbm = microbenchmark::microbenchmark

# generate big-ish sample data
n = 1e5
dates = seq.Date(from = Sys.Date(), length.out = n, by = "day")
raw_dates = format(dates, "%m/%d/%Y")
df = data.frame(x = 1:n)
```
Date conversion
```r
mbm(
  mdy = mdy(raw_dates),
  base = as.Date(raw_dates, format = "%m/%d/%Y")
)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval cld
#   mdy 21.39190 27.97036 37.35768 29.50610 31.44242 197.2258   100  a
#  base 86.75255 92.30122 99.34004 96.78687 99.90462 262.6260   100   b
```
Looks like `lubridate::mdy` is 2-3x faster than `as.Date` at this particular date conversion.
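As a quick check (mine, not part of the benchmark above), the two parsers agree on these strings, so swapping one for the other is safe:

```r
# both parsers should produce the same Dates here -- expect TRUE
all(mdy(raw_dates) == as.Date(raw_dates, format = "%m/%d/%Y"))
```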
Extracting year
```r
mbm(
  year = year(dates),
  format = format(dates, "%Y")
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
#    year 29.10152 31.71873 44.84572 33.48525 40.17116 478.8377   100  a
#  format 77.16788 81.14211 96.42225 83.54550 88.11994 242.7808   100   b
```
Similarly, `lubridate::year` (which you already seem to be using) is about 2x faster than `base::format` for extracting the year.
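One caveat worth noting (my addition, not a timing issue): the two aren't always interchangeable, because they return different types.

```r
# year() returns numbers, format() returns character strings,
# so a format() result needs as.integer() before any arithmetic
class(year(dates))           # "numeric"
class(format(dates, "%Y"))   # "character"
```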
Adding a column:
```r
mbm(
  base_dollar = {dd = df; dd$y = 1},
  base_bracket = {dd = df; dd[["y"]] = 1},
  mutate = {dd = mutate(df, y = 1)},
  mutate_pipe = {dd = df %>% mutate(y = 1)},
  times = 100L
)
# Unit: microseconds
#          expr     min       lq     mean   median       uq      max neval cld
#   base_dollar 114.834 129.1715 372.8024 146.2275 408.4255 3315.964   100 a
#  base_bracket 118.585 139.6550 332.1661 156.3530 255.2860 3126.967   100 a
#        mutate 420.515 466.8320 673.9109 554.4960 745.7175 2821.070   100  b
#   mutate_pipe 522.402 600.6325 852.2037 715.1110 906.4700 3319.950   100   c
```
Here we see base do very well. But notice that these times are in microseconds, whereas the date timings above were in milliseconds. Whether you use base or `dplyr` to add a column, it takes roughly 1% of the time spent on the date conversions.
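Just to be explicit that the choice is purely about speed and style, here is a quick check (added for illustration, not part of the benchmark) that the approaches build the same data frame:

```r
# base $<- and dplyr::mutate() produce equivalent results -- expect TRUE
dd_base = df
dd_base$y = 1
dd_dplyr = mutate(df, y = 1)
all.equal(dd_base, dd_dplyr)
```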
ifelse
```r
x = rnorm(1e5)

mbm(
  base_na = ifelse(x > 0, NA, x),
  base_na_real = ifelse(x > 0, NA_real_, x),
  base_replace = replace(x, x > 0, NA_real_),
  dplyr = if_else(x > 0, NA_real_, x),
  unit = "ms"
)
# Unit: milliseconds
#          expr      min        lq      mean    median        uq       max neval cld
#       base_na 9.399593 13.399255 18.502441 14.734466 15.998573 138.33834   100  bc
#  base_na_real 8.785988 12.638971 22.885304 14.075802 16.980263 132.18165   100   c
#  base_replace 0.748265  1.136756  2.292686  1.384161  1.802833   9.05869   100 a
#         dplyr 5.141753  6.875031 14.157227 10.095069 11.561044 124.99218   100  b
```
Here the timing is still in milliseconds, but the difference between `ifelse` and `dplyr::if_else` isn't so extreme. `dplyr::if_else` requires that the return vectors are the same type, so we have to specify `NA_real_` for it to work with the numeric output. At Frank's suggestion I threw in `base::replace` with `NA_real_` too, and it is about 10x faster. The lesson here, I think, is "use the simplest function that works".
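And to confirm the cheaper functions really do the same job here (a check I'm adding for illustration, not part of the benchmark):

```r
# for this task all three give the same numeric vector -- expect TRUE twice
res_ifelse  = ifelse(x > 0, NA, x)
res_replace = replace(x, x > 0, NA_real_)
res_if_else = if_else(x > 0, NA_real_, x)
all.equal(res_ifelse, res_replace)
all.equal(res_replace, res_if_else)
```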
In summary, `dplyr` is slower than `base` at adding a column, but both are super fast compared to everything else that's going on, so it doesn't much matter which column-adding method you use. You can speed up your code by not repeating calculations and by using faster versions of the bigger operations. Using what we learned, a more efficient version of your code would be:
```r
library(dplyr)
library(lubridate)

D04_Base2 <- D04_Base %>%
  mutate(
    birthdate = mdy(BIRTHDT),
    birthyr = year(birthdate),
    age = as.numeric(snapshotDt - birthdate) / 365.25,  # days converted to years
    age = replace(age, age > 100, NA_real_)
  )
```
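If you want to sanity-check the rewrite on toy data first, something like this works (the inputs below are hypothetical; I'm assuming `BIRTHDT` is "%m/%d/%Y" text and `snapshotDt` is a Date, as in your question):

```r
# hypothetical toy inputs, just to show the pipeline end to end
toy = data.frame(BIRTHDT = c("07/04/1976", "01/15/1890"),
                 stringsAsFactors = FALSE)
snapshotDt = as.Date("2016-06-30")

toy %>%
  mutate(
    birthdate = mdy(BIRTHDT),
    birthyr = year(birthdate),
    age = as.numeric(snapshotDt - birthdate) / 365.25,
    age = replace(age, age > 100, NA_real_)
  )
# the 1890 birth date is more than 100 years before the snapshot, so its age comes back NA
```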
We can ballpark the speed gain on 1e5 rows at about 180 milliseconds, broken out below.

- 170 ms (a single `lubridate::mdy` at 30 ms instead of two `as.Date` calls at 100 ms each)
- 10 ms (`replace` rather than `ifelse`)
The adding-a-column benchmark suggests that we could save about 0.1 ms by not using the pipe. Since we are adding multiple columns, it's probably more efficient to use `dplyr` than to add them individually with `$<-`, but for a single column we could save about 0.5 ms by not using `dplyr`. Since we've already sped up by 180-ish ms, the potential fraction of a millisecond gained by not using `mutate` is a rounding error, not an efficiency boost.
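If you want to test that multi-column claim on your own data, here is a sketch in the same style as the benchmarks above (no timings claimed here):

```r
# one mutate() call adding several columns vs. repeated $<- assignments
mbm(
  base_multi   = {dd = df; dd$y = 1; dd$z = 2; dd$w = 3},
  mutate_multi = {dd = mutate(df, y = 1, z = 2, w = 3)},
  times = 100L
)
```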
In this case, the most complicated thing you're doing is the `Date` conversion, but even this is likely not your bottleneck if you're doing more processing. To optimize your code you should see which pieces are slow and work on the slow bits. This is called profiling. In this answer I used `microbenchmark` to compare competing short methods head-to-head, but other tools (like the `lineprof` package) are better for identifying the slowest parts of a block of code.
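If you'd rather not install anything, base R's sampling profiler works for a first pass; a minimal sketch (put your real, longer-running code where the placeholder loop is):

```r
# minimal profiling sketch using base R's Rprof()
Rprof("profile.out")        # start collecting samples
for (i in 1:20) {           # placeholder work -- profile your real code here
  d = mdy(raw_dates)
  y = year(d)
}
Rprof(NULL)                 # stop profiling
summaryRprof("profile.out")$by.self  # time spent, by function
```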