OP's question Why are my functions on lubridate dates so slow? and some generalizing statements like Lubridate is just kind of slow in my experience suggest that a particular package might be the cause for low performance.
I want to verify this with some benchmarks.
Penalty of using the double colon operator ::
Frank mentioned in his comment that there is a penalty in using the double colon operator ::
to access exported variables or functions in a namespace.
# creating data
n <- 10^1L
fmt <- "%F"
chr_dates <- format(Sys.Date() + seq_len(n), "%F")
# loading lubridate into namespace
library(lubridate)
microbenchmark::microbenchmark(
base1 = r1 <- as.Date(chr_dates),
base2 = r2 <- base::as.Date(chr_dates),
lubr1 = r3 <- as_date(chr_dates),
lubr2 = r4 <- lubridate::as_date(chr_dates),
times = 100L
)
Unit: microseconds
expr min lq mean median uq max neval cld
base1 87.977 89.1100 92.03587 89.865 90.9980 128.756 100 a
base2 94.018 95.7175 100.64848 97.039 99.3045 179.351 100 b
lubr1 92.508 94.2070 98.21307 95.151 97.7940 175.954 100 b
lubr2 101.569 103.0800 109.98974 104.024 107.9885 258.643 100 c
The penalty for using the double colon operator ::
is about 10 microseconds.
This only matters if a function is called repeatedly (as it happens in OP's code using sapply()
). IMHO, the pain of debugging namespace conflicts or maintaining code where the origin of functions is unclear is much higher. Your mileage may vary, of course.
The timings can be verified for n = 100
,
Unit: microseconds
expr min lq mean median uq max neval cld
base1 556.933 561.0855 580.3382 562.9730 590.7250 812.176 100 a
base2 564.483 568.2600 588.5695 570.9030 596.2010 989.262 100 a
lubr1 562.596 565.9935 587.4443 568.4480 594.8790 1039.480 100 a
lubr2 572.036 575.9995 597.1557 578.4545 601.1085 1230.159 100 a
Converting character dates to class Date
There is a number of packages which deal with the conversion of character dates given in different formats to class Date
or POSIXct
. Some of them aim at performance, others at convenience.
Here, base
, lubridate
, anytime
, fasttime
, and data.table
(because it was mentioned in one of the answers) are compared.
Input are character dates in the standard unambiguous format YYYY-MM-DD
. Time zones are ignored.
fasttime
accepts only dates between 1970 and 2199, so the creation of sample data had to be modified in order to create a sample data set of 100 K dates.
n <- 10^5L
fmt <- "%F"
set.seed(123L)
chr_dates <- format(
sample(
seq(as.Date("1970-01-01"), as.Date("2199-12-31"), by = 1L),
n, replace = TRUE),
"%F")
Because Frank had suspected that guessing formats could add a penalty, the functions are called with and without given format where possible. All functions are called using the double colon operator ::
.
microbenchmark::microbenchmark(
base_ = r1 <- base::as.Date(chr_dates),
basef = r1 <- base::as.Date(chr_dates, fmt),
lub1_ = r2 <- lubridate::as_date(chr_dates),
lub1f = r2 <- lubridate::as_date(chr_dates, fmt),
lub2_ = r3 <- lubridate::ymd(chr_dates),
anyt_ = r4 <- anytime::anydate(chr_dates),
idat_ = r5 <- data.table::as.IDate(chr_dates),
idatf = r5 <- data.table::as.IDate(chr_dates, fmt),
fast_ = r6 <- fasttime::fastPOSIXct(chr_dates),
fastd = r6 <- as.Date(fasttime::fastPOSIXct(chr_dates)),
times = 5L
)
# check results
all.equal(r1, r2)
all.equal(r1, r3)
all.equal(r1, c(r4)) # remove tzone attribute
all.equal(r1, as.Date(r5)) # convert IDate to Date
all.equal(r1, as.Date(r6)) # convert POSIXct to Date
Unit: milliseconds
expr min lq mean median uq max neval cld
base_ 641.799082 645.008517 648.128466 648.791875 649.149444 655.893411 5 d
basef 69.377419 69.937371 73.888828 71.403139 76.022083 82.704127 5 b
lub1_ 644.199361 645.217696 680.542327 649.855896 652.887492 810.551189 5 d
lub1f 69.769726 69.947943 70.944605 70.795234 71.365759 72.844364 5 b
lub2_ 18.672495 27.025711 26.990218 28.180730 29.944409 31.127747 5 ab
anyt_ 381.870316 384.513758 386.211134 384.992152 385.159043 394.520400 5 c
idat_ 643.386808 644.312259 649.385356 648.204359 651.666396 659.356958 5 d
idatf 69.844109 71.188673 75.319481 77.142365 78.156923 80.265334 5 b
fast_ 4.994637 5.363533 5.748137 5.601031 5.760370 7.021112 5 a
fastd 5.230625 6.296157 6.686500 6.345998 6.538941 9.020780 5 a
The timings show that
- Frank's suspicion is correct. Guessing formats is costly. Passing the format as parameter to
as.Date()
, as_date()
, and as.IDate()
is ten times faster than calling without.
fasttime::fastPOSIXct()
is the fastest, indeed. Even with the additional conversion from POSIXct
to Date
it is four times faster than the second fastest lubridate::ymd()
.