Consider this simple example:
library(lubridate)
library(dplyr)
library(purrr)          # for map_dfr() below
library(microbenchmark) # for the timings below

df1 <- tibble(timestamp = c(ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123')))

df2 <- tibble(timestamp = c(ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'))) %>%
  mutate(timestamp = as.numeric(timestamp))
As you can see, the only difference between df1 and df2 is the representation of the timestamps: df1 stores them as POSIXct date-times, while df2 stores them as plain numerics.
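A quick check (just a sketch) confirms the two columns differ only in class, not in value:

class(df1$timestamp)                             # "POSIXct" "POSIXt"
class(df2$timestamp)                             # "numeric"
all(as.numeric(df1$timestamp) == df2$timestamp)  # TRUE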
Now take a look at the surprising difference in timings.
# First, let's make them bigger; 400k rows is enough
df1 <- map_dfr(1:100000, ~ df1)
df2 <- map_dfr(1:100000, ~ df2)
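As a sanity check, each original frame has 4 rows replicated 100,000 times:

nrow(df1)  # 400000
nrow(df2)  # 400000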
Now, a simple computation:
> microbenchmark(
+ df2 %>% mutate(diff = timestamp - min(timestamp)),
+ times = 1000)
Unit: milliseconds
                                               expr      min       lq     mean   median       uq     max neval
 df2 %>% mutate(diff = timestamp - min(timestamp)) 1.541533 2.182028 3.961685 2.327694 2.567314 290.823  1000
while
> microbenchmark(
+ df1 %>% mutate(diff = timestamp - min(timestamp)),
+ times = 1000)
Unit: milliseconds
                                               expr      min       lq    mean   median       uq      max neval
 df1 %>% mutate(diff = timestamp - min(timestamp)) 4.111016 8.182359 13.1351 8.513956 9.065631 378.1961  1000
Boom! More than three times slower (comparing medians: ~8.5 ms vs ~2.3 ms). Why is that? Thanks!
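In case it helps narrow things down, here is a minimal sketch (my assumption: the overhead lives in the POSIXct arithmetic itself, not in dplyr) that benchmarks the subtraction on the bare vectors:

x_posix <- df1$timestamp  # POSIXct column
x_num   <- df2$timestamp  # numeric column
microbenchmark(
  posix   = x_posix - min(x_posix),  # dispatches to the S3 method `-.POSIXt` and builds a difftime
  numeric = x_num - min(x_num),      # plain double subtraction
  times = 1000)

If the same gap shows up here, it would suggest the difference comes from S3 dispatch and difftime construction rather than anything dplyr does, but I have not confirmed that.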