
Consider this simple example:

library(lubridate)
library(dplyr)
library(purrr)          # for map_dfr()
library(microbenchmark) # for microbenchmark()

df1 <- tibble(timestamp = c(ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123')))


df2 <- tibble(timestamp = c(ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'),
                            ymd_hms('2019-01-01 10:00.00.123'))) %>% 
  mutate(timestamp = as.numeric(timestamp))

As you can see, the only difference between df1 and df2 is the representation of the timestamps.
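To make that concrete (this check is not in the original post): in df1 the column is stored as POSIXct, an S3 class, while in df2 it is a plain numeric vector.

class(df1$timestamp)
#> [1] "POSIXct" "POSIXt"
class(df2$timestamp)
#> [1] "numeric"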

Now take a look at the crazy difference in timings:

# first let's make them bigger: 100,000 copies of 4 rows = 400k rows
df1 <- map_dfr(seq_len(100000), ~ df1)
df2 <- map_dfr(seq_len(100000), ~ df2)
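A quick sanity check (added for clarity, not in the original post) that both frames really end up with 400,000 rows:

nrow(df1)
#> [1] 400000
nrow(df2)
#> [1] 400000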

Now a simple computation:

> microbenchmark(
+   df2 %>% mutate(diff = timestamp - min(timestamp)),
+   times = 1000)
Unit: milliseconds
                                               expr      min       lq     mean   median       uq     max neval
 df2 %>% mutate(diff = timestamp - min(timestamp)) 1.541533 2.182028 3.961685 2.327694 2.567314 290.823  1000

while

> microbenchmark(
+   df1 %>% mutate(diff = timestamp - min(timestamp)),
+   times = 1000)
Unit: milliseconds
                                               expr      min       lq    mean   median       uq      max neval
 df1 %>% mutate(diff = timestamp - min(timestamp)) 4.111016 8.182359 13.1351 8.513956 9.065631 378.1961  1000

Boom! More than three times slower (comparing the medians: 8.51 ms vs 2.33 ms). Why is that? Thanks!

  • You might be interested in a recent [benchmark](https://stackoverflow.com/a/56755001/6574038) I did with time conversion. Try including a base R solution in your benchmarking. – jay.sf Jun 30 '19 at 14:28
  • Possible duplicate of [How to fast convert different time formats in large data frames?](https://stackoverflow.com/questions/56753909/how-to-fast-convert-different-time-formats-in-large-data-frames) – CPak Jun 30 '19 at 14:52
  • ahh guys, stop with those duplicate flags that are not duplicates... this is about understanding where the overhead is – ℕʘʘḆḽḘ Jun 30 '19 at 14:53
  • I don't understand why this surprises you; even without looking into details, I'm not at all surprised that pure arithmetic with numeric values outperforms arithmetic with S3 objects. I suggest you do some profiling. Doing that will answer your question (and profiling is a valuable skill if you are interested in performance). – Roland Jun 30 '19 at 15:03
  • Btw., your question would be much better if you only used base R. All this tidyverse stuff is only obfuscating to me. – Roland Jun 30 '19 at 15:05
  • thanks @Roland. Never done any profiling. Could you please do it on the small example and post as an answer? – ℕʘʘḆḽḘ Jun 30 '19 at 15:06
  • Sorry, I'm currently not in front of a PC. Maybe start with reading this: https://support.rstudio.com/hc/en-us/articles/218221837-Profiling-with-RStudio?mobile_site=true – Roland Jun 30 '19 at 15:15
  • Following up on [Roland's comment](https://stackoverflow.com/questions/56825566/why-is-processing-timestamps-so-slow#comment100203990_56825566), a cleaner `base` example could be e.g. `x <- Sys.time() + 1:1e8`; `x2 <- as.numeric(x)`. And then do the profiling (just on `min` would be enough to reveal the S3 method dispatch): `profvis({min(x)})`; `profvis({min(x2)})`. Check the functions called in the 'Data' tab. – Henrik Jun 30 '19 at 16:26
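
For reference, a minimal, self-contained sketch of the profiling Henrik suggests (it assumes the profvis package is installed; the vector size follows his comment and needs roughly 0.8 GB of memory). The exact numbers will vary, but the profile for the POSIXct vector shows extra time spent in S3 method dispatch that the plain numeric vector avoids:

library(profvis)

x  <- Sys.time() + 1:1e8   # POSIXct (datetime) vector, roughly 0.8 GB
x2 <- as.numeric(x)        # the same values as plain doubles

profvis({ min(x) })        # min() goes through S3 dispatch to the datetime method
profvis({ min(x2) })       # min() runs directly on the numeric vector

Check the 'Data' tab of each profile to see which functions are called along the way.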

0 Answers