
I would like to convert about 100 million dates in string form (e.g. "09/20/2019") to Date objects in R.

Doing

a = "09/20/2019"       
a <- as.Date(a, "%m/%d/%Y")

seems to be somewhat slow. Is there a faster way to achieve this conversion?

(I tried the fasttime package, but it expects components in year, month, day order, so the original strings need some manipulation first; one option is sketched below.)
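
One such manipulation is a fixed-position substring reorder (a minimal sketch; it assumes every string is zero-padded MM/DD/YYYY, so character positions are fixed):

library(fasttime)
a <- "09/20/2019"
# rebuild as "2019-09-20" using fixed character positions, then parse
iso <- paste0(substr(a, 7, 10), "-", substr(a, 1, 2), "-", substr(a, 4, 5))
fastPOSIXct(iso, tz = "GMT")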

ant

3 Answers

Using a lookup table, approach (1) below is 38x faster than as.Date for this example data. Approach (2) is still 18x faster for this example data and is more convenient if you don't have a good way of generating the lookup vector d. The particulars of your data will, of course, affect the speedup actually achieved.

# test data
set.seed(123)
d <- format(as.Date("2000-01-01") + 1:100, "%m/%d/%Y")
dd <- sample(d, 1000000, replace = TRUE)

# 1 - lookup table
system.time(as.Date(d, "%m/%d/%Y")[match(dd, d)])
##    user  system elapsed 
##    0.06    0.02    0.07 

# 2 - lookup table, generating the unique values u on the fly instead of d
system.time({u <- unique(dd); as.Date(u, "%m/%d/%Y")[match(dd, u)]})
##    user  system elapsed 
##    0.13    0.03    0.15 

# 3 - as.Date
system.time(as.Date(dd, "%m/%d/%Y"))
##    user  system elapsed 
##    2.61    0.07    2.67 
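
Approach (2) also packages naturally into a reusable helper (a minimal sketch; the name fast_mdy is just illustrative):

# parse each distinct string once, then index the parsed values back out
fast_mdy <- function(x) {
  u <- unique(x)
  as.Date(u, "%m/%d/%Y")[match(x, u)]
}
head(fast_mdy(dd))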
G. Grothendieck

Using lubridate's mdy(), which parses month/day/year strings directly:

library(lubridate)
a = "09/20/2019"
mdy(a)
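
mdy is vectorized, so (memory permitting) the very same call handles the full vector of strings at once:

dates <- c("09/20/2019", "12/31/1999", "01/01/2020")
mdy(dates)
## [1] "2019-09-20" "1999-12-31" "2020-01-01"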
deepseefan
  • @joran it should be fast enough; I tested `mdy` on 1 million random dates, and it only took 0.22 seconds. – yusuzech Sep 20 '19 at 16:41
  • @joran it takes like 6 layers to get there, but I think ultimately it's using `lubridate:::C_parse_dt`, which could be faster than `as.Date`, which uses the general-purpose `strptime`. I don't know enough about the C side to comment more fundamentally, though. – MichaelChirico Sep 20 '19 at 16:58

Since you mentioned fasttime, I got curious and tested it against a few other common options. It requires dates in year-month-day order, but you can rearrange the components with a regex.

I benchmarked on a smaller, but still cumbersome, set of dates. I'm on a year-old MacBook Pro with a ton of other stuff running, and 100 trials on 1 million dates with all 3 methods still finished before I was done eating a sandwich.

set.seed(9)
days <- sample(1:30, 1e6, replace = TRUE)
date_str <- sprintf("09/%02d/2019", days)

# as.Date(date_str, format = "%m/%d/%Y")
# lubridate::mdy(date_str)
# fasttime::fastPOSIXct(gsub("^(\\d{2})/(\\d{2})/(\\d{4})", "\\3-\\1-\\2", date_str))

bench <- microbenchmark::microbenchmark(
  base = as.Date(date_str, format = "%m/%d/%Y"),
  lubr = lubridate::mdy(date_str),
  fast = fasttime::fastPOSIXct(gsub("^(\\d{2})/(\\d{2})/(\\d{4})", "\\3-\\1-\\2", date_str))
)

bench
#> Unit: nanoseconds
#>  expr min lq mean median uq max neval cld
#>  base   3  5 7.02      5  6 180   100   a
#>  lubr   4  5 6.91      6  6 148   100   a
#>  fast   4  5 8.77      5  6 332   100   a

Based on the mean and the lowest maximum, lubridate::mdy runs fastest without requiring any reformatting or a format string. Based on the median, base as.Date runs fastest but requires specifying a format string (not a big deal), as does fasttime with the regex stipulation. Make of that what you will.

I'd also note that fasttime returns POSIXct, so since there isn't a time element in the strings, it tacks one on; removing it might then become another time-consuming step.
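
If plain Date objects are needed, converting in the same GMT zone that fasttime parses in would be one way back (a sketch reusing the fast call from above):

x <- fasttime::fastPOSIXct(gsub("^(\\d{2})/(\\d{2})/(\\d{4})", "\\3-\\1-\\2", date_str))
# drop the tacked-on time element; GMT matches fasttime's parsing zone
head(as.Date(x, tz = "GMT"))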

camille