
I have a large vector of dates that has inconsistencies in date formats. Here is a sample of this vector

dates = c("2018-11-06", "11/6/2018", "11-6-2018")

I want to convert this vector to a consistent format of "2018-11-06".

I have tried the following solutions but they are very slow -

  anytime::anydate(dates)
  as.Date(lubridate::parse_date_time(dates, '%m-%d-%Y')) # Partially correct results

Please suggest a solution that is fast (since I have a vector with millions of dates) as well as correct.

Following is the performance benchmark -

microbenchmark::microbenchmark(
  user2974951_1 = substr(parse_date_time(dates, c("%Y-%m-%d","%m/%d/%Y","%m-%d-%Y")), 1, 10),
  user2974951_2 = substr(fast_strptime(dates, c("%Y-%m-%d","%m/%d/%Y","%m-%d-%Y")), 1, 10),
  Wimpel = {
    dates_1 = gsub("/", "-", dates)
    dates_2 = ifelse(grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates_1),
                     gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates_1), 
                     dates_1)
    fasttime::fastDate(dates_2)
  },
  
  TarJae = substr(parse_date_time(dates, orders=c("ymd", "mdy")), 1, 10),
  times = 10)

The solution by Wimpel is the fastest. The solution by user2974951_2 is easy to understand and still quite fast, so I am accepting it as the answer.

Unit: microseconds
          expr      min       lq      mean    median       uq       max neval cld
 user2974951_1 7070.939 7147.098 7385.8261 7238.3540 7520.453  8193.161    10   b
 user2974951_2  319.894  355.736  404.0914  372.4270  415.670   707.338    10  a 
        Wimpel  112.448  130.324  148.4439  152.1875  159.463   190.922    10  a 
        TarJae 6123.001 6140.183 6959.6301 6277.1425 6838.099 12173.134    10   b

Many people have voted to close this question. Please do not close it: the solution given in Why are my functions on lubridate dates so slow? only applies when the date format is consistent, whereas in my question the date formats are inconsistent.
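To make the goal concrete, the fast_strptime() variant from the benchmark above can be wrapped as a reusable helper. This is only a sketch: the name `normalize_dates` is made up, and I use m/d ordering because "11/6/2018" should become "2018-11-06".

```r
library(lubridate)

# Hypothetical helper: try each candidate format and return Date values.
# m/d/Y ordering is an assumption based on the desired output "2018-11-06".
normalize_dates <- function(x) {
  as.Date(fast_strptime(x, format = c("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y")))
}

dates <- c("2018-11-06", "11/6/2018", "11-6-2018")
normalize_dates(dates)
```

fast_strptime() tries the supplied formats in order for each element, so a single vectorized call covers all three input styles.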

Saurabh

3 Answers


I believe this should work pretty fast:

Create a uniform date column, then use the fasttime package to convert:

dates = c("2018-11-06", "11/6/2018", "11-6-2018")

library(data.table)
library(fasttime)
DT <- as.data.table(dates)
# replace all "/" characters with "-"
DT[, dates2 := gsub("/", "-", dates)][]
# reorder strings in m-d-Y form (ending with a 4-digit year) to Y-m-d
DT[grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates2), 
   dates2 := gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates2)][]
# use fasttime::fastDate to convert
DT[, dates3 := fasttime::fastDate(dates2)][]
#         dates     dates2     dates3
# 1: 2018-11-06 2018-11-06 2018-11-06
# 2:  11/6/2018  2018-11-6 2018-11-06
# 3:  11-6-2018  2018-11-6 2018-11-06
Wimpel
  • Somehow, the year got converted to 2016 using ```fastDate```. – Saurabh Jan 18 '23 at 10:51
  • edited the answer.. had the order reversed by mistake – Wimpel Jan 18 '23 at 10:51
  • `data.table` and `fasttime` seems like the way to go. Would be interested in a benchmark showing whether your approach of manipulating the date character vector and then doing one conversion performs differently to a `data.table::fcase()` approach (or simply an `if` statement) based on the format of the character vector. My hunch is that introducing regex is going to slow it down quite a bit. – SamR Jan 18 '23 at 10:52
  • 2nd edit... problem was in my regexes for dates with a single digit day/month.. fixed now. – Wimpel Jan 18 '23 at 10:54
  • In ```dates3```, the date and month got flipped in case 2 and 3. Can you please fix this? – Saurabh Jan 18 '23 at 11:05
  • you can adapt the regex-replace pattern from `"\\3-\\2-\\1"` to `"\\3-\\1-\\2"`. edited in answer – Wimpel Jan 18 '23 at 11:48
  • Yes, this change fixed the problem. – Saurabh Jan 18 '23 at 12:54

Try

> lubridate::parse_date_time(dates, c("%Y-%m-%d","%m/%d/%Y","%m-%d-%Y"))
[1] "2018-11-06 UTC" "2018-11-06 UTC" "2018-11-06 UTC"

You can wrap the result in as.character(...) to remove the time zone, if you do not need the result to be a date class.
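For example, base R's format() or as.Date() will also drop the printed time zone. This is a sketch; I use m/d ordering here to match the desired "2018-11-06" output.

```r
library(lubridate)

dates <- c("2018-11-06", "11/6/2018", "11-6-2018")
parsed <- parse_date_time(dates, c("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y"))

# Format the POSIXct values as plain "YYYY-MM-DD" strings...
format(parsed, "%Y-%m-%d")
# ...or convert to Date, which prints without a time zone.
as.Date(parsed)
```

as.Date() keeps a proper date class for later arithmetic, while format() returns plain character strings.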

Another option is

fast_strptime(dates, c("%Y-%m-%d","%m/%d/%Y","%m-%d-%Y"))
user2974951
  • This solution is indeed fast. However, it adds a timezone at the end of the date "2018-11-06 UTC". Is there a way to remove the timezone? – Saurabh Jan 18 '23 at 10:47
  • You can select characters you want to keep in your string with the function `substr` – Yacine Hajji Jan 18 '23 at 10:54
  • Following worked fine - ```substr(parse_date_time(dates, c("%Y-%m-%d","%m/%d/%Y","%m-%d-%Y")), 1, 10)``` – Saurabh Jan 18 '23 at 10:56
  • 1
    Or: `lubridate::parse_date_time(dates,c("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")) |> format("%Y-%m-%d")` as `parse_date_time` returns a `POSIXct`. Alternatively `lubridate::ymd()` will do the trick if one want the `Date`-type preserved (it might not be as fast as `format` though). – harre Jan 18 '23 at 11:00
  • 1
    You use `lubridate` then you could do: `library(lubridate) parse_date_time(dates, orders=c("ymd", "mdy"))` – TarJae Jan 18 '23 at 11:11

tl;dr

This answer contains a fast, non-regex approach and also some benchmarking. The non-regex approach is slightly faster but the regex approach uses less memory. parse_date_time() is roughly an order of magnitude slower than either, taking over 3 minutes for the largest table, where the regex fasttime::fastDate() approach took 33 seconds and the non-regex version 21 seconds.

The facet labels in the plots below are the number of times the sample vector is repeated, so e.g. 10000 means a table of 30000 rows.

[Plot: benchmark timings (median and range) for each approach, faceted by the number of repeats of the sample vector]

A non-regex approach

My suspicion is that a non-regex approach is always faster so I tried a non-regex fasttime::fastDate() method:

DT <- as.data.table(dates)

DT[, third_char := substr(dates, 3, 3)]
DT[, `:=`(
    actualDate = fcase(
        third_char == "/", as.Date(dates, format = "%m/%d/%Y"),
        third_char == "-", as.Date(dates, format = "%m-%d-%Y"),
        third_char %in% 1:9, fastDate(dates)
    ),
    third_char = NULL
)]
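The same dispatch-on-third-character idea works on a plain character vector too, without data.table. This is a sketch with illustrative variable names; I match digits 0-9 rather than 1-9, which is a slight generalization.

```r
library(fasttime)

dates <- c("2018-11-06", "11/6/2018", "11-6-2018")
third_char <- substr(dates, 3, 3)

# Pre-allocate a Date vector, then fill each format group with the
# cheapest parser that handles it.
out <- rep(as.Date(NA_character_), length(dates))
out[third_char == "/"] <- as.Date(dates[third_char == "/"], format = "%m/%d/%Y")
out[third_char == "-"] <- as.Date(dates[third_char == "-"], format = "%m-%d-%Y")
iso <- third_char %in% as.character(0:9)  # "YYYY-MM-DD" has a digit here
out[iso] <- fastDate(dates[iso])
out
```

Only the already-ISO strings hit the fast C parser; the others go through base as.Date() with an explicit format, so no regex is needed anywhere.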

Benchmarking

I decided to benchmark the three approaches here:

  1. Wimpel's regular expression and fasttime::fastDate() method.
  2. My approach which also uses fasttime::fastDate() but avoids regex.
  3. The lubridate approach suggested by @user2974951 and modified by TarJae in comments. I have also converted the output into Date format rather than POSIXct so all output is identical.

Results

As well as the plot, I include the output below. Note there are very few iterations (sometimes only one) as the data gets larger, because each run takes a long time. Possibly the timings would change if more iterations were run. If anyone wants to spend some time replicating this, I include the code to do so below.

# A tibble: 12 x 14
#    expression       size      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#    <bch:expr>      <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#  1 regex              10     31ms  33.05ms  29.9        1.19MB   2.30      13     1   434.47ms
#  2 no_regex           10  12.78ms   13.8ms  69.1        6.58MB   9.53      29     4   419.91ms
#  3 parse_date_time    10 165.29ms 182.68ms   5.47      17.04MB   2.74       2     1   365.36ms
#  4 regex             100 364.25ms 366.44ms   2.73      10.74MB   0          2     0   732.89ms
#  5 no_regex          100 142.79ms 145.15ms   5.32      65.28MB  10.6        3     6   564.26ms
#  6 parse_date_time   100    1.95s    1.95s   0.514    195.77MB   2.06       1     4      1.95s
#  7 regex            1000     3.4s     3.4s   0.294    106.88MB   0          1     0       3.4s
#  8 no_regex         1000    1.62s    1.62s   0.619    652.36MB   0.619      1     1      1.62s
#  9 parse_date_time  1000   19.02s   19.02s   0.0526     1.93GB   0.473      1     9     19.02s
# 10 regex           10000   33.94s   33.94s   0.0295     1.04GB   0          1     0     33.94s
# 11 no_regex        10000   21.91s   21.91s   0.0456     6.37GB   0.137      1     3     21.91s
# 12 parse_date_time 10000    3.28m    3.28m   0.00508   15.67GB   0.0305     1     6      3.28m

Code to run the benchmark

library(data.table)
library(lubridate)
library(fasttime)

# the repeat sizes tested in the results above
sizes <- c(10, 100, 1000, 10000)

results <- bench::press(
    size = sizes,
    {
        dates <- rep(dates, size)
        DT1 <- as.data.table(dates)
        DT2 <- as.data.table(dates)
        DT3 <- as.data.table(dates)


        bench::mark(
            min_iterations = 1,
            max_iterations = 100,
            check = FALSE,
            regex = {
                # replace all "/" characters with "-"
                DT1[, dates2 := gsub("/", "-", dates)][]
                # reorder strings in the mm-dd-yyyy format to yyyy-mm-dd
                DT1[
                    grepl("^[0-9]{1,2}.[0-9]{1,2}.[0-9]{4}$", dates2),
                    dates2 := gsub("^([0-9]{1,2}).([0-9]{1,2}).([0-9]{4})$", "\\3-\\1-\\2", dates2)
                ][]
                # use fasttime::fastDate to convert, then drop the helper column
                DT1[, `:=`(
                    actualDate = fasttime::fastDate(dates2),
                    dates2 = NULL
                )][]
            },
            no_regex = {
                DT2[, third_char := substr(dates, 3, 3)]
                DT2[, `:=`(
                    actualDate = fcase(
                        third_char == "/", as.Date(dates, format = "%m/%d/%Y"),
                        third_char == "-", as.Date(dates, format = "%m-%d-%Y"),
                        third_char %in% 1:9, fastDate(dates)
                    ),
                    third_char = NULL
                )]
            },
            parse_date_time = {
                DT3[, actualDate :=
                    fastDate(parse_date_time(
                        dates,
                        orders = c("ymd", "mdy")
                    ))]
            }
        )
    }
)

Code to generate the plot

library(ggplot2)
library(dplyr)

results |>
    rowwise() |>
    mutate(max = max(unlist(time))) |>
    ungroup() |>
    transmute(
        expression = attr(expression, "description"),
        size,
        median,
        min,
        max,
        gc = sapply(gc, \(x){
            gc_level <- max(max(which(colSums(x) > 0)), 0) + 1
            c("none", names(x))[gc_level]
        })
    ) |>
    ggplot(aes(group = expression)) +
    geom_pointrange(
        aes(
            x = expression,
            y = median,
            ymin = min,
            ymax = max,
            color = gc
        )
    ) +
    facet_wrap(
        vars(size),
        scales = "free_y"
    ) +
    theme_bw(base_size = 16) +
    expand_limits(y = 0)
SamR