How to fast convert different time formats in large data frames?

Question

I want to calculate length in different time dimensions but I have problems dealing with the two slightly different time formats in my data frame column.

The original data frame column has about a million rows with the two formats (shown in the example code) mixed up .

Example code:

time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
         "2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")

length <- c(15.8, 132.1, 12.5, 33.2)

df <- data.frame(time, length)

df$time <- format(as.POSIXlt(strptime(df$time,"%Y-%m-%dT%H:%M:%SZ", tz="")))
df

The formats "2018-10-04T12:13:41.333Z" and "2018-10-04T12:13:45.479Z" result in NA.

Is there a solution that would also be applicable to a big data frame where the two formats are mixed up?

https://stackoverflow.com/questions/34389139/format-multiple-date-formats-in-one-columns-using-lubridate — Harshal Gajare, Jun 25 '19 at 12:22
In base R you could remove values after dots with `gsub("\\..*Z", "Z", time)` — Clemsang, Jun 25 '19 at 12:28

jay.sf · Accepted Answer · 2019-06-28T15:06:11.503

We may use %OS instead of %S to account for decimals in seconds.

help("strptime")

Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0).

as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"

This base R code is considerably faster than the package solutions, try it yourself.

Update 1

time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

This one is trickier. ?strptime says we should use %z for offsets from UTC, but somehow it won't work with as.POSIXct. Instead we could do this,

as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
  {os <- as.numeric(el(strsplit(substring(time2, 24), "\\:")))
  (os[1]*60 + os[2])*60}
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"

which cuts the unreadable part from the string, converts it to seconds and adds it to the "POSIXct" object.

If there are only hours as in time2, we could also say:

as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
  as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"

That the code is slightly longer now should not obscure the fact that it runs practically as fast as the one at top of the answer.

Update 2

You could wrap the current three variants into a function with if (nchar(x) == 29) ... else structure, such as this one:

fixDateTime <- function(x) {
  s <- split(x, nchar(x))
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20` , format="%Y-%m-%dT%H:%M:%SZ")
  else if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ")
  else if ("29" %in% names(s))
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
      {os <- as.numeric(el(strsplit(substring(s[[3]], 24), "\\:")))
      (os[1]*60 + os[2])*60}
  return(unsplit(s, nchar(x)))
}

res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 CEST" "2018-10-04 00:00:00 CEST" "2018-10-01 00:00:00 CEST"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 00:00:00" "2018-10-01 00:00:00"

Compared to the packages only fixDateTime can handle all three defined date-time types. According to the concluding benchmark the function is still very fast.

Note: The function logically fails if different date formats have the same nchar, and it should be customized in the case (e.g. by another split condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct.

Benchmark

# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval  cld
# fixDateTime  35.46387  35.94761  40.07578  36.05923  39.54706  68.46211    10   c 
#  as.POSIXct  20.32820  20.45985  21.00461  20.62237  21.16019  23.56434    10  b   # to compare
#   lubridate  11.59311  11.68956  12.88880  12.01077  13.76151  16.54479    10 a    # produces NAs! 
#     anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272    10    d # produces NAs!

Data

time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z", 
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z") 
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z", 
           "2018-10-01T11:42:37.000+03:00")

Benchmark code

n <-  1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)

library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
                               as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
                               lubridate=parse_date_time(t2, "ymd_HMS"),
                               anytime=anytime(t2),
                               times=10L)

wow thats really frustrating - I have spend quite some time on this and now you´re telling me there is one additional letter I was missing Thanks! — Lutz, Jun 25 '19 at 13:30
@Lasse You are welcome. It's a little hidden in the documentation of `?strptime`, penultimate paragraph before section "Value". — jay.sf, Jun 25 '19 at 13:36
Sorry to bother you again but I just came across another date format in my data frame column. Some of the rows have a date like this ```2018-09-01T12:42:37.000+02:00```. Is it possible to include that in ```as.POSIXct()``` as well? — Lutz, Jun 27 '19 at 16:16
@Lasse see update, I hope it's what you wanted. You could wrap this in a function with `if (nchar(x) == 29) ... else` structure. — jay.sf, Jun 27 '19 at 17:17
I think I get your idea - yet the `if...else` structure does not seem to work for all formats (I also don´t know what `r` wants to tell me with `Error: the condition has length > 1 and only the first element will be used` in this particular case. **My Example:** `t<- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z", "2018-10-01T11:42:37.000+03:00")` and the statement: `if (nchar(t)==29) {as.POSIXct(substr(t, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + as.numeric(substr(t, 24, 26))*3600} else {as.POSIXct(t, format="%Y-%m-%dT%H:%M:%OSZ")}` I might also put this in another question. THANKS! — Lutz, Jun 28 '19 at 13:06

score 2 · Answer 2 · answered Jun 25 '19 at 12:24

2

You can use library anytime

    library(anytime)
    time<- c("2018-07-29T15:02:05Z",
             "2018-07-29T14:46:57Z",
             "2018-10-04T12:13:41.333Z",
             "2018-10-04T12:13:45.479Z")
    anytime(time)
#[1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST" "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"

answered Jun 25 '19 at 12:24

LocoGris

4,432
3
15
30

1

thank you - unfortunately it takes ages for my big dataset to apply this command – Lutz Jun 25 '19 at 12:41

score 2 · Answer 3 · answered Jun 25 '19 at 12:30

or you can also use:

time<- c("2018-07-29T15:02:05Z",
         "2018-07-29T14:46:57Z",
         "2018-10-04T12:13:41.333Z",
         "2018-10-04T12:13:45.479Z")

length<-c(15.8,132.1,12.5,33.2)

df<-data.frame(time,length)
library(lubridate)

# df$time2<-as_datetime(df$time)
df$time2 <-parse_date_time(df$time, "ymd_HMS") 
df