2

I am trying to parse some timestamps (character vectors) as datetimes using R mutate and case_when.

Dummy data:

p_id = c(1,2,3,4,5,6)
ActualStartTime = c("2020-05-21 19:04:36 +01:00", "21/09/2020 14:14", "2020-08-18 10:11:08 +01:00", "12/10/2020 21:25", "09/11/2020 17:02","2020-05-16 11:50:58 +02:00")
ActualEndTime = c("2020-05-21 19:29:42 +01:00", "21/09/2020 14:19", "2020-08-18 10:14:26 +01:00", "12/10/2020 21:29", "09/11/2020 17:06", "2020-05-16 11:56:10 +02:00")
df <- data.frame(p_id,ActualStartTime, ActualEndTime)

df

  p_id            ActualStartTime              ActualEndTime
1    1 2020-05-21 19:04:36 +01:00 2020-05-21 19:29:42 +01:00
2    2           21/09/2020 14:14           21/09/2020 14:19
3    3 2020-08-18 10:11:08 +01:00 2020-08-18 10:14:26 +01:00
4    4           12/10/2020 21:25           12/10/2020 21:29
5    5           09/11/2020 17:02           09/11/2020 17:06
6    6 2020-05-16 11:50:58 +02:00 2020-05-16 11:56:10 +02:00

The timestamps are in two different formats, so I create a function without vectorising it to test it. If the length == 26 then it parses with one format, if the length is anything else it parses to the alternative format.

parse_mydate_novec <- function(time_var) {
  if (nchar(time_var) == 26) { 
    parse_date_time(time_var, orders = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")
  } else {
    parse_date_time(time_var, orders = "%d/%m/%Y %H:%M", tz = "UTC")
  }
}

parse_mydate_novec(df$ActualStartTime[1]) # this works, class is POSIXct
[1] "2020-05-21 18:04:36 UTC"

> parse_mydate_novec(df$ActualStartTime[2]) # this works, class is POSIXct
[1] "2020-09-21 14:14:00 UTC"

So far, so good. I then try vectorising the function using the data masking guidance https://dplyr.tidyverse.org/reference/dplyr_data_masking.html, so I can use it with mutate and using case_when instead of if else:

parse_mydate <- function(time_var) {
 case_when (
   nchar({{time_var}}) == 26 ~ parse_date_time({{time_var}}, orders = "%Y-%m-%d %H:%M:%S %z", tz = "UTC"),
   nchar({{time_var}}) == 16 ~ parse_date_time({{time_var}}, orders = "%d/%m/%Y %H:%M", tz = "UTC"),
   TRUE ~ {{time_var}})
} 

I then pass this function using mutate, first on one column to test it and then using mutate(across()):

df_test <- df %>%
  mutate(ActualStartTime = parse_mydate(ActualStartTime))

df_test <- df %>%
  mutate(across(c(ActualStartTime, ActualEndTime), ~parse_mydate(.x)))

However I get the following errors:

Error in `mutate_cols()`:
! Problem with `mutate()` column `ActualStartTime`.
ℹ `ActualStartTime = parse_um_date(ActualStartTime)`.
x must be a `POSIXct/POSIXt` object, not a character vector.
Caused by error in `glubort()`:
! must be a `POSIXct/POSIXt` object, not a character vector.

Warning messages:
1: Problem with `mutate()` column `ActualStartTime`.
ℹ `ActualStartTime = parse_um_date(ActualStartTime)`.
ℹ  3 failed to parse. 
2: Problem with `mutate()` column `ActualStartTime`.
ℹ `ActualStartTime = parse_um_date(ActualStartTime)`.
ℹ  3 failed to parse. 

This doesn't make sense as I've written the function to pass in a character vector and return a datetime object.

The desired output is a dataframe where all the objects in ActualStartTime and ActualEndTime are in POSIXct format i.e. "2020-05-21 18:04:36 UTC"

I've looked at: R dplyr using across() efficiently with mutate() and case_when() and R - How to pass parameters to function in "mutate across"? and several other questions on parsing datetimes.

I don't know whether I have the logic of the function wrong, the use of case_when, the use of mutate or something else. I've been going round in circles for hours. All help appreciated! With thanks.

LucieCBurgess
  • 759
  • 5
  • 12
  • 26
  • You don't need data masking because `mutate` is doing the tidy evaluation and will pass the "tidy-evaluated" arguments to your custom function. – Stefano Barbi Mar 24 '22 at 11:17
  • Another problem with your approach is that `case_when` will evaluate all the branches of your "conditional". It will produce errors when the format is not the right one. – Stefano Barbi Mar 24 '22 at 11:19

1 Answers1

2

The function lubridate::fast_strptime allows the specification of more formats that will be applied in turn till success.

library(dplyr)
library(lubridate)

df %>%
  mutate(across(matches("Time"), ~fast_strptime(.x,
                                              format = c("%Y-%m-%d %H:%M:%S %z",
                                                         "%d/%m/%Y %H:%M"),
                                              tz = "UTC")))


##>   p_id     ActualStartTime       ActualEndTime
##> 1    1 2020-05-21 18:04:36 2020-05-21 18:29:42
##> 2    2 2020-09-21 14:14:00 2020-09-21 14:19:00
##> 3    3 2020-08-18 09:11:08 2020-08-18 09:14:26
##> 4    4 2020-10-12 21:25:00 2020-10-12 21:29:00
##> 5    5 2020-11-09 17:02:00 2020-11-09 17:06:00
##> 6    6 2020-05-16 09:50:58 2020-05-16 09:56:10
Stefano Barbi
  • 2,978
  • 1
  • 12
  • 11
  • Thank you @Stefano Barbi, this works nicely. But I don't understand why my case_when isn't working. I thought the TRUE statement was supposed to deal with the cases when the branch of a conditional was not met? Also in my columns of my dataframe, all the branches conform to one of the two case_when statements. Thanks for your help! – LucieCBurgess Mar 24 '22 at 13:05
  • @LucieCBurgess Hi Lucie! Consider this: `x <- 1:10; case_when(x > 0 ~ x, x <= 0 ~ as.character(x) + 3)`. It fails with error, although the condition `x<=0` is never met. – Stefano Barbi Mar 24 '22 at 14:03
  • @LucieCBurgess Because both right hand sides are evaluated on all the `x`s and then assigned to the output according to `x > 0`. If there is an error in either side the whole computation fails. – Stefano Barbi Mar 24 '22 at 14:10
  • @LucieCBurgess Finally (and funnily)¸you could have set `parse_mydate <- Vectorize(parse_mydate_novec)` and it would have worked. – Stefano Barbi Mar 24 '22 at 14:21