1

I have a large dataset (a few millions observations) that contains a datetime variable with an inconsistent format: "%Y-%m-%d %H:%M:%S" ; "%m/%d/%Y and %H:%M:%S".

Here is how the dataset looks like:

df <- data.frame(var1 = c(1:6), 
   var2 = c("A", "B", "C", "A", "B", "C"), 
   datetime = c("2013-07-01 00:00:02", "2016-07-01 00:00:01", 
            "9/2/2014 00:01:20", "9/1/2014 00:00:25",
            "1/1/2015 0:07", "6/1/2015 0:01"))

Is there an efficient way to format the datetime variable into a unique, consistent date time format?

lmo
  • 37,904
  • 9
  • 56
  • 69
user24347
  • 11
  • 1
  • 1
    If you didn't have ambiguous dates (such as 9/2/2014), you could use `anytime` from the anytime package. As it is, you can only use `as.POSIXct` with all formats you have in your data and merge the results into one vector – Roland Sep 07 '17 at 14:50

2 Answers2

3

You can use lubridate package like this.

lubridate::parse_date_time(x = df$datetime, c("ymd HMS","mdy HMS"))

[1] "2013-07-01 00:00:02 UTC" "2016-07-01 00:00:01 UTC" "2014-09-02 00:01:20 UTC"
[4] "2014-09-01 00:00:25 UTC" NA                        NA                       
Warning message:
 2 failed to parse. 

lubridate::parse_date_time(x = df$datetime, c("ymd HMS","mdy HMS","mdy HM"))

[1] "2013-07-01 00:00:02 UTC" "2016-07-01 00:00:01 UTC" "2014-09-02 00:01:20 UTC"
[4] "2014-09-01 00:00:25 UTC" "2015-01-01 00:07:00 UTC" "2015-06-01 00:01:00 UTC"

You can specify your date-time formats as needed, you may compare two examples I mentioned.

Hope this helps you. :)

d.b
  • 32,245
  • 6
  • 36
  • 77
Akarsh Jain
  • 930
  • 10
  • 15
2

POSIXCT solution using parse_date_time.

EDIT: incorporating @Akarsh Jain's POSIXCT formatting for better time alignment.

df$new_date <- parse_date_time(df$datetime, c("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H:%M"))
D.sen
  • 938
  • 5
  • 14