0

This may seem like a silly question, but it has caused me days of time firstly identifying what the problem is and now trying to sort it out!

I have a dataframe with a column filled with datetimes. This dataframe is a product of a bunch of different processing steps applied to a series of individual files that were at some point merged into a single file (the individuals are identified by an id column). As the files must have been treated differently at some point, the format of the datetime column differs between the unique id blocks - i.e. some cells have the format "%Y-%m-%d %H:%M:%S" and some have the format "%d/%m/%Y %H:%M:%S" even though all look like they are in the second of the two formats when I open the csv file.

I want to load the data and split it by the unique id values before then doing a bunch of other stuff to it based on the datetime column. Unfortunately, no matter what I try to do, the data won't all be coerced into a single type of datetime format. I have tried reformatting the column in the original csv file, or loading the data and reclassing it as a POSIXct object with a single format, but when I do this those id blocks in the format that I am not specifying with the POSIXct command return as NAs (obviously!). I have also tried unclassing the data into a numeric format. Nothing I try seems to work, and my dataframe is too big to go through each id block separately! There has to be a way to do this and it has to be simple! Please someone put me out of my misery! Many thanks.

Example data:

> dput(t)

  row_id            datetime id
1 165656 09/02/2017 15:50:55  1
2 165657 09/02/2017 15:51:25  1
3 165658 09/02/2017 15:51:55  1
4 165659 09/02/2017 15:52:25  1
5 165660 09/02/2017 15:52:55  1
6 165661 2017-02-09 15:53:25  2
7 165662 2017-02-09 15:53:55  2
8 165663 2017-02-09 15:54:25  2
9 165664 2017-02-09 15:54:55  2

I have tried the below in both datetime formats (both of which only work for one of the two):

>t$datetime = as.POSIXct(strptime(t$datetime, format="%Y-%m-%d %H:%M:%S"), tz="UTC")
>t$datetime
[1] NA                        NA                        NA                       
[4] NA                        NA                        "2017-02-09 15:53:25 UTC"
[7] "2017-02-09 15:53:55 UTC" "2017-02-09 15:54:25 UTC" "2017-02-09 15:54:55 UTC"

>t$datetime = unclass(as.POSIXct(strptime(t$datetime, "%Y-%m-%d %H:%M:%S"))) 
jjulip
  • 1,093
  • 4
  • 16
  • 24

2 Answers2

2

You could use ifelse combined with grepl to check for a particular date format mask before calling as.POSIXct:

t$dt <- ifelse(grepl("\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}", t$datetime),
               as.POSIXct(strptime(t$datetime, "%d/%m/%Y %H:%M:%S")),
               as.POSIXct(strptime(t$datetime, "%Y-%m-%d %H:%M:%S")))

This assumes that you only have two date formats present. If there could be there or more, we would have to change the above solution to take that into account.

Here is some sample data to show that this can be made to work:

t <- data.frame(datetime=c("09/02/2017 15:50:55", "2017-02-09 15:50:55"))
t$dt <- ifelse(grepl("\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}", t$datetime),
               as.POSIXct(strptime(t$datetime, "%d/%m/%Y %H:%M:%S")),
               as.POSIXct(strptime(t$datetime, "%Y-%m-%d %H:%M:%S")))
t

             datetime         dt
1 09/02/2017 15:50:55 1486651855   <-- same values for dt, as expected
2 2017-02-09 15:50:55 1486651855
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
2

We can use parse_date_time from lubridate and specify various formats data can take.

library(lubridate)
df$datetime <- parse_date_time(df$datetime,c("%d/%m/%Y %T", "%Y-%m-%d %T"))
df$datetime

#[1] "2017-02-09 15:50:55 UTC" "2017-02-09 15:51:25 UTC" "2017-02-09 15:51:55 UTC"
#[4] "2017-02-09 15:52:25 UTC" "2017-02-09 15:52:55 UTC" "2017-02-09 15:53:25 UTC"
#[7] "2017-02-09 15:53:55 UTC" "2017-02-09 15:54:25 UTC" "2017-02-09 15:54:55 UTC"

class(df$datetime)
#[1] "POSIXct" "POSIXt" 

data

df <- structure(list(row_id = structure(1:9, .Label = c("1 165656", 
"2 165657", "3 165658", "4 165659", "5 165660", "6 165661", "7 165662", 
"8 165663", "9 165664"), class = "factor"), datetime = structure(1:9,
.Label = c("09/02/2017 15:50:55", "09/02/2017 15:51:25", "09/02/2017 15:51:55", 
"09/02/2017 15:52:25", "09/02/2017 15:52:55", "2017-02-09 15:53:25", 
"2017-02-09 15:53:55", "2017-02-09 15:54:25", "2017-02-09 15:54:55"), 
class = "factor"), id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L)), 
class = "data.frame", row.names = c(NA, -9L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213