0

I am converting from chr to POSIXct formatted as "%Y-%m-%d %H:%M:%S however, some of the times in the dataset do not have the seconds portion (%S) so when I convert to DateTime the times without the seconds are returned as empty cells - NA How do I make sure this does not happen. I want them all returned as DateTime regardless of if some time portions are missing or not?

The date-times that are in this format ("%Y-%m-%d %H:%M:%S")are returned correctly as POSIXct

But the date-times that are in this format ("%Y-%m-%d %H:%M") are returned as NA

This is the code is used for the conversion trips$ended_at <- as.POSIXct(trips$ended_at, format = "%Y-%m-%d %H:%M:%S") It is a huge dataset with over a million entries so i don't even know which datetimes don't have the seconds portion. Is there a way that those without the seconds' portion can just have zeroes and the end? for example, 2020-29-04 01:57 will be returned as 2020-29-04 01:57:00 when converted to POSIXct

Please help!

Datababe
  • 15
  • 4
  • is it always the case that its seconds and only seconds that are missing or are there cases where minutes/hours etc are missing too? e.g. `2020/01/18 15:NA:12` or `2020/01/18 15` – rg255 Dec 10 '21 at 12:29
  • 1
    I find it a odd that you show dates in the format `"29/04/2020"` and then say that `"%Y-%m-%d"` *works*. – r2evans Dec 10 '21 at 12:31
  • @rg255 yes it is always seconds that are missing – Datababe Dec 10 '21 at 12:41
  • yeah my bad i will edit that @r2evans – Datababe Dec 10 '21 at 12:42
  • I see your edit, thanks, but ... `"2020/29/04" != "%Y-%m-%d"`, the question examples are still inconsistent with your assertion of what works. Regardless, the code in my answer (method 2) will handle either of those formats directly. – r2evans Dec 10 '21 at 12:48
  • Okay will try method 2 and see @r2evans – Datababe Dec 10 '21 at 12:49

3 Answers3

2

Two methods:

  1. Concatenate the literal :00 onto the end of timestamps that only have hour/minute:

    as.POSIXct(trips$ended_at, format = "%Y-%m-%d %H:%M:%S")
    # [1] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [3] "2020-04-29 01:57:00 EDT" NA                       
    # [5] "2020-04-29 01:57:00 EDT"
    
    gsub("( [0-9]+:[0-9]+)$", "\\1:00", trips$ended_at)
    # [1] "2020-04-29 01:57:00" "2020-04-29 01:57:00" "2020-04-29 01:57:00"
    # [4] "2020-04-29 01:57:00" "2020-04-29 01:57:00"
    
    as.POSIXct(gsub("( [0-9]+:[0-9]+)$", "\\1:00", trips$ended_at), format = "%Y-%m-%d %H:%M:%S")
    # [1] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [3] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [5] "2020-04-29 01:57:00 EDT"
    
  2. If you have multiple "candidate" formats that you need to try, you can iteratively step through them. This loop iterates through the formats, place the most-likely candidates first. If at any point all timestamps have been converted, it early-exits the for loop.

    candidates <- c("%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M:%S", "%d/%m/%Y %H:%M")
    out <- as.POSIXct(trips$ended_at, format = "%Y-%m-%d %H:%M:%S")
    for (fmt in candidates) {
      if (!length(isna <- is.na(out))) break
      out[isna] <- as.POSIXct(trips$ended_at[isna], format = fmt)
    }
    out
    # [1] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [3] "2020-04-29 01:57:00 EDT" "2020-04-29 01:57:00 EDT"
    # [5] "2020-04-29 01:57:00 EDT"
    

Data

trips <- data.frame(ended_at = c("2020-04-29 01:57:00", "2020-04-29 01:57:00", "2020-04-29 01:57:00", "2020-04-29 01:57", "2020-04-29 01:57:00"))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • I tried your suggestion with `"( [0-9]+:[0-9]+)$", "\\1:00"` but it doesnt seem to add the `:00` for timestamps that dont have the seconds. Is that the syntax? – thentangler Jun 23 '22 at 18:57
  • @thentangler, I don't know what your data looks like. I suggest you ask a new question with sample data and code attempted. – r2evans Jun 23 '22 at 20:55
  • I did here: https://stackoverflow.com/questions/72735362/handling-timestamps-without-seconds-in-some-rows-using-r?noredirect=1#comment128474942_72735362 Im guessing it has to do with the regular expression im using? – thentangler Jun 24 '22 at 12:59
0

Here's what I usually do. Check the length of the character strings prior to POSIXct formatting via nchar(), store the results in a new column, for example:

trips$check<-nchar(trips$ended_at)

Then check that all trips$ended_at have the same length and add the missing seconds for those who don't:

trips$ended_at_new<-ifelse(trips$check==19,trips$ended_at,paste(trips$ended_at,":00",sep=""))

You can swap out the 19 for whatever datetime format you are using. Important note: This only works, if seconds are missing at the end of your timestamp, not if the timestamp is less than 19 characters long for any other reason.

  • `nchar` is a good attempt for doing this, but it does suffer a small issue: if days or months are single digit instead of 0-padded, then it shortens the overall length of the string. For instance, `as.POSIXct("29/4/2020 5:00:00", format = "%d/%m/%Y %H:%M:%S")` works fine yet with your `ifelse` it will have `:00` appended to the string. – r2evans Dec 10 '21 at 12:40
  • 1
    Good point. It should work in this case though, as the posters' example showed 0-padding. For general use your solution is more elegant. Thanks for the inspiration! –  Dec 10 '21 at 12:50
0

This is the approach I took, using ifelse() under the assumption that you are dealing with two possibilities - with and without seconds

date_time <- c("2020-01-18 20:12:16", "2020-01-18 20:12")

ifelse(nchar(date_time) == 16, 
       format(as.POSIXct(date_time, format="%Y-%m-%d %H:%M"), "%Y-%m-%d %H:%M:%S"), 
       format(as.POSIXct(date_time, format="%Y-%m-%d %H:%M:%S"), "%Y-%m-%d %H:%M:%S"))
rg255
  • 4,119
  • 3
  • 22
  • 40
  • See my [comment](https://stackoverflow.com/questions/70304425/r-datetime-series-missing-values/70304571#comment124278515_70304580) below about the use of `nchar` in this; subject to single-digits in day, month, and hour (not uncommon glitches in data). – r2evans Dec 10 '21 at 12:41
  • Two other things: (1) You can gain efficiency by `ifelse`-ing only the format itself, so `ifelse(..., "%Y..%S", "%Y..%M")`, and move `format(as.POSIXct(date_time, ..))` *outside* of the `ifelse`. In the current case, `format(as.POSIXct` is called twice as many times as needed. (2) `format(.)` breaks this and returns a string instead of a `POSIXt`-class object. I suggest just `as.POSIXct(date_time, format = ifelse(==,"",""))`. – r2evans Dec 10 '21 at 12:46