2

I'm looking for an efficient way to convert a column of strings in a data table into a column of dates with the proviso that the strings may be in one of three date formats -- number, %Y-%m-%d, %m/%d/%Y.

The following illustrates how a hypothetical function datefun would behave:

library(data.table)
dt <- data.table( my_dates = c('42292.7894','2014-06-22','11/25/2011','33661', NA))

datefun(dt$my_dates)
[1] "2015-10-15" "2014-06-22" "2011-11-25" "1992-02-27" NA 

which would be the same as applying as.Date to each string with knowledge of the format of that string.

 l <- dt$my_dates
 c( as.Date(as.numeric(l[1]), origin = "1899-12-30"), 
    as.Date(l[2],'%Y-%m-%d'), 
    as.Date(l[3],'%m/%d/%Y'), 
    as.Date(as.numeric(l[4]), origin = "1899-12-30"),
    as.Date(l[5]))
[1] "2015-10-15" "2014-06-22" "2011-11-25" "1992-02-27" NA   

I'm attempting to read data directly from excel which has not been formatted consistently.

jbaums
  • 27,115
  • 5
  • 79
  • 119
Kerry
  • 411
  • 4
  • 13
  • You will need to use regex matching and conditional conversion. For example, `stringr` package's `str_extract(dt, '\\d+-\\d+-\\d+')` would extract all 2015-10-15 strings. You can apply as.Date() to them. `str_extract(dt, '\\d+.\\d+')` for the kind 42292.7894, etc. If you want to enforce certain fixed number of digits, you can use better pattern matching. – Gopala Feb 25 '16 at 23:04
  • 1
    There have been a few questions on SO about similar problems. See if any of these help: [one](http://stackoverflow.com/questions/34389139/format-multiple-date-formats-in-one-columns-using-lubridate), [two](http://stackoverflow.com/questions/25463523/convert-variable-with-mixed-date-formats-to-one-format-in-r), [three](http://stackoverflow.com/questions/13764514/how-to-change-multiple-date-formats-in-same-column). – eipi10 Feb 25 '16 at 23:11
  • Thanks Gopala for your suggestions and eipi10 for the references. Very helpful and enlightening. – Kerry Feb 26 '16 at 01:43

1 Answers1

1

lubridate is handy for this. I think the other questions on this topic don't explicitly handle decimal days since origin, so here goes:

library(lubridate)
d <- parse_date_time(l, c('%Y-%m-%d', '%m/%d/%Y'))
d[is.na(d)] <- (ymd_hms("1899-12-30 00:00:00") + as.numeric(l) * 3600 * 24)[is.na(d]
d

## [1] "2015-10-15 18:56:44 UTC" "2014-06-22 00:00:00 UTC" "2011-11-25 00:00:00 UTC"
## [4] "1992-02-27 00:00:00 UTC" NA 

This assumes that any elements of l that are coercible to numeric are in the decimal days since origin format (with a consistent origin).

jbaums
  • 27,115
  • 5
  • 79
  • 119