29

What I've got so far is a dataframe column with dates in different character formats. A few appear in the %d.%m.%Y pattern, some in %m/%d/%Y :

data$initialDiagnose = as.character(data$initialDiagnose)
data$initialDiagnose[1:10]

[1] "14.01.2009" "9/22/2005"  "4/21/2010" "28.01.2010" "09.01.2009" "3/28/2005" "04.01.2005" "04.01.2005" "9/17/2010" "03.01.2010"

I want them as Date() in one format, but R refuses of course.
So I tried at first to change them by the separator:

data$initialDiagnose[grep('/', data$initialDiagnose)] = as.character.Date(data$initialDiagnose[grep('/', data$initialDiagnose)], format = '%m/%d/%Y')

Analog to the '.' dates. But it didn't work.

How can I change them all to one format, that I can work with them?

leonheess
  • 16,068
  • 14
  • 77
  • 112
Rob
  • 291
  • 1
  • 3
  • 3

3 Answers3

29

Since MattBagg's answer in 2012, lubridate has added the parse_date_time function which is designed for exactly this situation and can resolve this problem in a single line:

library(lubridate) 

data <- data.frame(initialDiagnose = c("14.01.2009", "9/22/2005", 
        "4/21/2010", "28.01.2010", "09.01.2009", "3/28/2005", 
        "04.01.2005", "04.01.2005", "Created on 9/17/2010", "03 01 2010"))

parse_date_time(data$initialDiagnose, orders = c('mdy', 'dmy'))

 [1] "2009-01-14 UTC" "2005-09-22 UTC" "2010-04-21 UTC" "2010-01-28 UTC" "2009-01-09 UTC"
 [6] "2005-03-28 UTC" "2005-01-04 UTC" "2005-01-04 UTC" "2010-09-17 UTC" "2010-03-01 UTC"

The orders= argument is a character vector containing the possible date-time parsing formats in the order they should be tested. So by giving c('mdy', 'dmy'), lubridate will try to parse all strings as Month, Date, Year format. If it can't do that successfully (for example, the date 14.01.2009 won't work as there is no 14th month), it will try the next in the list.

The order in which it tries formats is a bit complicated, but you generally don't have to worry about it. parse_date_time2 simply tries them in the order provided. parse_date_time takes a subset of the input strings and uses that as a training set to find the best performing set formats, which orders based on the function given by the select_formats argument, which by default prioritizes the most complex formats (ie, those with the most format tokens)

divibisan
  • 11,659
  • 11
  • 40
  • 58
28

I like lubridate for its ease of use:

library(lubridate) 

# note added ugly formats below
data <- data.frame(initialDiagnose = c("14.01.2009", "9/22/2005", 
        "4/21/2010", "28.01.2010", "09.01.2009", "3/28/2005", 
        "04.01.2005", "04.01.2005", "Created on 9/17/2010", "03 01 2010"))

mdy <- mdy(data$initialDiagnose) 
dmy <- dmy(data$initialDiagnose) 
mdy[is.na(mdy)] <- dmy[is.na(mdy)] # some dates are ambiguous, here we give 
data$initialDiagnose <- mdy        # mdy precedence over dmy
data
#   initialDiagnose
#       2009-01-14
#       2005-09-22
#       2010-04-21
#       2010-01-28
#       2009-09-01
#       2005-03-28
#       2005-04-01
#       2005-04-01
#       2010-09-17
#       2010-03-01
MattBagg
  • 10,268
  • 3
  • 40
  • 47
22
a <- as.Date(data$initialDiagnose,format="%m/%d/%Y") # Produces NA when format is not "%m/%d/%Y"
b <- as.Date(data$initialDiagnose,format="%d.%m.%Y") # Produces NA when format is not "%d.%m.%Y"
a[is.na(a)] <- b[!is.na(b)] # Combine both while keeping their ranks
data$initialDiagnose <- a # Put it back in your dataframe
data$initialDiagnose
[1] "2009-01-14" "2005-09-22" "2010-04-21" "2010-01-28" "2009-01-09" "2005-03-28" "2005-01-04" "2005-01-04" "2010-09-17" "2010-01-03"

Additionnaly here's the preceding method adapted to a situation where you have three (or more) different formats:

data$initialDiagnose
[1] 14.01.2009 9/22/2005  12 Mar 97  4/21/2010  28.01.2010 09.01.2009 3/28/2005 
Levels: 09.01.2009 12 Mar 97 14.01.2009 28.01.2010 3/28/2005 4/21/2010 9/22/2005

multidate <- function(data, formats){
    a<-list()
    for(i in 1:length(formats)){
        a[[i]]<- as.Date(data,format=formats[i])
        a[[1]][!is.na(a[[i]])]<-a[[i]][!is.na(a[[i]])]
        }
    a[[1]]
    }

data$initialDiagnose <- multidate(data$initialDiagnose, 
                                  c("%m/%d/%Y","%d.%m.%Y","%d %b %y"))
data$initialDiagnose
[1] "2009-01-14" "2005-09-22" "1997-03-12" "2010-04-21" "2010-01-28" "2009-01-09" "2005-03-28"
plannapus
  • 18,529
  • 4
  • 72
  • 94
  • Is this function robust? Presumably the date formats as argument should be ordered by frequency of occurrence in the initialDiagnose. Thanks! – JasperK May 18 '20 at 16:25
  • I don't think the frequency of occurrence is an issue as long as there is no mutually ambiguous formats (i. e. 2010-02-03 as potentially `%Y-%m-%d` or `%Y-%d-%m` for instance). – plannapus May 19 '20 at 08:51