1

I'm fairly new to R, but I've had a great deal of challenge doing what otherwise seems like a simple task.

I have a number of inconsistently recorded dates in one vector. My goal is to convert them all to the same R accepted date format in a new vector labeled clean_end_date. Thus far, I have been able to use a combination of as.Date and grep to format almost all into the general %Y-%m-%d format in R:

d$clean_end_date[grep("[0-9]{2}/[0-9]{2}/[0-9]{4}", d$End_Date, value=F)] <- as.character(as.Date(grep("[0-9]{2}/[0-9]{2}/[0-9]{4}", d$End_Date, value=T, fixed = FALSE), format="%d/%m/%Y"))

However, there are several dates that won't convert regardless of how I manipulate the code. The rows that produce NAs use dashes instead of slashes as separators. Different date formats have the same number of characters, so counting them won’t help distinguish between them. I can either identify these rows manually, or use a regular expressions string matching function to identify the format. I thought grep() would help here, but it does not.

The code I have used to format most of it is here:

d$clean_end_date <- NA
d$clean_end_date[nchar(as.character(d$End_Date))<10] <- as.character(as.Date(d$End_Date[nchar(as.character(d$End_Date))<10], format="%m/%d/%y"))
d$clean_end_date[grep("[0-9]{4}/[0-9]{2}/[0-9]{2}", d$End_Date, value=F)] <- as.character(as.Date(grep("[0-9]{4}/[0-9]{2}/[0-9]{2}", d$End_Date, value=T, fixed = FALSE), format="%Y/%m/%d"))
d$clean_end_date[grep("[0-9]{4}-[0-9]{2}-[0-9]{2}", d$End_Date, value=F)] <- as.character(as.Date(grep("[0-9]{4}-[0-9]{2}-[0-9]{2}", d$End_Date, value=T, fixed = FALSE), format="%Y-%m-%d"))
d$clean_end_date[grep("[0-9]{2}/[0-9]{2}/[0-9]{4}", d$End_Date, value=F)] <- as.character(as.Date(grep("[0-9]{2}/[0-9]{2}/[0-9]{4}", d$End_Date, value=T, fixed = FALSE), format="%d/%m/%Y"))
d$clean_end_date[d$Community_id==42 & nchar(as.character(d$End_Date))==10] <- as.character(as.Date(d$End_Date[d$Community_id==42 & nchar(as.character(d$End_Date))==10], format="%m/%d/%Y")) 

However, I am having a difficulty formatting dates in the format "11/31/2015" and "2014-02-29". I suspect this might be because R cannot distinguish between this format and formats of the same length like "2015/11/31" and "02-29-2014".

I would really appreciate help on this. I am fairly new to R, so would especially appreciate an answer that doesn't assume I speak R language fluently.

oguz ismail
  • 1
  • 16
  • 47
  • 69
CConroy
  • 11
  • 2
  • 2
    You might want to have a look at the guidance for asking R questions here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank Nov 21 '16 at 21:15
  • 1
    You might check out [`anytime::anydate`](http://dirk.eddelbuettel.com/code/anytime.html) or [`lubridate::parse_date_time`](https://github.com/hadley/lubridate), which can handle multiple orders. – alistaire Nov 21 '16 at 21:18

1 Answers1

1

I would suggest reflecting on the guess_format function available in the aforementioned lubridate package.

Examples

Quoting example from the official help linked above, given a set of dates:

x <- c('February 20th 1973',
       "february  14, 2004",
       "Sunday, May 1, 2000",
       "Sunday, May 1, 2000",
       "february  14, 04",
       'Feb 20th 73',
       "January 5 1999 at 7pm")

we can apply guess_format in the following manner:

require(lubridate)
as.Date(x = x, format = guess_formats(x, "mdy"))

Results

This will match some of the dates:

>> as.Date(x = x, format = guess_formats(x, "mdy"))
 [1] "1973-02-20" "2004-02-14" "2000-05-01" "2000-05-01" "2004-02-14" "1973-02-20" NA           NA          
 [9] NA           "2000-05-01" NA           NA 

You can explore different approaches to guess_format, on principle, it should result in a more parsimonious code than multiple combinations of grep

Konrad
  • 17,740
  • 16
  • 106
  • 167