0

I have a column of roughly 1300 characters which I need to extract a single date from, if the character contains a date (i.e. if NA then no date to be taken) and if it contains multiple dates, I only need one; if it contains an interval of dates, I don't need the dates.

For example, here are 10 observations that best exemplify the different cases with what I'd like to get as comments to the side:

string1 <- 'Eff. 1/1/96 ACWD Res #96-006 Service' # need the date
string2 <- 'NA' # irrelevant 
string3 <-'Effective 2/1/07' # need the date
string4 <- 'Effective: 3/01/2011' # need the date
string5 <- 'Eff. July 1, 1995 Ord. #92 Includes Cap Exp Ch' # need the date
string6 <- 'Effective: 2010-11' # need the date
string7 <- 'Eff. January 02' # need the date 
string8 <- 'Effective 1/1/09 Billing (svc prd 10/15 - 12/15/08)' # need first date only, not intervals
string9 <- 'Eff. 9/1/95 Resolution No. 63-95 1st 1000 g. free' # need the date
string10 <- '(svc prd 10/15-12/15/08)' # don't need interval dates

So strings 1 & 3 & 8 & 9 (same formats), string4, string5, string6, & string7 have different date formats. Further, string6, & string7 have more issues. string6 can be taken as 1/1/10(generally as 1/1/FIRST YEAR), whereas string7 has a year which can be identified by another character column, named FY containing values such as FY 9596, then string7 can be taken as 1/2/95.

Desired output for the 10 strings should be: (they could all also be YYYY-MM-DD, that doesn't matter much so long as they are consistent)

1/1/96
NULL
2/1/07
3/1/11
7/1/95
1/1/10
1/2/95
1/1/09
9/1/95
NULL

When I test it on the 10 all at once, using the following

for(j in 1:10){
strapplyc(string[j], "\\d+/\\d+/\\d+", simplify = TRUE)
}

I get the following due to the structural differences in the instances date formats:

Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { : 
  missing value where TRUE/FALSE needed

In particular, string5,string6,string7 fails to return what I need, as expected I get NULL instead; Further, string8 fails to return what I need, as I get

      [,1]      
[1,] "1/1/09"  
[2,] "12/15/08"

Lastly, string10 fails to return what I need, I instead get 12/15/08.

Is the most efficient approach to mutate with ifelse for string5,string6,string7??; for string10 I thought to assign NULL if a date is preceded by - as I thought that is likely signaling an interval which is irrelevant for my purpose but string6 contains a hyphen, which I need.

To the best of my knowledge I see somewhat related posts here here, & here. But thought this case was sufficiently different. Apologies in advance if that is not the case.

Any help is very much appreciated !!

yungmist
  • 13
  • 5
  • 2
    Well, from a programatic standpoint, I cannot really add more than your links. This is really cumbersome work to solve it programmatically. You would need many regexes and tiebreakers. If your structure is really that various, you might be better of to extract the most common cases programmatically and code the edge cases by hand. Even if we would provide you a function that solves all the issues in your example, it is very likely that there are other issues in your actual data. – mnist Jun 04 '20 at 22:22
  • Thank you, I definitely understand this is a tedious case. It seems all the relevant cases can be boiled down to atleast "Eff" preceding some sort of date format and any other subsequent date can then be ignored. Further, if there is no "Eff" it can be assigned NULL/ignored altogether. – yungmist Jun 04 '20 at 22:51

1 Answers1

0

Based off of @mnist's comment and a recognized pattern in my subsequent comment, I split the data (let myData denote my data frame and String denote the column of all 1300 string observations) with grepl

myData <- myData %>% filter(grepl("Eff|eff|Ef",String))

Then I again split myData into 2 subsets, with Case 1 (nice case) corresponding to filter(grepl("\\d+/\\d+/\\d+", String)) and Case 2 corresponding to filter(!grepl("\\d+/\\d+/\\d+", String)) respectively. As it turns out, Case 2 (annoying case) only amounts to 3% of the observations (<50 obs) which I suppose I will deal with manually since it is not much.

Turns out Case 1 only had one observation like string8 so I corrected that manually.

yungmist
  • 13
  • 5