I have a data frame:

'data.frame':   2611029 obs. of  10 variables:
 $ eid              : int  28 28 28 28 28 36 36 36 36 37 ...
 $ created          : Factor w/ 36204 levels "0000-00-00 00:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
 $ class_id         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ min.e.event_time.: Factor w/ 16175 levels "2013-04-15 11:17:19",..: NA NA NA NA NA NA NA NA NA NA ...
 $ lead_date        : Factor w/ 11199 levels "2012-10-11 18:39:12",..: NA NA NA NA NA NA NA NA NA NA ...
 $ camp             : int  44698 44698 44699 44701 44701 44715 44715 44909 44909 44699 ...
 $ event_date       : Factor w/ 695747 levels "2008-01-18 12:18:01",..: 1 5 2 32 36 6 17039 23 24 2 ...
 $ event            : Factor w/ 3 levels "click","open",..: 3 2 3 3 2 3 2 3 2 3 ...
 $ message_name     : Factor w/ 2707 levels ""," 2015-03 CAD Promotion Update",..: 2163 2163 2163 1106 1106 2163 2163 1990 1990 2163 ...
 $ subject_lin      : Factor w/ 2043 levels ""," Christie Office Holiday Hours",..: 613 613 613 248 248 613 613 612 612 613 ...

Each line item is an instance of a user (eid) having received an email (event_date).

event_date, lead_date and created are all dates. Until now I have converted them with as.Date() only after subsetting the data to records where complete.cases() holds for these date columns. That let me aggregate and subset on conditions such as event_date < lead_date.

If I try to convert the dates in the data as is, without removing NA values, I receive the message:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

The purpose of the analysis is to look at the impact of receiving an email on becoming a lead (thus lead_date would be populated, NA otherwise). I therefore don't want to exclude people who never became a lead by subsetting the entire df on complete lead dates.

But I still want to perform calculations on those records with dates, leaving the NAs as their own group.

Is there anything I can do here? I want R to ignore NA results when using functions like subset or aggregate. I also want to convert all the non-NA dates into dates using as.Date().

**Edit following posting:** I probably could have asked this much more simply: can I convert a field in a data frame to a date where that is feasible and leave NA values as NA otherwise?

Doug Fir
  • Please check http://stackoverflow.com/questions/14755425/what-are-the-standard-unambiguous-date-formats – akrun Mar 23 '15 at 17:29
  • Hi again @akrun. I'm looking over it now, and at ?as.Date. I have a hard time understanding what to do with R documentation and find I rarely understand what to do next. In fact, whenever I read R documentation I usually end up more confused than I was before opening it. Is there a parameter that I can pass in as.Date() that says "ignore NAs"? – Doug Fir Mar 23 '15 at 17:36
  • I am not sure this is connected with the `NA` values. If it were, `v1 <- c('2008-01-01', '2009-05-02', NA); as.Date(v1) #[1] "2008-01-01" "2009-05-02" NA` would throw an error. Instead it accepts the NA and returns NA for those elements. – akrun Mar 23 '15 at 17:41
  • Also, as I mentioned earlier, it may be best to show a few lines of your data using `dput`. – akrun Mar 23 '15 at 17:42
  • One of your factor levels is "0000-00-00 00:00:00". When I try to use `as.Date.factor`, that value causes the error you are seeing. Try `as.Date(factor("0000-00-00 00:00:00"))`. You may need to set those items to NA first. There is no year == 0000. (Arguably this should have happened automatically but if you do provide a format string coercion to NA does occur.) – IRTFM Mar 23 '15 at 18:53

1 Answer

Replace all your as.Date( ) calls with as.Date( , format="%Y-%m-%d")

> as.Date(factor("0000-00-00 00:00:00"))
Error in charToDate(x) : 
  character string is not in a standard unambiguous format
> as.Date(factor("0000-00-00 00:00:00"), format="%Y-%m-%d")
[1] NA
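
Applied to a whole data frame, the same idea might look like the sketch below. The miniature `df` and the `date_cols` vector are made up for illustration and only borrow column names from the question. Because a format string is supplied, values that cannot be parsed (including the all-zero timestamp) should come back as NA instead of raising an error, and existing NA values stay NA.

```r
# Hypothetical miniature of the data frame in the question (values invented)
df <- data.frame(
  eid        = c(28L, 36L, 37L, 40L),
  lead_date  = factor(c("2013-04-15 11:17:19", NA,
                        "0000-00-00 00:00:00", "2014-01-02 08:00:00")),
  event_date = factor(c("2013-04-01 09:00:00", "2013-05-01 10:30:00",
                        "2013-06-01 12:00:00", "2014-02-01 07:00:00"))
)

# Columns to convert; this list is an assumption for the example
date_cols <- c("lead_date", "event_date")

# Convert each column, passing the explicit format through to as.Date()
df[date_cols] <- lapply(df[date_cols], as.Date, format = "%Y-%m-%d")

str(df)
```

The time-of-day part of each string is ignored here because the format only describes the date portion.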

Then describe the problems (code and errors) you encounter with the updated dataset. It's not possible to predict from the description where you will get stuck on the next steps. There is an is.na function that can be used in combination with other logical tests.

Do remember that is.na(NA) | NA will return TRUE. That works with | (OR) but not with & (AND): is.na(NA) & NA returns NA.
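
A small sketch of that NA logic, using a made-up date vector `x` and a hypothetical `cutoff` date, shows why putting `is.na()` on the left of `|` keeps the missing entries:

```r
# OR with a TRUE operand is TRUE even when the other side is NA
or_result  <- is.na(NA) | NA    # TRUE
# AND with a TRUE operand still depends on the unknown side
and_result <- is.na(NA) & NA    # NA

# Invented example dates; the middle one is missing
x      <- as.Date(c("2013-04-15", NA, "2000-06-01"))
cutoff <- as.Date("2001-01-01")

plain_test <- x > cutoff               # TRUE NA FALSE
keep_na    <- is.na(x) | x > cutoff    # TRUE TRUE FALSE
```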

IRTFM
  • Thanks for the answer. I added the , format="%Y-%m-%d" parameter to my as.Date() function and all "worked" in that I received no errors. But it looks like R behaves in a way that when I subset based on this it will remove the NA records from the data frame, rather than treat them as their own group, which is what I was hoping for. – Doug Fir Mar 23 '15 at 20:39
  • As I said, you need to use `is.na`: `subset(df, is.na(dt) | dt >as.Date("2001-01-01"))` # the R OR symbol is `|` – IRTFM Mar 23 '15 at 20:54
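
One possible way to keep the never-a-lead records as their own group, sketched on invented data: the `df` below and the `lead_status` column are hypothetical, and the labels are only one way to split the rows.

```r
# Invented rows using the question's column names, already converted to Date
df <- data.frame(
  eid        = c(28L, 28L, 36L, 37L, 40L),
  event_date = as.Date(c("2013-04-01", "2013-05-01", "2013-05-01",
                         "2013-06-01", "2014-02-01")),
  lead_date  = as.Date(c("2013-04-15", "2013-04-15", NA, NA, "2014-01-02"))
)

# Keep emails sent before the lead date, plus everyone with no lead date
kept <- subset(df, is.na(lead_date) | event_date < lead_date)

# Label every row so the NA cases form an explicit group (hypothetical column)
df$lead_status <- ifelse(
  is.na(df$lead_date), "never a lead",
  ifelse(df$event_date < df$lead_date, "email before lead", "email after lead")
)

# Count rows per group; the NA rows are counted under "never a lead"
counts <- aggregate(eid ~ lead_status, data = df, FUN = length)
print(counts)
```

Turning the missing dates into a named category first means functions such as `aggregate` and `table` have a real value to group on, rather than an NA that they would otherwise skip.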