I'm working on a data-munging pipeline with a lot of date columns in the data. Many R functions (e.g., set operations, sapply, etc.) do not preserve the Date class and return the dates as plain numbers instead.
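
A minimal illustration of the coercion, using a made-up vector `dates` (the variable names are assumptions for this sketch, not part of any real pipeline):

```r
# Hypothetical example data: three consecutive dates.
dates <- as.Date("2018-01-01") + 0:2

cls_input  <- class(dates)                                # "Date"
cls_sapply <- class(sapply(dates, function(d) d + 1))     # "numeric"
cls_unlist <- class(unlist(list(dates[1], dates[2])))     # "numeric"
cls_ifelse <- class(ifelse(dates > dates[1], dates, NA))  # "numeric"
```

In each of the last three calls the values survive as day counts, but the class attribute is gone.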

The strategies I see to deal with this are:

  1. Making sure that each function in the data-munging pipeline accepts and returns dates formatted as dates. Disadvantage: figuring out all the places to stick as.Date() is often tedious.
  2. Living with dates as integers in all the munging steps, converting them to dates only at the end. This has the disadvantage of making date manipulation (e.g., sequencing with by = "month") in the intermediate munging steps impossible without first converting to date.
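
For reference, strategy 2 might look roughly like the sketch below. The value `day_count` is a made-up example, and the origin assumes the numbers came from R's own Date class (days since 1970-01-01):

```r
# A date stored as a plain day count (hypothetical example value).
day_count <- 17532

# Convert only for the step that needs Date semantics.
start   <- as.Date(day_count, origin = "1970-01-01")
monthly <- seq(start, by = "month", length.out = 3)
format(monthly)      # "2018-01-01" "2018-02-01" "2018-03-01"

# Drop back to plain numbers for the rest of the pipeline.
as.numeric(monthly)  # 17532 17563 17591
```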

Any other options I'm missing? Is there a way to make R play nice with dates? To clarify, the data I'm dealing with is not just a time series: multiple columns contain dates. So, as far as I can tell, xts is of limited usefulness.

Victor Kostyuk
  • A third option would be to live with dates as integers unless you need them as a date class for a specific data-munging step, e.g. when you need `by = "month"`. This might be a compromise between (1) and (2) that is less tedious, less error-prone because you don't rely on automatic casting, yet flexible enough to do your munging chores. – tophcito Jan 01 '18 at 16:21
  • Perhaps [How to iterate over list of Dates without coercion to numeric in R?](https://stackoverflow.com/questions/14527564/how-to-iterate-over-list-of-dates-without-coercion-to-numeric-in-r) and links therein can be helpful. And the good ol' [`ifelse` "trap"](https://stackoverflow.com/questions/6668963/how-to-prevent-ifelse-from-turning-date-objects-into-numeric-objects)... – Henrik Jan 01 '18 at 16:59
  • Thanks Henrik -- that's helpful. PS. Why was the question marked off-topic? I could add code for a few of the operations that don't preserve dates, but I doubt it would improve the question. – Victor Kostyuk Jan 02 '18 at 00:28

2 Answers

It's probably not hard to replace the calls to sapply with a function that does what you want. For example,

sapply2 <- function(X, FUN, ...) {
  do.call(c, lapply(X, FUN, ...))
}

This isn't as general-purpose as the original sapply, but if the function you're using in sapply(X, FUN) returns dates, it will preserve them. If you want to use the optional arguments to sapply (such as simplify or USE.NAMES), you'll need something more elaborate.
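
A short usage sketch of the function above, with a made-up `dates` vector. It works because `c()` dispatches to the Date method when every list element is a Date:

```r
# sapply2 as defined above, repeated so the example is self-contained.
sapply2 <- function(X, FUN, ...) {
  do.call(c, lapply(X, FUN, ...))
}

# Hypothetical example data.
dates <- as.Date("2018-01-01") + 0:2

shifted <- sapply2(dates, function(d) d + 30)
class(shifted)   # "Date"
format(shifted)  # "2018-01-31" "2018-02-01" "2018-02-02"

# For comparison, plain sapply() returns bare numbers.
class(sapply(dates, function(d) d + 30))  # "numeric"
```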

I don't know how many other functions are in your "etc.", but I'd guess it's not very many, and most fixes aren't all that hard.

user2554330
  • Nice on the `sapply`! Besides the set operations, `unlist` and `ifelse` come to mind. I can certainly work around those, so indeed "fixes aren't hard", but it's very annoying to have to remember to apply those fixes everywhere in one's program. To me, this is one of the obvious weaknesses of types in R. – Victor Kostyuk Jan 01 '18 at 16:57

The "does not preserve Date class" misfeature is an artefact of R itself, and of how some base R functions are implemented. See e.g.

R> dates <- Sys.Date() + 0:2
R> for (d in dates) cat(d, "\n")
17532 
17533 
17534 
R> 

Essentially, the S3 class attribute gets dropped when you do certain vector operations:

R> as.vector(dates)
[1] 17532 17533 17534
R> 
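
One common workaround for the loop case, sketched here with a made-up `dates` vector, is to iterate over a list or over indices so that each element keeps its class:

```r
# Hypothetical example data.
dates <- as.Date("2018-01-01") + 0:2

# as.list() on a Date vector yields a list of length-one Date objects,
# so `d` is still a Date inside the loop.
seen <- character(0)
for (d in as.list(dates)) {
  seen <- c(seen, format(d))
}
seen  # "2018-01-01" "2018-01-02" "2018-01-03"

# Indexing the original vector also preserves the class.
for (i in seq_along(dates)) {
  stopifnot(inherits(dates[i], "Date"))
}
```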

So my recommendation is to pick a good container type you like and stick with it to do the operations there. I like data.table a lot for this. A quick example:

R> suppressMessages(library(data.table))
R> dt <- data.table(date=Sys.Date()+0:2, other=Sys.Date() + cumsum(runif(3)*100))
R> dt[, diff:=other-date][]
         date      other           diff
1: 2018-01-01 2018-03-30  88.88445 days
2: 2018-01-02 2018-06-09 158.23913 days
3: 2018-01-03 2018-07-30 208.62187 days
R> dt[, month:=month(other)][]
         date      other           diff month
1: 2018-01-01 2018-03-30  88.88445 days     3
2: 2018-01-02 2018-06-09 158.23913 days     6
3: 2018-01-03 2018-07-30 208.62187 days     7
R> 

Not only does the Date type persist (as evidenced by the difference operation returning a difftime object), but you also get lots of helper functions (like month()) here. Grouping by date is also natural.
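
A rough sketch of what grouping by date could look like, assuming the data.table package is installed. The table and its `value` column are invented for illustration:

```r
library(data.table)

# Hypothetical example data: two rows share a date, one falls in February.
dt <- data.table(
  date  = as.Date("2018-01-01") + c(0, 0, 1, 31),
  value = c(10, 20, 30, 40)
)

# Group directly on the Date column; the grouping column stays a Date.
by_day <- dt[, .(total = sum(value)), by = date]
class(by_day$date)  # "Date"

# Group on a derived month, using data.table's month() helper.
by_month <- dt[, .(total = sum(value)), by = .(month = month(date))]
```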

Dirk Eddelbuettel
  • I agree the lack of support for dates is a misfeature, but I don't see how `data.table` is relevant. I can do those operations outside of `data.table` and get the right answer. – user2554330 Jan 01 '18 at 17:07
  • With all due respect you seem to misunderstand what the question is. See eg this question (already linked above) about [preserving Date type](https://stackoverflow.com/questions/14527564/how-to-iterate-over-list-of-dates-without-coercion-to-numeric-in-r). You have no such issues with `data.table` which is why I find it convenient. As another plus, `data.table` solutions tend to be fast. But there are many different ways to skin the cat and if you have one you like better, by all means stick with it. – Dirk Eddelbuettel Jan 01 '18 at 17:11
  • But your own example fails: `cat(dt[1, "date"], "\n")` doesn't do any better than `cat(dates[1], "\n")`. – user2554330 Jan 01 '18 at 17:17
  • Use `format()` for display. Same story on `Date` class attribute being dropped. You still misunderstand _and the original question was on processing pipelines, ie many dates at once_ and your griping here has nothing to do with that. So please stop it. – Dirk Eddelbuettel Jan 01 '18 at 17:21
  • I agree, you should use `format()` for display, because `cat()` loses the class. But I still don't understand how `data.table` helps: I need to use `format()` to display an element from one of those too. I'm not griping, I'm asking for clarification. But if it upsets you too much, feel free to ignore the request. – user2554330 Jan 01 '18 at 17:36