0

I am trying to make a new column in my data.table. I have two columns, one with a start date and one with an end date. The starting date always is 2016-02-28. The end date in some cases is 2014-12-31 and in others it is 2020-12-31 (all in YYYY-MM-DD format).

In the first case it's evident that I should get a negative difference in dates. In the second case it is positive.

I want to use the sapply function with an ifelse statement to determine the difference in dates. Any time, the difference is negative, I want R to replace this with the value 1.

I do this as follows.

sapply(df$end.date, function(x) { ifelse(df$end.date>start_date, as.integer(length(seq(from=start_date, to=as.POSIXct(x,format="%Y-%m-%d"), by ='month')) ), 1) } )

Unfortunately, I get the following error

Error in seq.POSIXt(from = start_date, to = as.POSIXct(df$end.date,  : 
  'from' must be of length 1

How can I make this work?

PS: both start_date and df$end.date are in POSIXct format in a data.table.

Strawhat
  • 5
  • 1
  • 3

2 Answers2

2

ifelse is already vectorised, doubling up sapply and ifelse is redundant.

Unfortunately ifelse won’t work here because we cannot get the month difference for negative dates (as per your comment). So we just use if in combination with mapply instead:

months_between = function (start, end) {
     if (end > start)
         length(seq(start, end, by = 'month'))
     else
         1
}

df$new_column = mapply(months_between, df$start.date, df$end.date)

I’m also pretty sure that there’s a better way to write months_between but I’m not versed in the base R date manipulation functions since they are generally quite bad; I recommend using the ‹lubridate› package instead.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • Still get this error though Error in seq.int(r1$mon, 12 * (to0$year - r1$year) + to0$mon, by) : wrong sign in 'by' argument – Strawhat Apr 10 '17 at 15:07
  • @Strawhat Ah, very good point, it will still fail for negative months. Guh. See my edit. – Konrad Rudolph Apr 10 '17 at 15:09
  • It works, though it takes a lot of time to run the mapply function. I'll wait for other responses and after that I'll upvote it. – Strawhat Apr 11 '17 at 11:01
1

I think you're approach is overly complicated. If you're going to use sapply, you ought to be able to avoid ifelse since you will be able to focus on one value at a time (this assumes you are running a vector through sapply. This might not hold true if running a list through sapply). If you really want to use an apply function, however, you'd be better off using mapply with an if ... else clause.

But the apply function isn't necessary at all. In fact, the ifelse function isn't necessary. You can simplify the process a great deal with:

# Borrowed code from http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates/1996404
elapsed_months <- function(end_date, start_date) {
  mapply(
    function(end_date, start_date){
      ed <- as.POSIXlt(end_date)
      sd <- as.POSIXlt(start_date)
      12 * (ed$year - sd$year) + (ed$mon - sd$mon)
    },
    end_date,
    start_date,
    SIMPLIFY = FALSE
  )
}


DFrame <- data.frame(start = rep(as.Date("2016-02-28"), 2),
                     end = as.Date(c("2014-12-31", "2020-12-31")))

DFrame$diff <- elapsed_months(DFrame$end, DFrame$start)
DFrame$diff[DFrame$diff < 0] <- 1

DFrame

All I did was calculate the difference for all of the variables, obtain an index for the negative values, and replace them with 1.

An alternative approach would be to do the indexing up front. This way you aren't calculating the difference in dates for any values you will eventually change. This might have a benefit if you have a few million rows, but I would guess the performance increase would be small.

DFrame$diff2 <- vector("numeric", nrow(DFrame))
end_first <- DFrame$end < DFrame$start
DFrame$diff2[!end_first] <- elapsed_months(DFrame$end[!end_first], DFrame$start[!end_first]) 
DFrame$diff2[end_first] <- 1
Benjamin
  • 16,897
  • 6
  • 45
  • 65
  • Does `-` do the month-wise difference between dates? – Konrad Rudolph Apr 10 '17 at 15:01
  • No, it doesn't. I'll be honest, I didn't read too much into his code until I saw your answer. I just saw `sapply` and `ifelse` and knew there would be a better solution. The text of his answer just said difference in dates. I'll adapt my answer to get the desired output. – Benjamin Apr 10 '17 at 15:04
  • I borrowed code from http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates/1996404 to get the difference in months in a decent base solution. There's probably a decent `lubridate` solution somewhere, but I haven't spent any time looking for it. @KonradRudolph – Benjamin Apr 10 '17 at 15:11