1

Objective: To map a date from one dataframe to another given that it falls within a certain date interval. For example, let's say we need to deliver a gift within the time interval from either 20/12/2017 to 25/12/2017 or 26/12/2017 to 30/12/2017, and receive a response from the gift sender on the 23/12/2017. I want to create a function that can identify where to place the response date based on the date interval it falls within. In the example above, the response date would fall in the interval 20/12/2017 to 25/12/2017. Note: The term "Match" below means a certain condition is met from one data frame with another.

Here is some sample code to demonstrate what I mean.

# Creating the Data Frame with a start and end date interval
StartDate <- seq(as.Date("2000/1/1"), by = "month", length.out = 10)

EndDate <- StartDate +7

Dates_Interval <- data.frame(StartDate,EndDate)

# Creating a second data frame with the response dates only 
ResponseDate <- seq(as.Date("2000/1/6"), by = "month", length.out = 10)

Response_Substitute <- data.frame(ResponseDate)
# Substituting random NA values 
Response_Substitute[c(1,5,8),] <- NA


# > Response_Substitute 
#     ResponseDate
#    1          <NA>
#    2    2000-02-09
#    3    2000-03-06
#    4    2000-04-06
#    5          <NA>
#    6    2000-06-06
#    7    2000-07-06
#    8          <NA>
#    9    2000-09-06
#    10   2000-10-06

# Creating a function which evaluates a value in data frame two             
# (Response_Substitute) and checks 
# whether it meets
# a condition in Dates_Interval. 

dateresponses <- function(x,y,z) {
  sub_date <- ifelse ( y <=x && x <= z, x, NA)
  converteddate <- as.Date(sub_date, origin = "1899-12-30")
  return(converteddate)
}
# Example of the function in use to show how it matches a certain condition. 
x <- Response_Substitute[2,1] 
b <- dateresponses(x,Dates_Interval[2,1],Dates_Interval[2,2])


# > b
# [1] "1930-02-04"

# Example of the function in use to show when a response date does not 
# match a certain condition
   x <- Response_Substitute[2,1] <- as.Date("2000/2/9")
   b <- dateresponses(x,Dates_Interval[2,1],Dates_Interval[2,2])
# > b
#  [1] NA

# Example of the function in use to show when there is no response date in      
# the Response_Substitute variable 
   x <- Response_Substitute[1,1] 
   b <- dateresponses(x,Dates_Interval[2,1],Dates_Interval[2,2])
# > b
#  [1] NA

I need a function that will be able to create a new column in the Dates_Interval data frame which matches the response date with the date interval it falls within from StartDate and EndDate columns. If there is no match, then the response will be NA if there is no response. If there is a response but the response date does not fall into any interval, then I want a dataframe to be created that captures unmatched responses.

This is what the final dataframe could look like:

Dates_Interval$ResponseDate <- Response_Substitute
     # > Dates_Interval
   # StartDate    EndDate ResponseDate
# 1  2000-01-01 2000-01-08         <NA>
# 2  2000-02-01 2000-02-08   2000-02-06
# 3  2000-03-01 2000-03-08   2000-03-06
# 4  2000-04-01 2000-04-08   2000-04-06
# 5  2000-05-01 2000-05-08         <NA>
# 6  2000-06-01 2000-06-08   2000-06-06
# 7  2000-07-01 2000-07-08   2000-07-06
# 8  2000-08-01 2000-08-08         <NA>
# 9  2000-09-01 2000-09-08   2000-09-06
# 10 2000-10-01 2000-10-08   2000-10-06

And for response dates that are not NA but do not match any interval another dataframe could be created like this:

Unmatched_Response_Date <- data.frame(seq(as.Date("2000/1/9"), by = "month", 
length.out = 2))

colnames(Unmatched_Response_Date) <- "Unmatched Responses"

Unmatched_Response_Date
# > Unmatched_Response_Date
# Unmatched Responses
# 1          2000-01-09
# 2          2000-02-09

EDIT: There is bug I have noticed when using the dateresponses function. When I use a date from the Response_substitute data frame. The output of the date is not the same as the data frame. e.g. for Response_substitute[2,1] the value should be 2000-02-09 but instead I get 1930-02-04. Any ideas also for solving this issue?

  • Why don't you use `NA` instead of `"N/A"` within the `ifelse` in your function? That returns `NA` instead of the error message for your error example. – LAP Nov 07 '17 at 11:36
  • 1
    @LAP That is a great point! I will change that into the code to sort out the limitation issue. This could help us to solve the overall question thank you. – MrReference Nov 07 '17 at 12:12
  • https://stackoverflow.com/questions/21560500/data-table-merge-based-on-date-ranges. That appears to have a similar idea. If you want, I can write a response to your particular problem. – jacobsg Nov 07 '17 at 22:05
  • @jacobsg I had a look at the link you sent me and will see what I can do with it. It would still be great if you could provide a response to my particular problem nevertheless. Much appreciated thanks! – MrReference Nov 08 '17 at 21:03

1 Answers1

0

Here is the code you provided:

StartDate <- seq(as.Date("2000/1/1"), by = "month", length.out = 10)
EndDate <- StartDate +7
Dates_Interval <- data.frame(StartDate,EndDate)
# Creating a second data frame with the response dates only 
ResponseDate <- seq(as.Date("2000/1/6"), by = "month", length.out = 10)
Response_Substitute <- data.frame(ResponseDate)
# Substituting random NA values 
Response_Substitute[c(1,5,8),] <- NA

So to answer you question, I added another date column so that we had an interval in both data.frames. I also removed the NA values from the first data.frame containing the responses. These don't appear to have any bearing on your expected output. Correct me if wrong.

Response_Substitute$Date2 <- Response_Substitute$ResponseDate - 1
Response_Substitute <- Response_Substitute[!is.na(Response_Substitute$ResponseDate),]

The crux of this question comes down to using a data.table function called foverlaps(). From the documentation, this is a fast overlap join function. It is designed to find where two intervals overlap and join the data together. The code below does just that.

This is also why I needed to generate the other date in the Response_Substitute. foverlaps() requires two intervals to work.

library(data.table)
Dates_Interval <- as.data.table(Dates_Interval)
Response_Substitute <- as.data.table(Response_Substitute)
setkey(Response_Substitute, Date2, ResponseDate)
join_df <- foverlaps(Dates_Interval, Response_Substitute, 
          by.x = c('StartDate', 'EndDate'))

Output:

   ResponseDate      Date2  StartDate    EndDate
 1:         <NA>       <NA> 2000-01-01 2000-01-08
 2:   2000-02-06 2000-02-05 2000-02-01 2000-02-08
 3:   2000-03-06 2000-03-05 2000-03-01 2000-03-08
 4:   2000-04-06 2000-04-05 2000-04-01 2000-04-08
 5:         <NA>       <NA> 2000-05-01 2000-05-08
 6:   2000-06-06 2000-06-05 2000-06-01 2000-06-08
 7:   2000-07-06 2000-07-05 2000-07-01 2000-07-08
 8:         <NA>       <NA> 2000-08-01 2000-08-08
 9:   2000-09-06 2000-09-05 2000-09-01 2000-09-08
10:   2000-10-06 2000-10-05 2000-10-01 2000-10-08

Final Step is to remove the column, and generate the empty vector of the non-matches.

# Removes the Date2 Column
join_df[, Date2:=NULL] 
# Generate list of responses that didn't join
setdiff(Response_Substitute$ResponseDate, join_df$ResponseDate) 

Does this work for your problem? Further reading: 1, 2

jacobsg
  • 121
  • 6