data.table merge based on date ranges

Question

I have two tables, policies and claims

policies<-data.table(policyNumber=c(123,123,124,125), 
                EFDT=as.Date(c("2012-1-1","2013-1-1","2013-1-1","2013-2-1")), 
                EXDT=as.Date(c("2013-1-1","2014-1-1","2014-1-1","2014-2-1")))
> policies
   policyNumber       EFDT       EXDT
1:          123 2012-01-01 2013-01-01
2:          123 2013-01-01 2014-01-01
3:          124 2013-01-01 2014-01-01
4:          125 2013-02-01 2014-02-01


claims<-data.table(claimNumber=c(1,2,3,4), 
                   policyNumber=c(123,123,123,124),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31")),
                   claimAmount=c(10,20,20,15))
> claims
   claimNumber policyNumber   lossDate claimAmount
1:           1          123 2012-02-01          10
2:           2          123 2012-08-15          20
3:           3          123 2013-01-01          20
4:           4          124 2013-10-31          15

The policy table really contains policy-terms, since each row is uniquely identified by a policy number along with an effective date.

I want to merge the two tables in a way that associates claims with policy-terms. A claim is associated with a policy term if it has the same policy number and the lossDate of the claim falls within the effective date and expiration date of the policy-term (effective dates are inclusive bounds and expiration dates are exclusive bounds.) How do I merge the tables in this way?

This should be similar to a left outer join. The result should look like

   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount
1:          123 2012-01-01 2013-01-01           1 2012-02-01          10
2:          123 2012-01-01 2013-01-01           2 2012-08-15          20
3:          123 2013-01-01 2014-01-01           3 2013-01-01          20
4:          124 2013-01-01 2014-01-01           4 2013-10-31          15
5:          125 2013-02-01 2014-02-01          NA       <NA>          NA

possible duplicate of [merging endpoints of a range with a sequence](http://stackoverflow.com/questions/17597508/merging-endpoints-of-a-range-with-a-sequence) — eddi, Feb 04 '14 at 18:37
I've had trouble with using roll when what you really want to do is match a range. If you have trouble getting the results you expect one approach would be to transform the range into a unique row for each possible value. An example would be this question which was, oddly enough, answered by eddi. http://stackoverflow.com/questions/16423817/create-dummy-variables-in-one-table-based-on-range-of-dates-in-another-table — Dean MacGregor, Feb 04 '14 at 19:31
@DeanMacGregor that's from before I learned how to use `roll` :) You can see comments there for another SO post where a very similar problem is solved using `roll`. — eddi, Feb 04 '14 at 19:39
@eddi I know `roll` should work for these problems but the "old way" seems more robust to data overlapping in weird ways. I've had other problems where `roll` should work but I think maybe my data sets had weird overlaps or for some other reason it just didn't give the results I expected. Long story short, certainly make `roll` the first attempt but if you're beating your head against the wall because something doesn't match up right then maybe a pre-transform will be a work-around. — Dean MacGregor, Feb 04 '14 at 19:57
@eddi `roll` appears to remove my lossDate column. Do you know how I can keep that column in the result? — Ben, Feb 04 '14 at 20:30

score 13 · Accepted Answer · edited Oct 28 '14 at 22:06

Version 1 (updated for data.table v1.9.4+)

Try this:

# Policies table; I've added policyNumber 126:
policies<-data.table(policyNumber=c(123,123,124,125,126), 
                     EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")), 
                     EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))

# Claims table; I've added two claims for 126 that are before and after the policy dates:
claims<-data.table(claimNumber=c(1,2,3,4,5,6), 
                   policyNumber=c(123,123,123,124,126,126),
                   lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
                   claimAmount=c(10,20,20,15,5,25))

# Set the keys for policies and claims so we can join them:
setkey(policies,policyNumber,EFDT)
setkey(claims,policyNumber,lossDate)

# Join the tables using roll
# ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL] ## This worked with earlier versions of data.table, but broke when they updated the by-without-by behavior...
ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]

# The claim should have inPolicy==T where lossDate is between EFDT and EXDT:
ans[lossDate>=EFDT & lossDate<=EXDT, inPolicy:=T]

# Set the keys again, but this time we'll join on both dates:
setkey(ans,policyNumber,EFDT,EXDT)
setkey(policies,policyNumber,EFDT,EXDT)

# Union the ans table with policies that don't have any claims:
ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))

ans
#   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
#1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
#2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
#3:          123 2013-01-01 2014-01-01           3 2013-01-01          20     TRUE
#4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
#5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
#6:          126 2013-02-01 2014-02-01           6 2014-03-01          25    FALSE
#7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

Version 2

@Arun suggested using the new foverlaps function from data.table. My attempt below seems harder, not easier, so please let me know how to improve it.

## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
claims[, lossDate2:=lossDate]  ## Add a redundant lossDate column to use as the end range for claims
setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)

## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]

## Update rows where the claim was out of policy:
ans2[is.na(EFDT), inPolicy:=F]

## Remove duplicates (such as policyNumber==123 & claimNumber==3),
##   and add policies with no claims (policyNumber==125):
setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
ans2 <- rbindlist(list(
  unique(ans2), ## select only the unique values
  policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
), fill=T)

ans2
##    policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
## 1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
## 2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
## 3:          123 2012-01-01 2013-01-01           3 2013-01-01          20     TRUE
## 4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
## 5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
## 6:          126       <NA>       <NA>           6 2014-03-01          25    FALSE
## 7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA

Version 3

Using foverlaps(), another version:

require(data.table) ## 1.9.4+
setDT(claims)[, lossDate2 := lossDate]
setDT(policies)[, EXDTclosed := EXDT-1L]
setkey(claims, policyNumber, lossDate, lossDate2)
foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))

foverlaps() requires both start and end ranges/intervals. Therefore, we duplicate lossDate column on to lossDate2.

Since EXDT needs to be open interval, we subtract one from it, and place it in a new column EXDTclosed.

Now, we set the key. foverlaps() requires the last two key columns to be intervals. So they're specified last. And we also want overlapping join to first match by policyNumber. Hence, it's also specified in the key.

We need to set key on claims (check ?foverlaps). We don't have to set key on policies. But you can if you wish (then you can skip by.x argument as it by default takes the key value). Since we don't set the key for policies here, we'll specify explicitly the corresponding columns in by.x argument. The overlap type by default is any, which we don't have to change (and therefore not specified). This results in:

#    policyNumber claimNumber   lossDate claimAmount  lossDate2       EFDT       EXDT EXDTclosed
# 1:          123           1 2012-02-01          10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
# 2:          123           2 2012-08-15          20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
# 3:          123           3 2013-01-01          20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
# 4:          124           4 2013-10-31          15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
# 5:          125          NA       <NA>          NA       <NA> 2013-02-01 2014-02-01 2014-01-31

Thanks for the response. It's throwing an error the first time it's setting `ans`. object 'EFDT' not found — Ben, Feb 04 '14 at 21:07
Hmm... Wonder if the `data.table` version makes a difference (I'm using 1.8.11)? As a workaround, see what happens if you take off the last little bit (`[,EFDT:=NULL]`) from that line of code. — dnlbrky, Feb 04 '14 at 21:13
@dnkbrky, yes, in 1.8.11, the key columns are also visible in `j` during a by-without-by. It's more or less related to [**this FR**](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978). — Arun, Feb 05 '14 at 01:05
@BenGorman, what are the resulting column names when you run `policies[claims]`? — dnlbrky, Feb 05 '14 at 13:50
I've updated the answer to work with more recent versions of data.table (I'm using 1.9.4 from CRAN). See the data.table change log for more info about the [by-without-by](https://github.com/Rdatatable/data.table#changes-in-v194--on-cran-2-oct-2014) (which I have never been able to understand). — dnlbrky, Oct 28 '14 at 14:28
+1. @dnlbrky, this could be made much simpler using `foverlaps()` function from 1.9.4+. Would you like to give it a try? — Arun, Oct 28 '14 at 14:32
Thanks @Arun. I gave `foverlaps` a shot and updated the answer, but I must be missing something since it seems harder. — dnlbrky, Oct 28 '14 at 21:42
@dnlbrky, sure. I've updated with another answer... What do you think? — Arun, Oct 28 '14 at 22:07
Thanks @Arun. I think this answers the question as asked. The OP didn't include this, but I would want to see all policy holders and all claims in the result. So I would want to see that policy 126 had two claims that were out of policy. Also, the nifty `EXDTclosed:=EXDT-1L` bit removes the duplicates for policy 123 in this case, but what if the OP had wanted inclusive boundaries? How would you handle cases where the loss date is equal to the EXDT from one policy period and the EFDT from the next? — dnlbrky, Oct 28 '14 at 22:24
I'll answer in order. 1) I'd first do a join on `policies` to get all policyNumbers, and then do `foverlaps()`. 2) If you want inclusive boundaries, you don't have to subtract 1L. Use `EXDT` as such. 3) This one I don't understand. Could you maybe show a case where the current solution breaks? — Arun, Oct 28 '14 at 22:39

score 3 · Answer 2 · edited Oct 23 '14 at 00:24

I think this does mostly what you want. I need to run so don't have time to add the policy with no claims and clean the columns up, but I think the difficult issues are addressed:

setkey(policies, policyNumber, EXDT)
policies[, EXDT2:=EXDT]
policies[claims[, list( policyNumber, lossDate, lossDate, claimNumber, claimAmount)], roll=-Inf]
#    policyNumber       EXDT       EFDT      EXDT2   lossDate claimNumber claimAmount
# 1:          123 2012-02-01 2012-01-01 2013-01-01 2012-02-01           1          10
# 2:          123 2012-08-15 2012-01-01 2013-01-01 2012-08-15           2          20
# 3:          123 2013-01-01 2012-01-01 2013-01-01 2013-01-01           3          20
# 4:          124 2013-10-31 2013-01-01 2014-01-01 2013-10-31           4          15

Also, note it is trivial to remove/highlight claims outside of policy dates from this result.

data.table merge based on date ranges

2 Answers2

Linked