Efficiently joining two data.tables in R (or SQL tables) on date ranges?

Question

I'm using hospital admissions data, and trying to tie together a table of observations:

dt_taken	patient_id	observation	value
2020-04-13 00:00:00	patient01	"Heart rate"	69
...

... with a table of admissions:

patient_id	admission_id	startdate	enddate
patient01	admission01	2020-04-01 00:04:20	2020-05-01 00:23:59
...

... such that it returns a list of observations tied to admissions, and rejects all those that were not made during an admission (eroneously recorded, or taken in outpatient visits etc):

dt_taken	admission_id	observation	value
2020-04-13 00:00:00	admission01	"Heart rate"	69
...

My relatively simplistic approach to-date has been to iterate over each patient, then each patient's admission, then each observation and assign it to that admission, but given I have >36k admissions and >1million observations this is incredibly time-consuming (and my government-issued laptop hates me for it).

Is there a more efficient way that I'm missing, either using {data.table} (I must confess to being an absolute newbie here, preferring to work in {tidyverse}) or even that I can run on the SQL server where the tables are stored to save my aging laptop?

Likely duplicate: [How to perform join over date ranges using data.table](https://stackoverflow.com/questions/27487949/how-to-perform-join-over-date-ranges-using-data-table) — Ian Campbell, May 13 '21 at 16:12
See here for some examples of doing this (a non-equi join) using `sqldf` and `data.table`. https://selbydavid.com/2021/02/13/joins/. `dplyr` doesn't currently allow these, requiring instead an inefficient full join + filter, but a related package `fuzzyjoin` helps add that ability. — Jon Spring, May 13 '21 at 16:32

r2evans · Accepted Answer · 2021-05-14T12:32:11.767

data.table

For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.

observations
#              dt_taken patient_id observation value
# 1 2020-04-13 00:00:00  patient01  Heart rate    69
admissions
#   patient_id admission_id           startdate             enddate
# 1  patient01  admission01 2020-04-01 00:04:20 2020-05-01 00:23:59

### convert to data.table
setDT(observations)
setDT(admissions)

### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]

And the join.

admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
#    patient_id admission_id  startdate    enddate observation value
#        <char>       <char>     <POSc>     <POSc>      <char> <int>
# 1:  patient01  admission01 2020-04-13 2020-04-13  Heart rate    69

Two things that I believe are noteworthy:

in SQL (and similarly in other join-friendly languages), it is often shown as
```
select ...
from TABLE1 left join TABLE2 ...
```
suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly gearing towards a left-join since that's all that data.table::[ supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.)

From this, data.table::['s mechanism is effectively
```
TABLE2[TABLE1, on = .(...)]
RHS[LHS, on = .(...)]
```
(Meaning that the right-hand-side table is actually the first table from left-to-right ...)

The names in the output of inequi-joins are preserved from the RHS, see that dt_taken is not found. However, the values of those startdate and enddate columns are from dt_taken.

Because of this, I've often found the simplest way for me to wrap my brain around the renaming and values and such is when I'm not certain, I copy a join column into a new column and join using that column, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times missing something and thinking it was not what I had thought.

sqldf

This might be a little more direct if SQL seems more intuitive.

sqldf::sqldf(
  "select ob.*, ad.admission_id
   from observations ob
     left join admissions ad on ob.patient_id=ad.patient_id
         and ob.dt_taken between ad.startdate and ad.enddate")
#     dt_taken patient_id observation value admission_id
# 1 2020-04-13  patient01  Heart rate    69  admission01

Data (already data.table with POSIXt, works just as well with sqldf though regular data.frames will work just fine, too):

admissions <- setDT(structure(list(patient_id = "patient01", admission_id = "admission01", startdate = structure(1585713860, class = c("POSIXct", "POSIXt" ), tzone = ""), enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""), patient_id = "patient01", observation = "Heart rate", value = 69L), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))

(I use setDT to repair the fact that we can't pass the .internal.selfref attribute here.)

Efficiently joining two data.tables in R (or SQL tables) on date ranges?

1 Answers1

data.table

sqldf