Join dataframes by id and overlapping date range

Question

I have two dataframes x and y that contain columns for ids and for dates.

id.x <- c(1, 2, 4, 5, 7, 8, 10)
date.x <- as.Date(c("2015-01-01", "2015-01-02", "2015-01-21", "2015-01-13", "2015-01-29", "2015-01-01", "2015-01-03"),format = "%Y-%m-%d")
x <- data.frame(id.x, date.x)
id.y <- c(1, 2, 3, 6, 7, 8, 9)
date.y <- as.Date(c("2015-01-03", "2015-01-29", "2015-01-22", "2015-01-13", "2015-01-29", "2014-12-31", "2015-01-03"), format = "%Y-%m-%d")
y <- data.frame(id.y, date.y)

I would like to join them into a new dataframe z by matching id and wether date.y occurs within date.x + 3 days, e.g. individual "1" had event "y" occur on date.y = "2015-01-03" which is within 3 days of event x on date.x = "2015-01-01".

Please stop using `cbind` to create data.frames.. There's a data.frame function available just for that. — Arun, Jul 24 '16 at 17:48
@Arun Noted. If anyone else was curious about the difference between `cbind.data.frame()` and `data.frame()` it is nicely summarized [here](https://docs.tibco.com/pub/enterprise-runtime-for-R/1.5.0_may_2013/TERR_1.5.0_LanguageRef/base/cbind.data.frame.html) — user6571411, Jul 24 '16 at 19:06
Use `check.names=FALSE` in the call to `data.frame()` (to get the behaviour of `cbind.data.frame`), which seems to be the only default difference. — Arun, Jul 24 '16 at 19:24

score 2 · Answer 1 · answered Jul 24 '16 at 17:56

Using the development version of data.table, v1.9.7, where non-equi (or conditional) joins was recently implemented, we can do this in a straightforward (and efficient) manner.. See installation instructions here.

require(data.table) # v1.9.7+
setDT(x)
setDT(y) ## convert both data.frames to data.tables by reference

x[, date.x.plus3 := date.x + 3L]
y[x, .(id.x, date.x, date.y=x.date.y), 
     on=.(id.y == id.x, date.y >= date.x, date.y <= date.x.plus3)]
#    id.x     date.x     date.y
# 1:    1 2015-01-01 2015-01-03
# 2:    2 2015-01-02       <NA>
# 3:    4 2015-01-21       <NA>
# 4:    5 2015-01-13       <NA>
# 5:    7 2015-01-29 2015-01-29
# 6:    8 2015-01-01       <NA>
# 7:   10 2015-01-03       <NA>

Solutions that join on a dummy column and then filter based on the conditions are generally not scalable (as the number of rows quickly explode), and solutions that loop through rows and run the filtering condition for each row are slow, well, because they perform the operation row-wise.

This solution does neither, i.e., performs the conditional join directly, and therefore should be performant both in terms of runtime and memory.

score 1 · Accepted Answer · edited May 23 '17 at 12:01

You can create an ifelse statement that creates a vector that is equal to date.x if date.y <= date.x + 3 and date.y >= date.x and equal to date.y otherwise. Then merge the two based on this vector:

id.x <- c(1, 2, 4, 5, 7, 8, 10)
date.x <- as.Date(c("2015-01-01", "2015-01-02", "2015-01-21", "2015-01-13", "2015-01-29", "2015-01-01", "2015-01-03"),format = "%Y-%m-%d")
x <- cbind.data.frame(id.x, date.x)
id.y <- c(1, 2, 3, 6, 7, 8, 9)
date.y <- as.Date(c("2015-01-03", "2015-01-29", "2015-01-22", "2015-01-13", "2015-01-29", "2014-12-31", "2015-01-03"), format = "%Y-%m-%d")
y <- cbind.data.frame(id.y, date.y)

safe.ifelse <- function(cond, yes, no) structure(ifelse(cond, yes, no), class = class(yes))

match <- safe.ifelse(date.y <= date.x+3 & date.y >= date.x, 
            match <- date.x,
            match <- date.y)

y$date.x <- match
names(y)[1] <- "id.x"

dplyr::left_join(x, y, by=c("id.x","date.x"))

  id.x     date.x     date.y
1    1 2015-01-01 2015-01-03
2    2 2015-01-02       <NA>
3    4 2015-01-21       <NA>
4    5 2015-01-13       <NA>
5    7 2015-01-29 2015-01-29
6    8 2015-01-01       <NA>
7   10 2015-01-03       <NA>

I borrowed the safe.ifelse function from this post because the base ifelse statement results in a numeric vector rather than a date vector.

This is unsafe and in general would lead to incorrect solutions. Your generation of `match` column doesn't take `id` column into account at all. — Arun, Jul 24 '16 at 18:18
I ran into the problem that Arun mentions. My work around was the following `temp <- merge(x, y, by.x = "id.x", by.y = "id.y", all = TRUE)` `temp.subset <- safe.ifelse(date.y <= date.x+3 & date.y >= date.x, TRUE, FALSE)` `joined.df <- temp[which(temp.subset == TRUE), ]` — user6571411, Jul 24 '16 at 18:55

Sathish · Answer 3 · 2016-07-24T05:38:49.003

Using inner join of y and x data tables by setting the keys to id of both datatables, and then checking for date conditions, and finally extract the true ones.

library("data.table")

x <- as.data.table(x)

y <- as.data.table(y)

setkey(x, id.x)

setkey(y, id.y)

z <- y[x, nomatch = 0][, j = .(is_true = ((date.y <= date.x + 3) & (date.y > date.x)), id.y, date.x, date.y)][i = is_true == TRUE]

> z
   is_true id.y     date.x     date.y
1:    TRUE    1 2015-01-01 2015-01-03

Join dataframes by id and overlapping date range

3 Answers3

Linked