2

I'm trying to join one database of periods like this one:

id = c(rep(1,3), rep(2,3), rep(3,3))
start = as.Date(c("2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13"))
end = as.Date(c("2015-03-11", "2015-08-12", "2018-12-31", "2015-03-11", "2015-08-12", "2018-12-31","2015-03-11", "2015-08-12", "2018-12-31"))

DT = data.table(id, start, end)

DT

   id      start        end
1:  1 2014-07-01 2015-03-11
2:  1 2015-03-12 2015-08-12
3:  1 2016-08-13 2018-12-31
4:  2 2014-07-01 2015-03-11
5:  2 2015-03-12 2015-08-12
6:  2 2016-08-13 2018-12-31
7:  3 2014-07-01 2015-03-11
8:  3 2015-03-12 2015-08-12
9:  3 2016-08-13 2018-12-31

with one that has clinical registries (weight and height) like this one:

id_clin = (c(rep(1,2), rep (2,3), rep(3,4)))
date = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01", "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01", "2018-11-30"))
weight = c(60, 65, 62, 75, 68, 90 , 102, 104 , 98 )
height = c(160, 160, 170, 175, 170, 200, 200, 200 ,200)

DT_clin = data.table(id_clin, date, weight, height)

DT_clin

   id_clin       date weight height
1:       1 2014-10-23     60    160
2:       1 2016-09-01     65    160
3:       2 2017-01-01     62    170
4:       2 2014-08-01     75    175
5:       2 2015-02-01     68    170
6:       3 2017-06-01     90    200
7:       3 2018-03-05    102    200
8:       3 2018-09-01    104    200
9:       3 2018-11-30     98    200
  • When a registry of a clinical mesure (DT_clin) of an id is between the start and the end of a period (DT) of the same id, the values of the registry must be joined.
  • If there is no value in DT_clin between the periods of DT, nothing has to be joined.
  • If there is more than a value between the DT periods, I want to calculate the mean of the values that overlap.

The desired outcome would look something like this*:

   id      start        end       date       date2       weight       height
1:  1 2014-07-01 2015-03-11 2014-10-23  2014-10-23         60.0        160.0
2:  1 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
3:  1 2016-08-13 2018-12-31 2016-09-01  2016-09-01         65.0        160.0
4:  2 2014-07-01 2015-03-11 2014-08-01  2015-02-01         71.5        172.5
5:  2 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
6:  2 2016-08-13 2018-12-31 2017-01-01  2017-01-01         62.0        170.0
7:  3 2014-07-01 2015-03-11       <NA>        <NA>           NA           NA
8:  3 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
9:  3 2016-08-13 2018-12-31 2018-03-05  2018-11-30        101.3        200.0

Also, if there is a way to make more than one operation for the different variables I will also be interested to know a way. (ex. calculate the mean of weight and the max value of height at the same time i make the join)

I've tested foverlaps with good results when there is only one value, but I'm unable to acomplish my objective when there is more than one value that overlap:

setkey(DT, id, start, end)
setkey(DT_clin, id_clin, date, date2)

foverlaps(DT[id == "1", ], DT_clin[id == "1",], by.x =c("id", "start", "end") , by.y = c("id_clin", "date", "date2" ), nomatch = NA )

Should I use Non-Equi-Joins?

*I duplicated date to create date2 and faked an interval of time

Sunderam Dubey
  • 1
  • 11
  • 20
  • 40

2 Answers2

1

using a non-equi join, and then summarise by id, start & end

ans <- DT_clin[DT, on = .(date >= start, date <= end, id_clin = id)]
ans[, .(date   = min(date2),
        date2  = max(date2),
        weight = mean(weight),
        height = mean(height)), 
    by = .(id = id_clin, start = date, end = date.1)]

#    id      start        end       date      date2 weight height
# 1:  1 2014-07-01 2015-03-11 2014-10-23 2014-10-23   60.0  160.0
# 2:  1 2015-03-12 2015-08-12       <NA>       <NA>     NA     NA
# 3:  1 2016-08-13 2018-12-31 2016-09-01 2016-09-01   65.0  160.0
# 4:  2 2014-07-01 2015-03-11 2014-08-01 2015-02-01   71.5  172.5
# 5:  2 2015-03-12 2015-08-12       <NA>       <NA>     NA     NA
# 6:  2 2016-08-13 2018-12-31 2017-01-01 2017-01-01   62.0  170.0
# 7:  3 2014-07-01 2015-03-11       <NA>       <NA>     NA     NA
# 8:  3 2015-03-12 2015-08-12       <NA>       <NA>     NA     NA
# 9:  3 2016-08-13 2018-12-31 2017-06-01 2018-11-30   98.5  200.0
Wimpel
  • 26,031
  • 1
  • 20
  • 37
1

With foverlaps:

library(data.table)
setkey(DT_clin, id_clin, date, date2)

foverlaps(DT, DT_clin, 
          by.x =c("id", "start", "end"), 
          by.y = c("id_clin", "date", "date2" ), nomatch = NA )[
          ,.(datemin = min(date),
             datemax = max(date),
             weight  = mean(weight,na.r=T),
             height  = mean(height,na.rm=T)),
           by=.(id,start,end)]

   id      start        end    datemin    datemax weight height
1:  1 2014-07-01 2015-03-11 2014-10-23 2014-10-23   60.0  160.0
2:  1 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
3:  1 2016-08-13 2018-12-31 2016-09-01 2016-09-01   65.0  160.0
4:  2 2014-07-01 2015-03-11 2014-08-01 2015-02-01   71.5  172.5
5:  2 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
6:  2 2016-08-13 2018-12-31 2017-01-01 2017-01-01   62.0  170.0
7:  3 2014-07-01 2015-03-11       <NA>       <NA>    NaN    NaN
8:  3 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
9:  3 2016-08-13 2018-12-31 2017-06-01 2018-11-30   98.5  200.0
Waldi
  • 39,242
  • 6
  • 30
  • 78