I'm trying to join one database of periods like this one:
id = c(rep(1,3), rep(2,3), rep(3,3))
start = as.Date(c("2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13"))
end = as.Date(c("2015-03-11", "2015-08-12", "2018-12-31", "2015-03-11", "2015-08-12", "2018-12-31","2015-03-11", "2015-08-12", "2018-12-31"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2014-07-01 2015-03-11
2: 1 2015-03-12 2015-08-12
3: 1 2016-08-13 2018-12-31
4: 2 2014-07-01 2015-03-11
5: 2 2015-03-12 2015-08-12
6: 2 2016-08-13 2018-12-31
7: 3 2014-07-01 2015-03-11
8: 3 2015-03-12 2015-08-12
9: 3 2016-08-13 2018-12-31
with one that has clinical registries (weight and height) like this one:
id_clin = (c(rep(1,2), rep (2,3), rep(3,4)))
date = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01", "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01", "2018-11-30"))
weight = c(60, 65, 62, 75, 68, 90 , 102, 104 , 98 )
height = c(160, 160, 170, 175, 170, 200, 200, 200 ,200)
DT_clin = data.table(id_clin, date, weight, height)
DT_clin
id_clin date weight height
1: 1 2014-10-23 60 160
2: 1 2016-09-01 65 160
3: 2 2017-01-01 62 170
4: 2 2014-08-01 75 175
5: 2 2015-02-01 68 170
6: 3 2017-06-01 90 200
7: 3 2018-03-05 102 200
8: 3 2018-09-01 104 200
9: 3 2018-11-30 98 200
- When a registry of a clinical mesure (DT_clin) of an id is between the start and the end of a period (DT) of the same id, the values of the registry must be joined.
- If there is no value in DT_clin between the periods of DT, nothing has to be joined.
- If there is more than a value between the DT periods, I want to calculate the mean of the values that overlap.
The desired outcome would look something like this*:
id start end date date2 weight height
1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
2: 1 2015-03-12 2015-08-12 <NA> <NA> NA NA
3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
5: 2 2015-03-12 2015-08-12 <NA> <NA> NA NA
6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
7: 3 2014-07-01 2015-03-11 <NA> <NA> NA NA
8: 3 2015-03-12 2015-08-12 <NA> <NA> NA NA
9: 3 2016-08-13 2018-12-31 2018-03-05 2018-11-30 101.3 200.0
Also, if there is a way to make more than one operation for the different variables I will also be interested to know a way. (ex. calculate the mean of weight and the max value of height at the same time i make the join)
I've tested foverlaps with good results when there is only one value, but I'm unable to acomplish my objective when there is more than one value that overlap:
setkey(DT, id, start, end)
setkey(DT_clin, id_clin, date, date2)
foverlaps(DT[id == "1", ], DT_clin[id == "1",], by.x =c("id", "start", "end") , by.y = c("id_clin", "date", "date2" ), nomatch = NA )
Should I use Non-Equi-Joins?
*I duplicated date to create date2 and faked an interval of time