0

I'm trying to know which observations are active during specific episodes of time

My real objective is to know which Diagnosis (dx) are active during a Pregnancy period (9 months) by a patient. It has to be considered that a patient can have different number of pregnancies along her life and also a different number of diagnosis (the dx can or can't be active ).

I have tried foverlaps like here or here but weren't exactly what I was looking for. The real problem is this and it's well documented but not for R. capture

Here they make it work for SQL I think. So I hope it's solvable...

I also tried Non-Equi Joins like this but I'm not able to make it work the way I want...

Let's get into it:

I have one DB with patient (id) Diagnosis (dx), the time it was registered (InD_dx) and the time it ended (EndD_dx) like this:

id <- rep("a", 11)
InD_dx <- as.Date(c("2005-10-04","2005-10-06","2005-10-06", "2008-04-07", "2010-05-10", "2012-04-24", "2012-04-24", "2012-05-15", "2014-03-20", "2014-04-22", "2017-11-30"), format = "%Y-%m-%d")
EndD_dx <- as.Date(c("2020-12-31","2020-12-31","2020-12-31", "2008-11-05", "2011-01-17", "2015-07-20", "2013-01-01", "2015-07-20", "2015-12-04", "2020-12-31", "2020-12-31"), format = "%Y-%m-%d")
dx <- c("A", "B", "C", "P", "P", "D", "P", "E", "F", "G", "H")

DT = data.table(id,InD_dx,EndD_dx, dx)

    DT
    id   InD_dx    EndD_dx   dx
 1:  a 2005-10-04 2020-12-31  A
 2:  a 2005-10-06 2020-12-31  B
 3:  a 2005-10-06 2020-12-31  C
 4:  a 2008-04-07 2008-11-05  P
 5:  a 2010-05-10 2011-01-17  P
 6:  a 2012-04-24 2015-07-20  D
 7:  a 2012-04-24 2013-01-01  P
 8:  a 2012-05-15 2015-07-20  E
 9:  a 2014-03-20 2015-12-04  F
10:  a 2014-04-22 2020-12-31  G
11:  a 2017-11-30 2020-12-31  H

I selected the dx of pregnancy (P) and made a DB apart. And a DB with the dx different from P like this:

Pregnancies <- copy(DT[dx== "P"])
Pregnancies 

    id InD_dx     EndD_dx   dx
 1: a 2008-04-07 2008-11-05 P
 2: a 2010-05-10 2011-01-17 P
 3: a 2012-04-24 2013-01-01 P

Dx_Other_than_Pregnancies <- copy(DT[dx!= "P"])
Dx_Other_than_Pregnancies

   id InD_dx     EndD_dx    dx
 1: a 2005-10-04 2020-12-31 A
 2: a 2005-10-06 2020-12-31 B
 3: a 2005-10-06 2020-12-31 C
 4: a 2012-04-24 2015-07-20 D
 5: a 2012-05-15 2015-07-20 E
 6: a 2014-03-20 2015-12-04 F
 7: a 2014-04-22 2020-12-31 G
 8: a 2017-11-30 2020-12-31 H

I was expecting to merge them and group active diagnosis by pregnancy. Thats why I attemped with foverlap:

setkey(Dx_Other_than_Pregnancies, id, InD_dx, EndD_dx)

foverlaps(Pregnancies, Dx_Other_than_Pregnancies,
          by.x=c('id', 'InD_dx', 'EndD_dx'), type='within', mult= "all", nomatch = 0L)

and Non-Equi Joins:

   Dx_Other_than_Pregnancies[Pregnancies, on=.(id, dx_Ini<=dxIni , EndD_dx>=EndD_dx)]

and both end up like:

     id     InD_dx    EndD_dx dx   i.InD_dx  i.EndD_dx i.dx
 1:  a 2005-10-04 2020-12-31  A 2008-04-07 2008-11-05    P
 2:  a 2005-10-06 2020-12-31  B 2008-04-07 2008-11-05    P
 3:  a 2005-10-06 2020-12-31  C 2008-04-07 2008-11-05    P
 4:  a 2005-10-04 2020-12-31  A 2010-05-10 2011-01-17    P
 5:  a 2005-10-06 2020-12-31  B 2010-05-10 2011-01-17    P
 6:  a 2005-10-06 2020-12-31  C 2010-05-10 2011-01-17    P
 7:  a 2005-10-04 2020-12-31  A 2012-04-24 2013-01-01    P
 8:  a 2005-10-06 2020-12-31  B 2012-04-24 2013-01-01    P
 9:  a 2005-10-06 2020-12-31  C 2012-04-24 2013-01-01    P
10:  a 2012-04-24 2015-07-20  D 2012-04-24 2013-01-01    P

In the case with Non Equi Joins i.InD_dx and i.EndD_dx it's not outputed and EndD_dx becomes i.EndD_dx

Desired outcome

    id   InD_dx    EndD_dx   dx   i.InD_dx  i.EndD_dx i.dx
 1:  a 2005-10-04 2020-12-31  A 2008-04-07 2008-11-05    P
 2:  a 2005-10-06 2020-12-31  B 2008-04-07 2008-11-05    P
 3:  a 2005-10-06 2020-12-31  C 2008-04-07 2008-11-05    P
 4:  a 2005-10-04 2020-12-31  A 2010-05-10 2011-01-17    P
 5:  a 2005-10-06 2020-12-31  B 2010-05-10 2011-01-17    P
 6:  a 2005-10-06 2020-12-31  C 2010-05-10 2011-01-17    P
 7:  a 2005-10-04 2020-12-31  A 2012-04-24 2013-01-01    P
 8:  a 2005-10-06 2020-12-31  B 2012-04-24 2013-01-01    P
 9:  a 2005-10-06 2020-12-31  C 2012-04-24 2013-01-01    P
10:  a 2012-04-24 2015-07-20  D 2012-04-24 2013-01-01    P
11:  a 2012-05-15 2015-07-20  E 2012-04-24 2013-01-01    P

I dunno if I'm complicating things a little bit too much. Probably with a join 1:multiple and making differences between dates I would obtain what I want. This way it's efficient enough?

Any help would be appreciatated

Thanks in advance!

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213

1 Answers1

1

type = 'within' is excluding some partial overlaps you're looking for.

Try:

DT = data.table(id,InD_dx,EndD_dx, dx)

setkey(DT,id,InD_dx,EndD_dx)

foverlaps(DT[dx=='P'],DT[dx!='P'],
          by.x = c("id", "InD_dx", "EndD_dx"),
          by.y = c("id", "InD_dx", "EndD_dx"))

   id     InD_dx    EndD_dx dx   i.InD_dx  i.EndD_dx i.dx
 1:  a 2005-10-04 2020-12-31  A 2008-04-07 2008-11-05    P
 2:  a 2005-10-06 2020-12-31  B 2008-04-07 2008-11-05    P
 3:  a 2005-10-06 2020-12-31  C 2008-04-07 2008-11-05    P
 4:  a 2005-10-04 2020-12-31  A 2010-05-10 2011-01-17    P
 5:  a 2005-10-06 2020-12-31  B 2010-05-10 2011-01-17    P
 6:  a 2005-10-06 2020-12-31  C 2010-05-10 2011-01-17    P
 7:  a 2005-10-04 2020-12-31  A 2012-04-24 2013-01-01    P
 8:  a 2005-10-06 2020-12-31  B 2012-04-24 2013-01-01    P
 9:  a 2005-10-06 2020-12-31  C 2012-04-24 2013-01-01    P
10:  a 2012-04-24 2015-07-20  D 2012-04-24 2013-01-01    P
11:  a 2012-05-15 2015-07-20  E 2012-04-24 2013-01-01    P
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • also note that you can always specify whatever non-equi join you want using x[y] notation. foverlaps is really just a convenience function for x[y] non-equijoins, implemented in the exact style of a specific Bioconductor function – Michael Apr 02 '21 at 18:14
  • @Michael, overlaps can be found with x[y], but scenarios #1, #2 & #4 described in OP's question have to be tested, which makes `foverlaps` more practical. – Waldi Apr 05 '21 at 09:02