Determine presence and extent of overlapping date ranges by ID number - two data frames

Question

I have two data frames as follows. They are of unequal length:

library(lubridate)

id <- c(1, 2, 2, 2, 2, 3, 4, 4, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9,
    10, 10, 10, 11, 11, 12, 13, 14, 15, 15, 5451396, 5451396, 5451396, 5451396, 5451396)
admDt <- ymd(c("2000-02-24", "2000-04-30", "2000-06-06", "2001-01-29", "2004-06-10", "2001-05-21",
           "2000-01-25", "2000-04-18", "2000-01-14", "1991-10-06", "1992-02-25", "2000-05-17",
           "2003-06-06", "2009-02-16", "2000-01-23", "2000-03-10", "2000-04-05", "2000-06-16",
           "2000-07-04", "2000-07-27", "2001-01-19", "2002-08-16", "2002-09-19", "2004-04-17",
           "2005-08-02", "2005-09-21", "2006-07-10", "2000-02-24", "2000-05-05", "2000-08-29",
           "2001-01-24", "2000-01-27", "2000-03-09", "2000-04-15", "2000-03-20", "2002-11-13",
           "2000-06-28", "2000-07-02", "2000-06-13", "1999-12-27", "2008-09-10", "2000-04-09",
           "2000-06-01", "2002-11-25", "2006-08-04", "2006-10-07"))
sepDt <- ymd(c("2000-02-25", "2000-05-25", "2000-06-06", "2001-02-15", "2004-07-12", "2001-06-01",
           "2000-01-31", "2000-04-20", "2000-01-31", "1991-11-07", "1992-03-26", "2000-05-31",
           "2003-06-17", "2009-02-23", "2000-03-06", "2000-03-17", "2000-04-06", "2000-06-28",
           "2000-07-17", "2000-07-31", "2002-04-19", "2002-09-11", "2003-05-06", "2004-05-03",
           "2005-08-31", "2006-05-29", "2009-06-19", "2000-03-09", "2000-05-06", "2000-09-12",
           "2001-01-24", "2000-02-15", "2000-03-17", "2000-04-16", "2000-04-20", "2002-12-05",
           "2000-07-27", "2000-08-15", "2000-06-22", "2000-02-12", "2008-09-17", "2000-05-26",
           "2000-08-29", "2003-02-24", "2006-09-22", "2006-11-10"))
adm <- data.frame(id, admDt, sepDt)

id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 5451396)
birthDt <- ymd(c("1971-07-22", "1982-08-09", "1976-01-30", "1972-02-03", "1958-05-26", "1979-05-24",
             "1971-11-03", "1980-02-05", "1978-06-08", "1969-10-14", "1962-01-01", "1977-03-09",
             "1952-01-24", "1974-12-16", "1956-05-05", "1963-07-16"))
dxDt <- ymd(c("2000-02-24", "2000-04-30", "2000-03-03", "2000-01-31", "2000-06-20", "2000-12-13",
          "2000-05-14", "2000-01-23", "2000-03-09", "2000-02-15", "2000-05-01", "2000-06-30",
          "2000-08-15", "2000-06-22", "2000-01-27", "2000-06-01"))
admPreDx <- c("No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "Yes","Yes", "Yes", "Yes",
          "Yes", "Yes", "Yes")
admPreDxNbr <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
admPreDxDur <- c(0, 0, 0, 6, 0, 0, 0, 0, 14, 19, 20, 2, 31, 9, 31, 25)
admPostDx <- c("Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "No",
           "No", "Yes", "Yes")
admPostDxNbr <- c(1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 3)
admPostDxDur <- c(1, 25, 0, 0, 0, 0, 14, 31, 0, 6, 0, 27, 0, 0, 16, 31)
admDx <- data.frame(id, birthDt, dxDt, admPreDx, admPreDxNbr, admPreDxDur, admPostDx, admPostDxNbr,
                admPostDxDur)


> head(adm)
  id      admDt      sepDt
1  1 2000-02-24 2000-02-25
2  2 2000-04-30 2000-05-25
3  2 2000-06-06 2000-06-06
4  2 2001-01-29 2001-02-15
5  2 2004-06-10 2004-07-12
6  3 2001-05-21 2001-06-01

> head(admDx)
  id    birthDt       dxDt admPreDx admPreDxNbr admPreDxDur admPostDx admPostDxNbr admPostDxDur
1  1 1971-07-22 2000-02-24       No           0           0       Yes            1            1
2  2 1982-08-09 2000-04-30       No           0           0       Yes            1           25
3  3 1976-01-30 2000-03-03       No           0           0        No            0            0
4  4 1972-02-03 2000-01-31      Yes           1           6        No            0            0
5  5 1958-05-26 2000-06-20       No           0           0        No            0            0
6  6 1979-05-24 2000-12-13       No           0           0        No            0            0

The actual datasets range from 10,000 to 1,000,000+ rows.

Each row in adm refers to a discrete hospital admission. Note: id is the patient's ID number, while admDt and sepDt refer to the admission and discharge dates, respectively. Some patient's have multiple admissions.

Each row in admDx refers to a single patient: id is the patient's ID number (consistent with that provided in adm), while birthDt and dxDt are the patient's birth and diagnosis dates, respectively.

I am conducting some longitudinal / time series analyses and would like to determine whether patients were or were not hospitalised during different time periods pre- and post- diagnosis. For the sake of brevity, this question is concerned with the one month before and after diagnosis. Ideally, I would like to:

Create a dichotomous variable ("Yes" / "No"), indicating whether a given patient spent time in hospital during the time period (i.e., I am not concerned if they were admitted before the onset of the time period or if they were discharged after the offset of the time period)
Calculate the number of times each patient was in hospital during the time period
Calculate the duration (number of days) each patient was in hospital during the time period

I have reviewed a number of posts over several days (e.g., R Time periods overlapping, Join dataframes by id and overlapping date range, how to show an event happened between two dates in R); however, none of them seem to combine the three aspects I am interested in (calculating time between overlapping dates; multiple data frames; by "group" [or individual]).

I am new to R and have had little experience with loops and more advanced formulae. It seems that it may be possible to use foverlaps, lubridate, or %overlaps% from the "DescTools" package; however, I am unsure how to construct the relevant formulae.

Any assistance would be greatly appreciated!

EDIT #1:

While @sirallen's suggestions worked for the specific time period in the example provided, sum(pmin(dxDt, sepDt) - pmax(admDt, dxDt)), by = "id" returned inaccurate values in my real dataset (e.g., patients with multiple admissions of duration one day ["2000-01-25" - "2000-01-26"] reportedly spent zero days in hospital. This seems strange to me, since the code seems to be be used to answer similar examples. Does this issue relate to the fact that I have several overlapping date ranges for those patients? Further, as noted by @sirallen, the code did not highlight when patients had one or more than one admission during the time period.

The code below provided a more direct route to the first two parts of my question, by determining a) whether a patient spent time in hospital and b) the number of admissions:

library(data.table)
setDT(adm)
setDT(admDx)[, (4:9) := NULL]

#Period bounds
admDx[, `:=`(dxDtN1 = dxDt %m-% months(1), dxDtP1 = dxDt %m+% months(1))]

#Hospitalised in the month preceding diagnosis
admDx <- adm[admDx, on = .(id, admDt < dxDt, sepDt > dxDtN1), .N, by = .EACHI]
admDx[, `:=` (admPreDx = factor(ifelse(N > 0, "Yes", "No")))]

However, the pmin / pmax code still does not work, returning negative values:

admDx[, `:=` (birthDt = birthDt, dxDt = dxDt, dxDtN1 = dxDt %m-% months(1), dxDtP1 = dxDt %m+% months(1))]
admDx[, `:=` (admPreDxDur=as.numeric(sum(pmin(dxDt, adm$sepDt) - pmax(dxDtN1, adm$admDt)))), by = "id"]
admDx <- select(admDx, admPreDx, N, admPreDxDur)


> head(admDx)
   admPreDx N admPreDxDur
1:       No 0      -28573
2:       No 0      -27160
3:       No 0      -28366
4:      Yes 1      -29357
5:       No 0      -26701
6:       No 0      -28044

EDIT #2

After testing additional cases, it seems that the issue re: pmin / pmax may relate to the use of > vs >=: when > is used, the correct Dur value is returned; however, when >= is used, Dur returns a value of 0.

How might this code be adapted to enable me to calculate the number of days up to, and including, the diagnosis date?

Instead of describing your expected output, show it (make sure it reflects your examples) — Sotos, Dec 22 '16 at 09:45
@Sotos Thanks for the feedback. I have included the expected output in the second data frame — C_H, Dec 22 '16 at 10:38
If you want to include the diagnosis date, just add `+ (dxDt <= sepDt)` for preDx (`+ (dxDt >= admDt)` for postDx) — sirallen, Jan 02 '17 at 07:19

sirallen · Accepted Answer · 2017-01-02T07:11:11.590

We can do this with non-equi joins in data.table (>=v1.9.8):

library(data.table)
setDT(adm)
setDT(admDx)[, (4:9):= NULL]

# period bounds
admDx[, `:=`(dxDtLo=dxDt-31, dxDtHi=dxDt+31)]

# hospitalized pre-dxnosis?
admDx = adm[, .(id, admDt, sepDt, dxDt=admDt, dxDtLo=sepDt)][admDx,
  on=.(id, dxDt < dxDt, dxDtLo > dxDtLo)]
admDx[, admPreDx:= as.numeric(!is.na(admDt))]
admDx[, `:=`(admPreDxNbr=sum(admPreDx), admPreDxDur=as.numeric(
  sum(pmin(dxDt,sepDt) - pmax(admDt,dxDtLo)))), by='id']
admDx[, c('admDt','sepDt'):= NULL]

# hospitalized post-dxnosis?
admDx = adm[, .(id, admDt, sepDt, dxDtHi=admDt, dxDt=sepDt)][admDx,
  on=.(id, dxDtHi < dxDtHi, dxDt > dxDt)]
admDx[, admPostDx:= as.numeric(!is.na(admDt))]
admDx[, `:=`(admPostDxNbr=sum(admPostDx), admPostDxDur=as.numeric(
  sum(pmin(sepDt,dxDtHi) - pmax(dxDt,admDt)))), by='id']
admDx[, c('admDt','sepDt'):= NULL]

admDx[is.na(admDx)] = 0
admDx = unique(admDx)[, c('dxDtLo','dxDtHi'):= NULL]

> admDx
#          id       dxDt    birthDt admPreDx admPreDxNbr admPreDxDur admPostDx admPostDxNbr admPostDxDur
#  1:       1 2000-02-24 1971-07-22        0           0           0         1            1            1
#  2:       2 2000-04-30 1982-08-09        0           0           0         1            1           25
#  3:       3 2000-03-03 1976-01-30        0           0           0         0            0            0
#  4:       4 2000-01-31 1972-02-03        1           1           6         0            0            0
#  5:       5 2000-06-20 1958-05-26        0           0           0         0            0            0
#  6:       6 2000-12-13 1979-05-24        0           0           0         0            0            0
#  7:       7 2000-05-14 1971-11-03        0           0           0         1            1           14
#  8:       8 2000-01-23 1980-02-05        0           0           0         1            1           31
#  9:       9 2000-03-09 1978-06-08        1           1          14         0            0            0
# 10:      10 2000-02-15 1969-10-14        1           1          19         1            1            8
# 11:      11 2000-05-01 1962-01-01        1           1          20         0            0            0
# 12:      12 2000-06-30 1977-03-09        1           1           2         1            1           27
# 13:      13 2000-08-15 1952-01-24        1           1          31         0            0            0
# 14:      14 2000-06-22 1974-12-16        1           1           9         0            0            0
# 15:      15 2000-01-27 1956-05-05        1           1          31         1            1           16
# 16: 5451396 2000-06-01 1963-07-16        1           1          25         1            1           31

Thanks for your help. I have had some difficulty with the pmin / pmax aspect of the code. What impact would the unique filter have on the Nbr and Dur columns? — C_H, Jan 02 '17 at 02:17
Can you update your sample data w an id for which the above code fails? thanks. I think my code will generate duplicate rows when there are multiple admissions; `unique` will get rid of those — sirallen, Jan 02 '17 at 03:11
I added an additional case to the dataset. My apologies - once I added the case, I realised I obtained different results depending on whether the start / end dates of the time periods (e.g., admDt, dxDt) overlapped. I do not understand why that subtle different would impact pmin / pmax; the two aspects of the code seem to be independent... — C_H, Jan 02 '17 at 05:21
I modified the answer to correctly account for multiple admissions. (Turns out the single-operation join/assign did not work as I expected and only picked up the first hospital stay -- see http://stackoverflow.com/questions/41421640/assignment-with-data-table-join-operation-multiple-matches/.) As for the case you added, it looks like there is still only one stay within the 1-month period, so `admPostDxDur` should be 31... ? — sirallen, Jan 02 '17 at 06:58
The join is more convoluted now bc of the way names/values are inherited in the result -- see https://github.com/Rdatatable/data.table/issues/1807. So `dxDt < dxDt, dxDtLo > dxDtLo` actually means `admDt < dxDt, sepDt > dxDtLo` — sirallen, Jan 02 '17 at 07:02

Determine presence and extent of overlapping date ranges by ID number - two data frames

1 Answers1