1

I am looking to look up the individual id in events_table and calculate the total_duration as the sum of the duration of all events prior to date.

The duration is the time between the date_start and date (table1), unless the event ended (i.e. has a date_end), in which case if date_end < date, duration = date_end - date_start.

In pseudo code:

IF (date>date_start) Then{
   IF(date_end < date & date_end != NA) Then{
       duration = date_end-date_start
   } else if (date_start < date) {
       duration = date - date_start
   }
}
Then sum all the durations separately for each "individual_id" and "date" combo

I am using data.tables as I have large tables (>1m rows).

My data looks a bit like this:

 table1 <- fread(
      "individual id | date       
       1             |  2019-01-02
       1             |  2019-01-03
       2             |  2019-01-02
       2             |  2019-01-03", 
      sep ="|"
    )
    events_table<- fread(
      "individual id | date_start  | date_end
       1             |  2018-01-02 |   NA     
       1             |  2018-01-04 | 2018-07-01     
       1             |  2018-01-05 |   NA       
       2             |  2018-01-01 |   NA         
       2             |  2018-01-02 |   NA           
       2             |  2018-01-05 | 2018-11-21",
      sep = "|"
    )

The output should be the following:

 table1 <- fread(
          "individual id | date         | total_duration
           1             |  2019-01-02  |    905
           1             |  2019-01-03  |    907
           2             |  2019-01-02  |    1051
           2             |  2019-01-03  |    1053", 
          sep ="|"
        )

My best guess at starting the query comes from:

table1[, total_duration:= events_table[table1, 
                              on = .(`individual id`, date>date_start), 
                              sum(date-date_start),
                              by = .EACHI][["V1"]]]

But I dont know the syntax for including the if condition.

Thanks for any help.

Laurence_jj
  • 646
  • 1
  • 10
  • 23

1 Answers1

1
# formatting
table1[, date := as.IDate(date)]
events_table[, `:=`(date_start = as.IDate(date_start), date_end = as.IDate(date_end))]

# list max dur
events_table[, dur := date_end - date_start]

# add up completed events
table1[, v1 := 
  events_table[.SD, on=.(`individual id`, date_end <= date), sum(x.dur, na.rm = TRUE), by=.EACHI]$V1
]

# add on incomplete events
table1[, v2 := 
  events_table[!is.na(date_end)][.SD, on=.(`individual id`, date_start <= date, date_end > date), sum(i.date - x.date_start, na.rm = TRUE), by=.EACHI]$V1
]

# add on ill-defined events
table1[, v3 := 
  events_table[is.na(date_end)][.SD, on=.(`individual id`, date_start <= date), sum(i.date - x.date_start, na.rm = TRUE), by=.EACHI]$V1
]

table1[, v := v1 + v2 + v3]

   individual id       date total_duration  v1 v2  v3    v
1:             1 2019-01-02            905 178  0 727  905
2:             1 2019-01-03            907 178  0 729  907
3:             2 2019-01-02           1051 320  0 731 1051
4:             2 2019-01-03           1053 320  0 733 1053

You don't have to define three distinct columns, though it is easier for debugging. Instead, you could initialize table1[, v := 0] and for each step do table1[, v := v + ...].

Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    it looks exactly what I was hoping for, I will test it tomorrow, many thanks @frank – Laurence_jj Apr 17 '19 at 18:34
  • I am getting the error that the command to generate the duration for incomplete events (v2) is resulting in too many rows. `Join results in 19190493 rows; more than 1027748 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again.` As we are adding an additional condition (`date_end > date`) shouldnt it result in fewer rows? – Laurence_jj Apr 18 '19 at 08:17
  • Is it because we are joining on date twice? @Frank – Laurence_jj Apr 18 '19 at 08:26
  • @Laurence_jj You could add the argument `allow.cartesian=TRUE` to force the calculation; see `?data.table`. It's happening because some rows of i = table1 are matching to many rows of x = events_table[!is.na(date_end)] according to those on= conditions. I'm confused if v2 gives the error while v3 does not, since, like you said, v2 uses narrower on= conditions... – Frank Apr 18 '19 at 08:39
  • I tried using allow.cartesian=TRUE and it ran, however v2 gives me only zeros, where the data clearly should result in `date-date_start` giving more that zero. – Laurence_jj Apr 18 '19 at 08:59
  • is there a way to sum if (so to take the `date_end > date` out of the on condition) and to sum if. Such as something like `table1[, v2 := events_table[!is.na(date_end)][.SD, on=.(individual id, date_start <= date), sum(ifelse(date_end > date, i.date - x.date_start, 0), na.rm = TRUE), by=.EACHI]$V1 ]` – Laurence_jj Apr 18 '19 at 09:01
  • hmmm, that also works for the example, but gives zeros for my data. I will investigate further – Laurence_jj Apr 18 '19 at 09:09
  • I had a typo and missed the x. off the `date_start`. I works great!! – Laurence_jj Apr 18 '19 at 09:15
  • Re some way to sum if, yeah, I think the calculation can be done other ways than I've shown here. I would lean towards `pmin(x.date_end, i.date) - x.date_start` since ifelse has some weird/annoying behavior, eg, https://stackoverflow.com/q/16275149 – Frank Apr 18 '19 at 15:33