
I need to identify instances where two events occur within a specific time window, as follows. If event A occurs first, event B must occur within 24 hours. Conversely, if B occurs first, then A must occur within 72 hours. When the criterion is met, I also need the "onset" time, i.e. the time at which the first of the two events occurred.
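
To make the rule concrete, here is a minimal sketch for a single pair of timestamps (the values are chosen purely for illustration):

a_time <- as.POSIXct("2018-01-04 11:00:00", tz = "UTC")  # event A
b_time <- as.POSIXct("2018-01-04 13:00:00", tz = "UTC")  # event B

# A occurs first, so B must fall within 24 hours of A;
# had B occurred first, the test would be A within 72 hours of B instead
is_match <- a_time <= b_time &&
  difftime(b_time, a_time, units = "hours") <= 24
is_match  # TRUE

onset <- min(a_time, b_time)  # onset = the earlier of the two events
onset  # "2018-01-04 11:00:00 UTC"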

Event A

structure(list(fake_id = c("1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1070640921", 
"1070640921", "1070640921", "1070640921", "1070640921", "1070640921", 
"1184695414", "1184695414", "1184695414", "1184695414", "1184695414"
), date = structure(c(1515063600, 1514822400, 1514822400, 1514822400, 
1514822400, 1515146400, 1514901600, 1515330000, 1514822400, 1514822400, 
1514822400, 1514822400, 1517385600, 1516701600, 1515142800, 1515178800, 
1515178800, 1516557600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-18L), class = "data.frame", .Names = c("fake_id", 
"date"))

Event B

structure(list(fake_id = c("1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1000686267", 
"1000686267", "1000686267", "1000686267", "1000686267", "1070640921", 
"1070640921", "1070640921", "1070640921", "1070640921", "1070640921", 
"1184695414", "1184695414", "1184695414", "1184695414", "1184695414", 
"1184695414", "1184695414"), date = structure(c(1516795200, 1516795200, 
1516795200, 1516917600, 1517400000, 1517400000, 1515492000, 1515492000, 
1516190400, 1516190400, 1517410800, 1517410800, 1516921200, 1515070800, 
1515070800, 1515052800, 1516633200, 1517374800, 1515322800, 1515322800, 
1516525200, 1515232800, 1516543200, 1516550400, 1515189600, 1516543200, 
1516543200, 1515142800, 1515142800, 1515142800), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")), row.names = c(NA, -30L), class = "data.frame", .Names = c("fake_id", 
"date"))

Some code


library(data.table)

# keep only the id and timestamp columns (assumes event_a / event_b are already data.tables)
event_a <- data.table(event_a[, c("fake_id", "date"), with = FALSE])
event_b <- data.table(event_b[, c("fake_id", "date"), with = FALSE])

# marker columns (dropped again further down)
event_a[, `:=`("criteria_a", "criteria_a")]
event_b[, `:=`("criteria_b", "criteria_b")]

setkeyv(event_a, c("fake_id", "date"))
setkeyv(event_b, c("fake_id", "date"))

# rolling-join windows: 24 h and 72 h, expressed in seconds
join_window <- 60 * 60 * c(24, 72)

event_subset_a <- event_a[event_b, roll = join_window[1]]
event_subset_b <- event_b[event_a, roll = join_window[2]]

event_df <- rbind(event_subset_a, event_subset_b)
event_df[, `:=`(c("criteria_a", "criteria_b"), NULL)]

setkeyv(event_df, c("fake_id", "date"))
event_df <- unique(event_df)

Current output

      fake_id                date
1  1184695414 2018-01-05 09:00:00
2  1184695414 2018-01-05 19:00:00
3  1184695414 2018-01-05 22:00:00
4  1184695414 2018-01-21 14:00:00
5  1184695414 2018-01-21 16:00:00
6  1184695414 2018-01-21 18:00:00
7  1184695414 2018-01-23 10:00:00

Desired output

      fake_id                date
1  1184695414 2018-01-05 09:00:00
2  1184695414 2018-01-21 14:00:00
3  1184695414 2018-01-23 10:00:00
  • Why are you mixing `data.table` and `dplyr` syntax? – Maurits Evers Sep 12 '19 at 23:43
  • It would be useful if you could show what your expected output would look like? – Ronak Shah Sep 12 '19 at 23:51
  • I'm away from my computer now, but recently I provided an answer to a somewhat similar question. You may want to look at it and see if you can apply some of the functions, particularly `data.table::foverlaps` and non-equi joins. https://stackoverflow.com/questions/57876463/non-equi-joins-comparing-two-data-frames-in-r?noredirect=1#comment102178046_57876463 – PavoDive Sep 13 '19 at 00:28
  • @MauritsEvers, sorry! It's a force of habit as big fan of dplyr. – user111074 Sep 13 '19 at 00:43
  • @RonakShah, hope this helps. Thanks! – user111074 Sep 13 '19 at 00:44
  • @PavoDive, I'll take a look at your previous answer related to this question. Thanks! – user111074 Sep 13 '19 at 00:45

2 Answers


At first I thought this problem needed to be addressed with a non-equi join, but then I realized that a standard join is sufficient.

The overall process would be like this:

  1. Eliminate duplicated rows
  2. Join both tables
  3. Filter those in which condition A appeared first. Mark them as "type A" and establish the onset time.
  4. Filter those in which condition B appeared first. Mark them as "type B", and establish the onset time.
  5. Drop the unmarked rows.


library(data.table)
library(lubridate)  # we'll use the dhours() function

setDT(eventA, key = "fake_id")
setDT(eventB, key = "fake_id")

Rename the date columns so it's easier to understand what belongs where

setnames(eventA, "date", "dateA")
setnames(eventB, "date", "dateB")

Eliminate duplicated rows

eventA <- eventA[!duplicated(eventA), ]
eventB <- eventB[!duplicated(eventB), ]

Join both tables and, with chaining, do steps 2 to 5 of the overall plan

eventA[eventB, 
       allow.cartesian = TRUE][
          dateA < dateB & dateB <= dateA + dhours(24), 
          `:=` (type = "A", 
                onset = dateA)][
                    dateB < dateA & dateA <= dateB + dhours(72), 
                    `:=` (type = "B", 
                          onset = dateB)][!is.na(type), ][]

       fake_id               dateA               dateB type               onset
 1: 1000686267 2018-01-04 11:00:00 2018-01-04 08:00:00    B 2018-01-04 08:00:00
 2: 1000686267 2018-01-05 10:00:00 2018-01-04 08:00:00    B 2018-01-04 08:00:00
 3: 1000686267 2018-01-04 11:00:00 2018-01-04 13:00:00    A 2018-01-04 11:00:00
 4: 1000686267 2018-01-05 10:00:00 2018-01-04 13:00:00    B 2018-01-04 13:00:00
 5: 1070640921 2018-01-07 13:00:00 2018-01-06 10:00:00    B 2018-01-06 10:00:00
 6: 1070640921 2018-01-07 13:00:00 2018-01-07 11:00:00    B 2018-01-07 11:00:00
 7: 1070640921 2018-01-31 08:00:00 2018-01-31 05:00:00    B 2018-01-31 05:00:00
 8: 1184695414 2018-01-05 19:00:00 2018-01-05 09:00:00    B 2018-01-05 09:00:00
 9: 1184695414 2018-01-05 09:00:00 2018-01-05 22:00:00    A 2018-01-05 09:00:00
10: 1184695414 2018-01-05 19:00:00 2018-01-05 22:00:00    A 2018-01-05 19:00:00
11: 1184695414 2018-01-21 18:00:00 2018-01-21 14:00:00    B 2018-01-21 14:00:00
12: 1184695414 2018-01-23 10:00:00 2018-01-21 14:00:00    B 2018-01-21 14:00:00
13: 1184695414 2018-01-21 18:00:00 2018-01-21 16:00:00    B 2018-01-21 16:00:00
14: 1184695414 2018-01-23 10:00:00 2018-01-21 16:00:00    B 2018-01-21 16:00:00

The output is very different from your expected output, but looking at your data and at the rules you established (if A is earlier than B and B is within 24 h of A, then type A; if B is earlier than A and A is within 72 h of B, then type B), there are 11 additional matches beyond the ones you found (in other words, either your expected output is wrong or your stated rules are).
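
If only the distinct onset times are wanted (the shape of the desired output), the chained result above can be stored and collapsed. A sketch, using the same eventA/eventB objects and conditions as above:

res <- eventA[eventB, allow.cartesian = TRUE][
          dateA < dateB & dateB <= dateA + dhours(24),
          `:=` (type = "A", onset = dateA)][
          dateB < dateA & dateA <= dateB + dhours(72),
          `:=` (type = "B", onset = dateB)][
          !is.na(type)]

# one row per id and distinct onset time (this may still contain more onsets
# than the question's expected output, per the note above)
unique(res[, .(fake_id, onset)])[order(fake_id, onset)]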

PavoDive

This is similar to @PavoDive's answer but focuses on creating the non-equi join criteria before the actual join:

library(data.table)
setDT(event_a)
setDT(event_b)

# for the join: event B must fall between 72 hours before and 24 hours after event A
event_a[, `:=`(min_date = date - 72*60*60,
               max_date = date + 24*60*60)]

# join unique data.tables
unique(event_b)[unique(event_a),
           #non-equi join conditions
                on = .(fake_id = fake_id,
                       date > min_date,
                       date < max_date),
                nomatch = 0L,
                allow.cartesian = TRUE,
            # select columns - only fake_id and onset would be needed for the desired output
                j = .(fake_id,
                      a_date = i.date,
                      b_date = x.date,
                      onset = pmin(i.date, x.date),
                      first_type = ifelse(i.date == x.date,
                                          NA_character_,
                                          ifelse(i.date < x.date,
                                                 'A',
                                                 'B'))
                      )
                ]

       fake_id              a_date              b_date               onset first_type
 1: 1000686267 2018-01-04 11:00:00 2018-01-04 13:00:00 2018-01-04 11:00:00          A
 2: 1000686267 2018-01-04 11:00:00 2018-01-04 08:00:00 2018-01-04 08:00:00          B
 3: 1000686267 2018-01-05 10:00:00 2018-01-04 13:00:00 2018-01-04 13:00:00          B
 4: 1000686267 2018-01-05 10:00:00 2018-01-04 08:00:00 2018-01-04 08:00:00          B
 5: 1070640921 2018-01-07 13:00:00 2018-01-07 11:00:00 2018-01-07 11:00:00          B
 6: 1070640921 2018-01-07 13:00:00 2018-01-06 10:00:00 2018-01-06 10:00:00          B
 7: 1070640921 2018-01-31 08:00:00 2018-01-31 05:00:00 2018-01-31 05:00:00          B
 8: 1184695414 2018-01-23 10:00:00 2018-01-21 16:00:00 2018-01-21 16:00:00          B
 9: 1184695414 2018-01-23 10:00:00 2018-01-21 14:00:00 2018-01-21 14:00:00          B
10: 1184695414 2018-01-05 09:00:00 2018-01-05 22:00:00 2018-01-05 09:00:00          A
11: 1184695414 2018-01-05 09:00:00 2018-01-05 09:00:00 2018-01-05 09:00:00       <NA>
12: 1184695414 2018-01-05 19:00:00 2018-01-05 22:00:00 2018-01-05 19:00:00          A
13: 1184695414 2018-01-05 19:00:00 2018-01-05 09:00:00 2018-01-05 09:00:00          B
14: 1184695414 2018-01-21 18:00:00 2018-01-21 16:00:00 2018-01-21 16:00:00          B
15: 1184695414 2018-01-21 18:00:00 2018-01-21 14:00:00 2018-01-21 14:00:00          B

The difference in output is row 11, where the two events start at the same time. My join criteria doesn't exclude that case, because `data.table` non-equi joins don't currently support a not-equal (`!=`) condition.
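
One possible workaround (a sketch, not part of the original answer): keep the join exactly as above, store the result, and drop the equal-timestamp rows in a chained step, since the inequality can't be expressed in `on = ` itself.

# same non-equi join as above, then filter out pairs that start at the same time
res <- unique(event_b)[unique(event_a),
                       on = .(fake_id = fake_id,
                              date > min_date,
                              date < max_date),
                       nomatch = 0L,
                       allow.cartesian = TRUE,
                       j = .(fake_id,
                             a_date = i.date,
                             b_date = x.date,
                             onset = pmin(i.date, x.date))]

res[a_date != b_date]   # the `!=` condition applied after the join

Whether two events starting at the same time should count as a match at all is, of course, up to the rules in the question.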

Cole