
Am I trying to do the impossible? I want to match events in df1 with events in df2 whenever an event1 date falls on an event2 date or within the 10-day period before it. I have pasted samples from the two datasets below. I have looked and cannot find anything similar to this question on this forum, so maybe this is not possible. Thank you in advance!

head(df1)
    # A tibble: 6 x 1
    # Groups:   event1 [6]
      event1
       <date>     
    1 1980-01-10 
    2 1980-01-13 
    3 1980-01-14 
    4 1980-02-18 
    5 1980-02-27 
    6 1980-03-02 

head(df2)

      event2
    1  1980-01-16
    2  1980-01-18
    3  1980-01-19
    4  1980-02-12
    5  1980-09-26
    6  1980-10-23

I think what I want is something like this (using the first three event2s):

ev_1 <- interval(ymd('1980-01-06'), ymd('1980-01-16'))
ev_2 <- interval(ymd('1980-01-08'), ymd('1980-01-18')) 
ev_3 <- interval(ymd('1980-01-09'), ymd('1980-01-19'))

Then, I want to see whether any of the event1 dates fall within one of these interval periods. In total, I have about 60 event2 dates and hundreds of event1 dates over a 40-year period.

I was able to come up with this using instructions here, but is this the best approach? If so, is it possible to automate it so that I don't have to write all 60 intervals by hand?

> dates_test <- ymd(c("1980-01-10", "1980-01-13", "1980-01-14", "1980-02-18"))
> interval_test<- list(interval(ymd('1980-01-06'), ymd('1980-01-16')),
                       interval(ymd('1980-01-09'), ymd('1980-01-19')))
> dates_test %within% interval_test
[1]  TRUE  TRUE  TRUE FALSE
Parit
  • Is the period you are describing an interval that starts 10 days before event2 and ends at event2? – Samuel Aug 04 '20 at 22:17
  • Yes. The dates for event1 are discrete days. For event2, I want the prior 10 day interval. – Parit Aug 04 '20 at 22:25
  • Thanks for mentioning "interval." Using that term as a search term led me to: https://stackoverflow.com/questions/41497351/check-if-a-date-is-within-an-interval-in-r. I think there is an answer in there that fits for me. There are quite a few options, it seems! – Parit Aug 04 '20 at 22:50

2 Answers


You can create all possible combinations of event1 and event2, then keep the rows where event2 falls no more than 10 days after event1.

# all event1/event2 pairs
combinations <- expand.grid(df1$event1, df2$event2)
# keep pairs where event1 falls within the 10 days up to and including event2
matches <- combinations[combinations[, 2] >= combinations[, 1] &
                        combinations[, 2] - combinations[, 1] <= 10, ]
matches

         Var1       Var2
1  1980-01-10 1980-01-16
2  1980-01-13 1980-01-16
3  1980-01-14 1980-01-16
7  1980-01-10 1980-01-18
8  1980-01-13 1980-01-18
9  1980-01-14 1980-01-18
13 1980-01-10 1980-01-19
14 1980-01-13 1980-01-19
15 1980-01-14 1980-01-19
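
If you only need the matched event1 dates rather than all matching pairs, the first column can be deduplicated; a small sketch continuing from the code above:

# each event1 date may match several event2 dates; keep each one once
unique(matches$Var1)
[1] "1980-01-10" "1980-01-13" "1980-01-14"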
marcguery

The OP has asked two questions:

  1. Is using the %within% operator from lubridate the best approach?
  2. Is it possible to automate it so that the OP does not have to write all 60 intervals by hand?

To answer the second question first: Yes, it is possible:

%within%, lapply() and interval()

The OP was almost there. According to the documentation of a %within% b,

If b is a list of intervals, a is checked if it falls within any of the intervals

We can create a list of intervals from the given date vector df2$event2 by

lapply(df2$event2, function(x) interval(x - 10, x))
[[1]]
[1] 1980-01-06 UTC--1980-01-16 UTC

[[2]]
[1] 1980-01-08 UTC--1980-01-18 UTC

[[3]]
[1] 1980-01-09 UTC--1980-01-19 UTC

[[4]]
[1] 1980-02-02 UTC--1980-02-12 UTC

[[5]]
[1] 1980-09-16 UTC--1980-09-26 UTC

[[6]]
[1] 1980-10-13 UTC--1980-10-23 UTC

The start date of each interval is computed from the end date by subtracting 10 days. So,

library(lubridate)
df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x))

returns a logical vector (in line with the OP's expected result)

[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

which can be used to subset df1, picking the matching events as a vector of dates

df1[df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x)), ]
[1] "1980-01-10" "1980-01-13" "1980-01-14"

or

df1[df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x)), , drop = FALSE]

which returns the subsetted data.frame.

      event1
1 1980-01-10
2 1980-01-13
3 1980-01-14
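
Since the interval list above is rebuilt in every call, it may be worth storing it once and reusing it; a small sketch of that variant:

library(lubridate)
# build the list of 10-day look-back intervals once and reuse it
ivals <- lapply(df2$event2, function(x) interval(x - 10, x))
df1[df1$event1 %within% ivals, , drop = FALSE]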

%inrange% from data.table

For the sake of completeness, the data.table package provides a similar operator, %inrange%:

library(data.table)
setDT(df1)
setDT(df2)
df1[event1 %inrange% df2[, .(event2 - 10L, event2)]]
       event1
1: 1980-01-10
2: 1980-01-13
3: 1980-01-14

setDT(df1) and setDT(df2) coerce the data.frames to data.table objects by reference.
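
Unlike as.data.table(), which returns a copy, setDT() converts in place; a minimal sketch (with made-up example data) illustrating the difference:

library(data.table)
df <- data.frame(event1 = as.Date(c("1980-01-10", "1980-01-13")))
dt_copy <- as.data.table(df)  # returns a data.table copy; df is untouched
setDT(df)                     # converts df itself, by reference
class(df)
[1] "data.table" "data.frame"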

Benchmarking

Now, we can try to answer the OP's first question concerning "the best approach".

The OP has not specified criteria for judging an approach as "the best". Probably, the OP was mainly concerned about the effort of writing 60 intervals by hand.

That issue has now been solved, so let's compare the three approaches posted so far in terms of execution speed:

  1. %within% and interval() from lubridate
  2. expand.grid() suggested by marcguery
  3. %inrange% from data.table

For benchmarking, the bench package is used because it measures execution times as well as memory allocation for varying problem sizes. It also checks that the results are identical. Therefore, the three approaches are modified to return a vector of dates.

library(bench)
library(ggplot2)
bm <- press(
  n1 = c(100L, 1E3L, 1E4L),
  n2 = c(10L, 100L, 1000L),
  {
    beg <- as.Date("1980-01-01")
    end <- as.Date("2020-12-31")
    df1 <- data.frame(event1 = seq(beg, end, length.out = n1))
    df2 <- data.frame(event2 = seq(beg, end, length.out = n2))
    dt1 <- as.data.table(df1)
    dt2 <- as.data.table(df2)
    mark(
      within = {
        df1[df1$event1 %within% lapply(df2$event2, function(x) interval(x - 10, x)), ]
      },
      inrange = {
        dt1[event1 %inrange% dt2[, .(event2 - 10L, event2)], event1]
      },
      exp.grid = {
        combinations <- expand.grid(df1$event1, df2$event2)
        matches <- combinations[combinations[,2] >= combinations[,1] & 
                       combinations[,2] - combinations[,1] <= 10,]
        unique(matches[[1L]])
      },
      check = TRUE
    )
  }
)
autoplot(bm)

[Plot: benchmark timings of within, inrange, and exp.grid across problem sizes n1 and n2; log time scale]

Please, note the logarithmic time scale.

Only for the smallest problem sizes is the expand.grid() approach the fastest. For all other problem sizes (including the case of 1000 event1 and 100 event2 dates, which is close to the OP's problem size), data.table's %inrange% is the fastest. For the largest case, with 10000 event1 and 1000 event2 dates, data.table is more than two orders of magnitude faster than the other approaches.

library(dplyr)
bm %>% 
  select(1:11) %>% 
  filter(n1 == max(n1), n2 == max(n2)) %>% 
  mutate(expression = names(expression) %>% unique())
# A tibble: 3 x 11
  expression    n1    n2      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <chr>      <int> <int> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 within     10000  1000 780.16ms  780.2ms      1.28     307MB     2.56     1     2      780ms
2 inrange    10000  1000   2.68ms    3.3ms    293.       491KB     0      147     0      502ms
3 exp.grid   10000  1000 834.35ms  834.3ms      1.20     882MB     3.60     1     3      834ms

Furthermore, data.table allocates about three orders of magnitude less memory (0.5 MB vs. 307 MB or 882 MB, respectively).
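
Also for the sake of completeness: if the matching pairs themselves are needed rather than just the subset of df1, a data.table non-equi join can return them. A sketch, not benchmarked above; the start and end helper columns are introduced here:

# add the interval bounds to dt2 (by reference)
dt2[, `:=`(start = event2 - 10L, end = event2)]
# non-equi join: keep event1 dates falling within [start, end]
dt1[dt2, on = .(event1 >= start, event1 <= end), nomatch = NULL,
    .(event1 = x.event1, event2 = i.event2)]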

Uwe