0

I have a problem figuring out how to extract the time of events that happened during another time of an event(in my case letters). I hope someone could help me out! in short: I would like the rows of bb tibble whose start or end or both(start and end)time is within the range of aa records. The end goal is to find how many times the letters occurred in aa tibble. Please, do let me know if more clarifications are needed! Thank you in advance!

aa <- tibble(
           start = as.POSIXct(c("2019-05-02 07:08:49", "2019-05-02 07:09:21",
                                    "2019-05-02 07:09:41", "2019-05-02 07:10:05",
                                    "2019-05-02 07:24:52", "2019-05-02 07:28:50",
                                    "2019-05-02 07:29:23", "2019-05-02 07:30:16",
                                    "2019-05-02 07:33:13", "2019-05-02 07:33:43",
                                    "2019-05-02 07:35:31", "2019-05-02 07:36:29",
                                    "2019-05-02 07:38:14", "2019-05-02 07:43:26",
                                    "2019-05-02 07:44:59", "2019-05-02 07:53:45",
                                    "2019-05-02 07:54:28")),
           end = as.POSIXct(c("2019-05-02 07:09:29", "2019-05-02 07:10:02",
                                    "2019-05-02 07:10:17", "2019-05-02 07:10:40",
                                    "2019-05-02 07:29:10", "2019-05-02 07:29:32",
                                    "2019-05-02 07:30:35", "2019-05-02 07:30:53",
                                    "2019-05-02 07:33:48", "2019-05-02 07:34:18",
                                    "2019-05-02 07:36:06", "2019-05-02 07:38:34",
                                    "2019-05-02 07:38:49", "2019-05-02 07:45:19",
                                    "2019-05-02 07:45:35", "2019-05-02 07:54:20",
                                    "2019-05-02 07:55:03")))

bb <- tibble(letters = sample(letters[1:4], 12, TRUE), 
             started = as.POSIXct(c("2019-05-02 07:30:23", "2019-05-02 07:30:56",
                                      "2019-05-02 07:31:29", "2019-05-02 07:31:55",
                                      "2019-05-02 07:32:22", "2019-05-02 07:32:48",
                                      "2019-05-02 07:33:14", "2019-05-02 07:44:36",
                                      "2019-05-02 07:45:11", "2019-05-02 07:45:36",
                                      "2019-05-02 07:46:01", "2019-05-02 07:48:14"
                                     )),
             stopped = as.POSIXct(c("2019-05-02 07:30:56", "2019-05-02 07:31:29",
                                    "2019-05-02 07:31:55", "2019-05-02 07:32:22",
                                    "2019-05-02 07:32:48", "2019-05-02 07:33:14",
                                    "2019-05-02 07:33:40", "2019-05-02 07:45:10",
                                    "2019-05-02 07:45:36", "2019-05-02 07:46:01",
                                    "2019-05-02 07:46:25", "2019-05-02 07:48:48")))
eyei
  • 402
  • 4
  • 12
  • 1
    Have you seen https://stackoverflow.com/questions/40647177/find-all-date-ranges-for-overlapping-start-and-end-dates-in-r? – MrFlick Jun 07 '19 at 19:37
  • I haven't but I find it interesting. Thank you! I may apply it soon. – eyei Jun 07 '19 at 21:09

2 Answers2

2

Here's an approach using fuzzyjoin, which lets you specify that bb$started must be >= aa$start, while bb$stopped <= aa$end.

library(fuzzyjoin); 
fuzzy_inner_join(bb, aa,
                   by = c("started" = "start",
                          "stopped" = "end"),
                   match_fun = list(`>=`, `<=`)
                 )

# A tibble: 2 x 5
  letters started             stopped             start               end                
  <chr>   <dttm>              <dttm>              <dttm>              <dttm>             
1 a       2019-05-02 07:33:14 2019-05-02 07:33:40 2019-05-02 07:33:13 2019-05-02 07:33:48
2 c       2019-05-02 07:44:36 2019-05-02 07:45:10 2019-05-02 07:43:26 2019-05-02 07:45:19
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • Thank you very much! It is computationally expensive with 5000 rows though. I am very grateful, don't get me wrong! :) All the best! – eyei Jun 07 '19 at 20:58
1

There are two functions in the data.table package that would be useful for this.

The simplest is probably the inrange function:

Use inrange to identify rows in bb where either started or stopped are within any of the row-wise start-end intervals in aa.

library(data.table)

setDT(bb)
bb[started %inrange% aa | stopped %inrange% aa]

#    letters             started             stopped
# 1:       a 2019-05-02 07:30:23 2019-05-02 07:30:56
# 2:       a 2019-05-02 07:32:48 2019-05-02 07:33:14
# 3:       c 2019-05-02 07:33:14 2019-05-02 07:33:40
# 4:       c 2019-05-02 07:44:36 2019-05-02 07:45:10
# 5:       b 2019-05-02 07:45:11 2019-05-02 07:45:36

To get the counts you want, group by letters and return the number of occurrences:

bb[started %inrange% aa | stopped %inrange% aa, list(count = .N), by = letters]

#   letters count
#1:       a     2
#2:       c     2
#3:       b     1

The foverlaps function could also be used for this; more flexible but a bit more involved:

First set keys on aa and bb:

setDT(aa)
setkey(aa, start, end)
setDT(bb)
setkey(bb, started, stopped)

A simple call to foverlaps shows the result of a join, with NAs for rows in bb that don't match any intervals in aa.

foverlaps(aa, bb)

#                  start                 end letters             started             stopped
# 1: 2019-05-02 07:29:23 2019-05-02 07:30:35       a 2019-05-02 07:30:23 2019-05-02 07:30:56
# 2: 2019-05-02 07:30:16 2019-05-02 07:30:53       a 2019-05-02 07:30:23 2019-05-02 07:30:56
# 3:                <NA>                <NA>       a 2019-05-02 07:30:56 2019-05-02 07:31:29
# 4:                <NA>                <NA>       b 2019-05-02 07:31:29 2019-05-02 07:31:55
# 5:                <NA>                <NA>       d 2019-05-02 07:31:55 2019-05-02 07:32:22
# 6:                <NA>                <NA>       b 2019-05-02 07:32:22 2019-05-02 07:32:48
# 7: 2019-05-02 07:33:13 2019-05-02 07:33:48       a 2019-05-02 07:32:48 2019-05-02 07:33:14
# 8: 2019-05-02 07:33:13 2019-05-02 07:33:48       c 2019-05-02 07:33:14 2019-05-02 07:33:40
# 9: 2019-05-02 07:43:26 2019-05-02 07:45:19       c 2019-05-02 07:44:36 2019-05-02 07:45:10
# 10: 2019-05-02 07:44:59 2019-05-02 07:45:35       c 2019-05-02 07:44:36 2019-05-02 07:45:10
# 11: 2019-05-02 07:43:26 2019-05-02 07:45:19       b 2019-05-02 07:45:11 2019-05-02 07:45:36
# 12: 2019-05-02 07:44:59 2019-05-02 07:45:35       b 2019-05-02 07:45:11 2019-05-02 07:45:36
# 13:                <NA>                <NA>       c 2019-05-02 07:45:36 2019-05-02 07:46:01
# 14:                <NA>                <NA>       a 2019-05-02 07:46:01 2019-05-02 07:46:25
# 15:                <NA>                <NA>       c 2019-05-02 07:48:14 2019-05-02 07:48:48

To obtain only rows in bb that match an interval in aa, use set nomatch:

foverlaps(bb, aa, nomatch = NULL)

Similarly, to show each matching row only once, set mult:

foverlaps(bb, aa, nomatch = NULL, mult = "first")

#                   start                 end letters             started             stopped
# 1: 2019-05-02 07:29:23 2019-05-02 07:30:35       a 2019-05-02 07:30:23 2019-05-02 07:30:56
# 2: 2019-05-02 07:33:13 2019-05-02 07:33:48       a 2019-05-02 07:32:48 2019-05-02 07:33:14
# 3: 2019-05-02 07:33:13 2019-05-02 07:33:48       c 2019-05-02 07:33:14 2019-05-02 07:33:40
# 4: 2019-05-02 07:43:26 2019-05-02 07:45:19       c 2019-05-02 07:44:36 2019-05-02 07:45:10
# 5: 2019-05-02 07:43:26 2019-05-02 07:45:19       b 2019-05-02 07:45:11 2019-05-02 07:45:36

And you can count the matching occurrences of each letter by grouping on letters and counting rows:

foverlaps(aa, bb, nomatch = NULL, mult = "first")[ , list(count = .N), by = letters]

#   letters count
#1:       a     2
#2:       c     2
#3:       b     1
Chris Holbrook
  • 2,531
  • 1
  • 17
  • 30
  • Yes, thank you! I just applied your methods to a subset of the original dataset and the foverlaps() does exactly what I need. I'll check later how they behave on a larger dataset. But the important thing is that you the end result doesn't include only the %inrange% datetimes. Thank you as well for the explanations! All the best! – eyei Jun 08 '19 at 11:19
  • `bb[inrange(bb$started, aa$start, aa$end, incbounds = FALSE) | inrange(bb$stopped, aa$start, aa$end, incbounds = FALSE),list(count = .N), by = letters]` shouldn't give the same result as foverlaps()? I applied both to the subset of the original dataset and the results are different. – eyei Jun 08 '19 at 11:58
  • 1
    by default, foverlaps acts more like inrange with `incbounds = TRUE` and I am not aware of any argument in foverlaps that would allow the type of constraint set by `incbounds = FALSE` in inrange. From `?data.table::foverlaps`: *For type="any", as long as c<=b and d>=a, they overlap.* I am not aware of any foverlaps argument that would allow the overlap to be defined as *ca*. If that's important to you, then I think inrange is the way to go. – Chris Holbrook Jun 09 '19 at 00:34
  • I'll stick with `inrange` for now! Thank you! – eyei Jun 09 '19 at 19:22