I have two data frames like the dummy samples below: df1 (the main data set) and df2. The data originally come from two different logs.

df1 is always a complete dataset from 02:00 until 02:00 the next day, with different stream values during this period. df2 only has rows if there was an activity in that specific time window (between 02:00 and 02:00 the next day).

I was wondering how I can combine df1 and df2. My goal is a complete dataset like df1 (from 02:00 to 02:00 the next day) with an additional variable, code. This variable can be empty if there is no activity/code during that start/end time in df2. Sometimes a row in df1 needs to be broken into two if an activity covers only part of that start/end period.

I am looking for a result similar to this (showing only 4 rows as an example):

id   STREAM_1   STREAM_2   STREAM_3   START                END
401  NIVH-ON    IN         null       2022-08-16 7:43:30   2022-08-16 7:45:00
401  INVH-ON    OUT        12be4      2022-08-16 7:45:00   2022-08-16 7:49:00
401  INVH-ON    OUT        12be4      2022-08-16 7:49:00   2022-08-16 8:15:00

df1:

df1 <- tibble::tribble(
   ~id,   ~STREAM_1, ~STREAM_2,                ~START,                  ~END,
  401L,  "NIVH-OFF",      "IN",  "2022-08-16 2:00:00",  "2022-08-16 3:35:00",
  401L,  "NIVH-OFF",      "IN",  "2022-08-16 3:35:00",  "2022-08-16 7:22:45",
  401L,   "NIVH-ON",      "IN",  "2022-08-16 7:22:45",  "2022-08-16 7:31:45",
  401L, "NIVH-DOCK",      "IN",  "2022-08-16 7:31:45",  "2022-08-16 7:43:30",
  401L,   "NIVH-ON",      "IN",  "2022-08-16 7:43:30",  "2022-08-16 7:45:00",
  401L,   "INVH-ON",      "IN",  "2022-08-16 7:45:00",  "2022-08-16 7:49:00",
  401L,   "INVH-ON",     "OUT",  "2022-08-16 7:49:00",  "2022-08-16 8:50:00",
  401L,   "INVH-ON",      "IN",  "2022-08-16 8:50:00",  "2022-08-16 8:56:00",
  401L,   "NIVH-ON",      "IN",  "2022-08-16 8:56:00",  "2022-08-16 8:58:00",
  401L,   "INVH-ON",      "IN",  "2022-08-16 8:58:00",  "2022-08-16 9:00:00",
  401L,   "NIVH-ON",      "IN",  "2022-08-16 9:00:00",  "2022-08-16 9:24:00",
  401L,   "INVH-ON",      "IN",  "2022-08-16 9:24:00",  "2022-08-16 9:25:00",
  401L,   "NIVH-ON",      "IN",  "2022-08-16 9:25:00",  "2022-08-16 9:55:00",
  401L,   "NIVH-ON",     "OUT",  "2022-08-16 9:55:00", "2022-08-16 10:20:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 10:20:00", "2022-08-16 11:30:00",
  401L,   "NIVH-ON",     "OUT", "2022-08-16 11:30:00", "2022-08-16 12:00:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 12:00:00", "2022-08-16 12:03:00",
  401L,   "INVH-ON",      "IN", "2022-08-16 12:03:00", "2022-08-16 12:06:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 12:06:00", "2022-08-16 12:08:00",
  401L,   "INVH-ON",      "IN", "2022-08-16 12:08:00", "2022-08-16 12:11:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 12:11:00", "2022-08-16 12:33:00",
  401L,   "INVH-ON",      "IN", "2022-08-16 12:33:00", "2022-08-16 12:34:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 12:34:00", "2022-08-16 13:25:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 13:25:00", "2022-08-16 13:35:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 13:35:00", "2022-08-16 14:14:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 14:14:00", "2022-08-16 14:15:00",
  401L,   "INVH-ON",      "IN", "2022-08-16 14:15:00", "2022-08-16 14:19:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 14:19:00", "2022-08-16 15:04:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 15:04:00", "2022-08-16 15:09:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 15:09:00", "2022-08-16 15:24:00",
  401L,   "NIVH-ON",     "OUT", "2022-08-16 15:24:00", "2022-08-16 15:26:00",
  401L,   "INVH-ON",     "OUT", "2022-08-16 15:26:00", "2022-08-16 16:04:00",
  401L,   "INVH-ON",      "IN", "2022-08-16 16:04:00", "2022-08-16 16:07:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 16:07:00", "2022-08-16 16:13:00",
  401L,   "INVH-ON",      "IN", "2022-08-16 16:13:00", "2022-08-16 16:15:00",
  401L,   "NIVH-ON",      "IN", "2022-08-16 16:15:00", "2022-08-16 16:20:15",
  401L, "NIVH-DOCK",      "IN", "2022-08-16 16:20:15",  "2022-08-17 2:00:00"
  )

df2:

df2 <- tibble::tribble(
   ~id,   ~code,     ~date,    ~start_time,       ~endtime,
  401L, "12be4", 20220816L, "2022-08-16 07:45:15",  "2022-08-16 08:15:15",
  401L, "12be4", 20220816L, "2022-08-16 08:15:45",  "2022-08-16 08:21:15",
  401L, "12be4", 20220816L, "2022-08-16 08:21:30",  "2022-08-16 08:40:15",
  401L, "11e44", 20220816L, "2022-08-16 14:49:30",  "2022-08-16 15:17:30",
  401L, "12be4", 20220816L, "2022-08-16 15:24:30",  "2022-08-16 15:33:15",
  401L, "1fb84", 20220816L, "2022-08-16 15:33:15",  "2022-08-16 15:43:45",
  401L, "1fb84", 20220816L, "2022-08-16 15:46:30",  "2022-08-16 15:50:30",
  401L, "1b1c4", 20220816L, "2022-08-16 15:50:30",  "2022-08-16 16:02:45"
  )
  • Have you considered merging the two data frames with `full_join` or `left_join` using the `join_by` argument? See the documentation of the development version of `dplyr`: https://github.com/tidyverse/dplyr/blob/main/NEWS.md – Julien Aug 21 '22 at 14:33
  • You have to install the GitHub version of `dplyr` with `devtools::install_github("tidyverse/dplyr")` – Julien Aug 21 '22 at 15:18
  • Check this post: https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions – Maël Aug 26 '22 at 13:10

3 Answers

You could use `foverlaps` from the data.table package:

library(data.table)

# convert to data.table
setDT(df1)
setDT(df2)

# character datetimes to POSIXct
df1[,START:=as.POSIXct(START)]
df1[,END:=as.POSIXct(END)]
df2[,start_time:=as.POSIXct(start_time)]
df2[,endtime:=as.POSIXct(endtime)]

# key needed for second data.table
setkey(df2,id,start_time,endtime)

# overlap join, then keep the first matching code per df1 row to remove duplicates
foverlaps(df1, df2, by.x = c("id", "START", "END"))[
    , .(id, STREAM_1, STREAM_2, STREAM_3 = code, START, END)][
    , .(STREAM_3 = first(STREAM_3)), by = .(id, STREAM_1, STREAM_2, START, END)]

       id  STREAM_1 STREAM_2               START                 END STREAM_3
    <int>    <char>   <char>              <POSc>              <POSc>   <char>
 1:   401  NIVH-OFF       IN 2022-08-16 02:00:00 2022-08-16 03:35:00     <NA>
 2:   401  NIVH-OFF       IN 2022-08-16 03:35:00 2022-08-16 07:22:45     <NA>
 3:   401   NIVH-ON       IN 2022-08-16 07:22:45 2022-08-16 07:31:45     <NA>
 4:   401 NIVH-DOCK       IN 2022-08-16 07:31:45 2022-08-16 07:43:30     <NA>
 5:   401   NIVH-ON       IN 2022-08-16 07:43:30 2022-08-16 07:45:00     <NA>
 6:   401   INVH-ON       IN 2022-08-16 07:45:00 2022-08-16 07:49:00    12be4
 7:   401   INVH-ON      OUT 2022-08-16 07:49:00 2022-08-16 08:50:00    12be4
 8:   401   INVH-ON       IN 2022-08-16 08:50:00 2022-08-16 08:56:00     <NA>
 9:   401   NIVH-ON       IN 2022-08-16 08:56:00 2022-08-16 08:58:00     <NA>
10:   401   INVH-ON       IN 2022-08-16 08:58:00 2022-08-16 09:00:00     <NA>
...
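
The join above attaches a code to each whole df1 row but does not split rows where an activity covers only part of a period, which the question also asks for. A minimal sketch of one way to clip each match to the overlapping sub-interval (my addition, not part of the original answer; the complementary code-free segments of a partially covered row would still need to be re-added):

library(data.table)

# continues from the answer above (df2 is already keyed);
# clip every overlap-join match to the overlapping sub-interval,
# so a df1 row spanning an activity boundary is effectively split there
ov <- foverlaps(df1, df2, by.x = c("id", "START", "END"))
ov[, `:=`(
  seg_START = pmax(START, start_time, na.rm = TRUE),  # later of the two starts
  seg_END   = pmin(END, endtime, na.rm = TRUE)        # earlier of the two ends
)]
ov[, .(id, STREAM_1, STREAM_2, STREAM_3 = code, START = seg_START, END = seg_END)]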
  • Your approach is what I am looking for, but it does not seem to return the correct result; rows 7, 8 and 9 are duplicated and the output is not correct – DanG Aug 27 '22 at 08:55
  • see my update to remove duplicate codes – Waldi Aug 27 '22 at 12:48

Maybe you want to use `fuzzy_left_join` from the fuzzyjoin package. It lets you join by multiple columns and specify a `match_fun` per column, for example comparing `START` with `start_time`. You can replace the resulting NAs with "null". Here is a reproducible example:

library(tidyverse)
library(lubridate)
library(fuzzyjoin)
output <- df1 %>%
  mutate(START = ymd_hms(START),
         END = ymd_hms(END)) %>%
  fuzzy_left_join(df2 %>% mutate(start_time = ymd_hms(start_time),
                                 endtime = ymd_hms(endtime)),
    by = c("id" = "id", "START" = "start_time", "END" = "endtime"),
    match_fun = list(`==`, `<`, `>`)) %>%
  mutate(code = ifelse(is.na(code), "null", code))
print(output, n = 41)
#> # A tibble: 40 × 10
#>     id.x STREAM_1  STREAM_2 START               END                  id.y code 
#>    <int> <chr>     <chr>    <dttm>              <dttm>              <int> <chr>
#>  1   401 NIVH-OFF  IN       2022-08-16 02:00:00 2022-08-16 03:35:00    NA null 
#>  2   401 NIVH-OFF  IN       2022-08-16 03:35:00 2022-08-16 07:22:45    NA null 
#>  3   401 NIVH-ON   IN       2022-08-16 07:22:45 2022-08-16 07:31:45    NA null 
#>  4   401 NIVH-DOCK IN       2022-08-16 07:31:45 2022-08-16 07:43:30    NA null 
#>  5   401 NIVH-ON   IN       2022-08-16 07:43:30 2022-08-16 07:45:00    NA null 
#>  6   401 INVH-ON   IN       2022-08-16 07:45:00 2022-08-16 07:49:00    NA null 
#>  7   401 INVH-ON   OUT      2022-08-16 07:49:00 2022-08-16 08:50:00   401 12be4
#>  8   401 INVH-ON   OUT      2022-08-16 07:49:00 2022-08-16 08:50:00   401 12be4
#>  9   401 INVH-ON   IN       2022-08-16 08:50:00 2022-08-16 08:56:00    NA null 
#> 10   401 NIVH-ON   IN       2022-08-16 08:56:00 2022-08-16 08:58:00    NA null 
#> 11   401 INVH-ON   IN       2022-08-16 08:58:00 2022-08-16 09:00:00    NA null 
#> 12   401 NIVH-ON   IN       2022-08-16 09:00:00 2022-08-16 09:24:00    NA null 
#> 13   401 INVH-ON   IN       2022-08-16 09:24:00 2022-08-16 09:25:00    NA null 
#> 14   401 NIVH-ON   IN       2022-08-16 09:25:00 2022-08-16 09:55:00    NA null 
#> 15   401 NIVH-ON   OUT      2022-08-16 09:55:00 2022-08-16 10:20:00    NA null 
#> 16   401 NIVH-ON   IN       2022-08-16 10:20:00 2022-08-16 11:30:00    NA null 
#> 17   401 NIVH-ON   OUT      2022-08-16 11:30:00 2022-08-16 12:00:00    NA null 
#> 18   401 NIVH-ON   IN       2022-08-16 12:00:00 2022-08-16 12:03:00    NA null 
#> 19   401 INVH-ON   IN       2022-08-16 12:03:00 2022-08-16 12:06:00    NA null 
#> 20   401 NIVH-ON   IN       2022-08-16 12:06:00 2022-08-16 12:08:00    NA null 
#> 21   401 INVH-ON   IN       2022-08-16 12:08:00 2022-08-16 12:11:00    NA null 
#> 22   401 NIVH-ON   IN       2022-08-16 12:11:00 2022-08-16 12:33:00    NA null 
#> 23   401 INVH-ON   IN       2022-08-16 12:33:00 2022-08-16 12:34:00    NA null 
#> 24   401 NIVH-ON   IN       2022-08-16 12:34:00 2022-08-16 13:25:00    NA null 
#> 25   401 NIVH-ON   IN       2022-08-16 13:25:00 2022-08-16 13:35:00    NA null 
#> 26   401 NIVH-ON   IN       2022-08-16 13:35:00 2022-08-16 14:14:00    NA null 
#> 27   401 NIVH-ON   IN       2022-08-16 14:14:00 2022-08-16 14:15:00    NA null 
#> 28   401 INVH-ON   IN       2022-08-16 14:15:00 2022-08-16 14:19:00    NA null 
#> 29   401 NIVH-ON   IN       2022-08-16 14:19:00 2022-08-16 15:04:00    NA null 
#> 30   401 NIVH-ON   IN       2022-08-16 15:04:00 2022-08-16 15:09:00    NA null 
#> 31   401 NIVH-ON   IN       2022-08-16 15:09:00 2022-08-16 15:24:00    NA null 
#> 32   401 NIVH-ON   OUT      2022-08-16 15:24:00 2022-08-16 15:26:00    NA null 
#> 33   401 INVH-ON   OUT      2022-08-16 15:26:00 2022-08-16 16:04:00   401 1fb84
#> 34   401 INVH-ON   OUT      2022-08-16 15:26:00 2022-08-16 16:04:00   401 1fb84
#> 35   401 INVH-ON   OUT      2022-08-16 15:26:00 2022-08-16 16:04:00   401 1b1c4
#> 36   401 INVH-ON   IN       2022-08-16 16:04:00 2022-08-16 16:07:00    NA null 
#> 37   401 NIVH-ON   IN       2022-08-16 16:07:00 2022-08-16 16:13:00    NA null 
#> 38   401 INVH-ON   IN       2022-08-16 16:13:00 2022-08-16 16:15:00    NA null 
#> 39   401 NIVH-ON   IN       2022-08-16 16:15:00 2022-08-16 16:20:15    NA null 
#> 40   401 NIVH-DOCK IN       2022-08-16 16:20:15 2022-08-17 02:00:00    NA null 
#> # … with 3 more variables: date <int>, start_time <dttm>, endtime <dttm>

Created on 2022-08-25 with reprex v2.0.2

Finally, you can tidy up the columns if you want.
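
One caveat (my addition, not part of the original answer): with `match_fun = list(`==`, `<`, `>`)` the join only keeps df2 activities that fall strictly inside a df1 period. The general overlap condition, START < endtime and END > start_time, can be expressed by crossing the columns in `by`:

# sketch of an any-overlap match; same packages as in the answer above
output_any <- df1 %>%
  mutate(START = ymd_hms(START), END = ymd_hms(END)) %>%
  fuzzy_left_join(
    df2 %>% mutate(start_time = ymd_hms(start_time),
                   endtime = ymd_hms(endtime)),
    by = c("id" = "id", "START" = "endtime", "END" = "start_time"),
    match_fun = list(`==`, `<`, `>`)  # id equal; intervals overlap
  )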

  • I looked at your reproducible example, but something seems incorrect: row 14, for example (2022-08-16 9:55:00 - 2022-08-16 10:20:00), matched with 2022-08-16 14:49:30 - 2022-08-16 15:17:30. Am I missing something here? The thing is, none of `==`, `>`, `<` returns the correct output – DanG Aug 25 '22 at 11:56
  • @DanG, my bad, now it should be right. I created the times in datetime format and swapped the `match_fun`, so the match is now on the right interval. Is this what you mean? – Quinten Aug 27 '22 at 09:22

Adapting @Quinten's answer for the dev version of dplyr:

#devtools::install_github("tidyverse/dplyr") # v1.0.99.9000 @ 2022-08-25
library(dplyr)
left_join(df1 %>% mutate(across(START:END, lubridate::ymd_hms)),
          df2 %>% mutate(across(start_time:endtime, lubridate::ymd_hms)),
          join_by(id == id, START < start_time, END > endtime))
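
For readers on a current dplyr: since dplyr 1.1.0 (on CRAN since early 2023), `join_by()` no longer requires the development version and ships interval helpers such as `overlaps()`. A sketch of the same join with it:

# requires dplyr >= 1.1.0; overlaps() matches any overlapping intervals
library(dplyr)
library(lubridate)

df1 %>%
  mutate(across(c(START, END), ymd_hms)) %>%
  left_join(df2 %>% mutate(across(c(start_time, endtime), ymd_hms)),
            join_by(id, overlaps(START, END, start_time, endtime)))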
  • Thanks, but I have difficulty installing the dev version of dplyr - "Failed to install 'dplyr' from GitHub: JSON: EXPECTED value GOT EOF" – DanG Aug 27 '22 at 09:02
  • I don't know the release schedule, but hopefully it will be updated on CRAN soon. In the meantime, this issue for a different library notes some potential solutions for that error: https://github.com/r-lib/remotes/issues/324 – Jon Spring Aug 28 '22 at 21:27