1

I want to merge two data sets by ID and date. I want the dates in set B to be within 1 week of set A. I also want the earliest date (within 1 week) from set B to be merged to set A. Also, if there are multiple dates within 1 week, do not join multiple observations from set B to set A in a given range and ID.

What would the code look like for this? Would this be a left join if I wanted to prioritize set A?

Set A:

ID Date
1 10-21-2021
2 03-03-2020

Set B:

ID Date
1 10-22-2021
2 03-04-2020
1 10-23-2021

For example, the first two rows of set B would be joined to set A, but the final row would not because only one row within a week can be joined by each row in set A.

zephryl
  • 14,633
  • 3
  • 11
  • 30
  • Could you please provide some example data, as described [here](https://stackoverflow.com/a/5963610/17303805)? – zephryl Oct 28 '22 at 21:45
  • Example data is now given. – biostat_help Oct 28 '22 at 22:05
  • When >1 date is within a week, how do you want to pick which to keep? What if there are multiple equidistant dates, e.g. for 10-21-2021, both 10-20-2021 and 10-22-2021? – zephryl Oct 28 '22 at 22:22
  • Thank you for posing the question. Thankfully that won't but the case very often in my data set, but I would want to pick the dates that occur after in set B. So if there is 10-21-2021 in set A it would select 10-22-2021 in set B. – biostat_help Oct 28 '22 at 22:25

1 Answers1

0
  1. Merge in all set B dates for each ID.
  2. Compute the days between each pair of dates.
  3. Filter out dates > 7 days away.
  4. Within each ID and Date_A, select the row with the lowest Days_Diff and (in case of ties) latest Date_B.
library(dplyr)

set_A %>% 
  left_join(set_B, by = "ID", suffix = c("_A", "_B")) %>% 
  mutate(Days_Diff = abs(as.numeric(Date_B - Date_A, unit = "days"))) %>% 
  filter(Days_Diff <= 7) %>% 
  group_by(ID, Date_A) %>%
  slice(order(Days_Diff, desc(Date_B))[[1]]) %>%
  ungroup() %>% 
  select(!Days_Diff) # remove helper column

#> # A tibble: 2 × 5
#>      ID Date_A     Var_A Date_B     Var_B
#>   <dbl> <date>     <chr> <date>     <chr>
#> 1     1 2021-10-21 a     2021-10-22 W    
#> 2     2 2020-03-03 b     2020-03-04 X

Created on 2022-10-31 with reprex v2.0.2

Data

Note I added a few extra columns and test cases to the data provided by OP.

set_A <- tibble::tribble(
  ~ID,        ~Date, ~Var_A,
    1, "2021-10-21",    "a",
    2, "2020-03-03",    "b"
  ) %>% 
  mutate(Date = as.Date(Date))

set_B <- tibble::tribble(
  ~ID,        ~Date, ~Var_B,
    1, "2021-10-22",    "W",
    2, "2020-03-04",    "X",
    1, "2021-10-23",    "Y",
    2, "2020-03-02",    "Z"
  ) %>% 
  mutate(Date = as.Date(Date))
zephryl
  • 14,633
  • 3
  • 11
  • 30