3

I would like to merge two datasets by ID. The date in dataset1 should match only the nearest date in dataset2. I want all dates from dataset1 to be included in the merge.

        dataset1 <- read.table(text="ID  Date     
A   2021-03-18  
A   2021-04-27
A   2021-04-05
A   2021-05-02
A   2021-02-08
A   2021-06-02
A   2021-05-29 ", header=TRUE)

        dataset2 <- read.table(text="ID  Date     
A   2021-01-01  
A   2021-01-01
A   2021-05-02
A   2021-05-09
A   2021-05-09
A   2021-05-09
A   2021-05-09
A   2021-06-16
A   2021-06-27 ", header=TRUE)
YASIR AA
  • 101
  • 3

2 Answers2

4

dplyr option using join_by with closest like this:

devtools::install_github("tidyverse/dplyr")
library(dplyr)
by <- join_by(ID, closest(Date > Date))
left_join(dataset2, dataset1, by)
#>   ID     Date.x     Date.y
#> 1  A 2021-01-01       <NA>
#> 2  A 2021-01-01       <NA>
#> 3  A 2021-05-02 2021-04-27
#> 4  A 2021-05-09 2021-05-02
#> 5  A 2021-05-09 2021-05-02
#> 6  A 2021-05-09 2021-05-02
#> 7  A 2021-05-09 2021-05-02
#> 8  A 2021-06-16 2021-06-02
#> 9  A 2021-06-27 2021-06-02

Created on 2022-10-19 with reprex v2.0.2

Please note: Make sure to install the right version using devtools::install_github("tidyverse/dplyr")!


To get nearest in both ways you could use the following chain:

library(dplyr)
dataset2 %>% 
  left_join(., dataset1, join_by(ID, closest(Date >= Date))) %>%
  left_join(., dataset1, join_by(ID, closest(Date.x <= Date))) %>%
  mutate(Date.y = ifelse(is.na(Date.y), Date, Date.y)) %>%
  select(-Date)
#>   ID     Date.x     Date.y
#> 1  A 2021-01-01 2021-02-08
#> 2  A 2021-01-01 2021-02-08
#> 3  A 2021-05-02 2021-05-02
#> 4  A 2021-05-09 2021-05-02
#> 5  A 2021-05-09 2021-05-02
#> 6  A 2021-05-09 2021-05-02
#> 7  A 2021-05-09 2021-05-02
#> 8  A 2021-06-16 2021-06-02
#> 9  A 2021-06-27 2021-06-02

Created on 2022-10-19 with reprex v2.0.2

Quinten
  • 35,235
  • 5
  • 20
  • 53
3

A data.table option with roll = "nearest":

setDT(dataset1)[, c("Date", "Date1") := as.Date(Date)]
setDT(dataset2)[, c("Date", "nearest") := as.Date(Date)]
dataset2[dataset1, on = .(ID, Date), roll = "nearest"][, Date := NULL][]

   ID    nearest      Date1
1:  A 2021-05-02 2021-03-18
2:  A 2021-05-02 2021-04-27
3:  A 2021-05-02 2021-04-05
4:  A 2021-05-02 2021-05-02
5:  A 2021-01-01 2021-02-08
6:  A 2021-06-16 2021-06-02
7:  A 2021-06-16 2021-05-29

Other option to match number of rows:

dataset1[dataset2, on = .(ID, Date), roll = "nearest"][, Date := NULL][]
   ID      Date1    nearest
1:  A 2021-02-08 2021-01-01
2:  A 2021-02-08 2021-01-01
3:  A 2021-05-02 2021-05-02
4:  A 2021-05-02 2021-05-09
5:  A 2021-05-02 2021-05-09
6:  A 2021-05-02 2021-05-09
7:  A 2021-05-02 2021-05-09
8:  A 2021-06-02 2021-06-16
9:  A 2021-06-02 2021-06-27
Maël
  • 45,206
  • 3
  • 29
  • 67