3

I have two data sets:

table1 <- data.frame(id=c(1000,1001,1002,1003), 
                    date=as.POSIXct(c("2012-05-13","2012-09-23","2011-04-09","2014-11-08"))) 

table2 <- data.frame(id2=c(1000,1000,1001,1002,1003,1003), 
                   date2=as.POSIXct(c("2012-05-13","2012-05-16","2012-09-24","2011-04-15","2014-11-09", "2014-11-10"))) 

I want to do a left join on table1 based on matching ID and Date, however not all dates have an exact match so I was wondering how could I join the dates based on the closest day? For example for id 1001, "2012-09-23" would match "2012-09-24" for id2 1001 since it is the only date for id2 and for 1003 the "2014-11-08" would match "2014-11-09" of 1003 for id2 since it is the closest day.

Desired result:

  id       date      date2
1 1000 2012-05-13 2012-05-13
2 1001 2012-09-23 2012-09-24
3 1002 2011-04-09 2011-04-15
4 1003 2014-11-08 2014-11-09
Xin
  • 666
  • 4
  • 16
  • fuzzyjoin is your friend. look up https://cran.r-project.org/web/packages/fuzzyjoin/README.html and look at this post: https://community.rstudio.com/t/tidy-way-to-range-join-tables-on-an-interval-of-dates/7881/2 – infominer Feb 26 '19 at 18:53
  • this is even easier! left_join(table1,table2, by=c("id" = "id2")) – infominer Feb 26 '19 at 18:57
  • Possible duplicate of [Merge nearest date, and related variables from a another dataframe by group](https://stackoverflow.com/questions/28072542/merge-nearest-date-and-related-variables-from-a-another-dataframe-by-group) – milan Feb 26 '19 at 19:14

3 Answers3

4

I'd also recommend you to follow the non-equi data.table joins, but in case you'd like to for whatever reason stick with dplyr and your data isn't really big or you have enough memory, you could also try:

library(dplyr)

table1 %>%
  left_join(table2, by = c("id" = "id2")) %>%
  group_by(id) %>%
  slice(which.min(abs(date - date2)))

Output:

# A tibble: 4 x 3
# Groups:   id [4]
     id date                date2              
  <dbl> <dttm>              <dttm>             
1  1000 2012-05-13 00:00:00 2012-05-13 00:00:00
2  1001 2012-09-23 00:00:00 2012-09-24 00:00:00
3  1002 2011-04-09 00:00:00 2011-04-15 00:00:00
4  1003 2014-11-08 00:00:00 2014-11-09 00:00:00
arg0naut91
  • 14,574
  • 2
  • 17
  • 38
1

Use data.table to perform a rolling join to the nearest value. DT1 is updated by reference, so it shoul be very fast, even on large/big data

library(data.table)

sample data

dt1 <- as.data.table(table1)
dt2 <- as.data.table(table2)

code

dt1[, date2 := dt2[dt1, date2, on = c("id2 == id", "date2 == date"), roll = "nearest"]][]

output

#      id       date      date2
# 1: 1000 2012-05-13 2012-05-13
# 2: 1001 2012-09-23 2012-09-23
# 3: 1002 2011-04-09 2011-04-09
# 4: 1003 2014-11-08 2014-11-08
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • good answer but how would I account for the ID field? since the join is based on matching ID and the Date – Xin Feb 26 '19 at 19:42
1

With advent of dplyr 1.1.0 that has gained two helper functions join_by() and closest it is now pretty easy

table1 <- data.frame(id=c(1000,1001,1002,1003), 
                     date=as.POSIXct(c("2012-05-13","2012-09-23","2011-04-09","2014-11-08"))) 

table2 <- data.frame(id2=c(1000,1000,1001,1002,1003,1003), 
                     date2=as.POSIXct(c("2012-05-13","2012-05-16","2012-09-24","2011-04-15","2014-11-09", "2014-11-10"))) 
library(dplyr)
#> 

table1
#>     id       date
#> 1 1000 2012-05-13
#> 2 1001 2012-09-23
#> 3 1002 2011-04-09
#> 4 1003 2014-11-08
table2
#>    id2      date2
#> 1 1000 2012-05-13
#> 2 1000 2012-05-16
#> 3 1001 2012-09-24
#> 4 1002 2011-04-15
#> 5 1003 2014-11-09
#> 6 1003 2014-11-10

table1 %>% 
  left_join(table2, by = join_by(id == id2,
                                 closest(date <= date2)))
#>     id       date      date2
#> 1 1000 2012-05-13 2012-05-13
#> 2 1001 2012-09-23 2012-09-24
#> 3 1002 2011-04-09 2011-04-15
#> 4 1003 2014-11-08 2014-11-09

Created on 2023-04-07 with reprex v2.0.2

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45