4

I have two large dataframes, dfA and dfB, for which I have generated simple examples here

dfA = data.frame(id=c("Apple", "Banana", "Carrot", "Dates", "Egg"),
                    Answer_Date=as.Date(c("2013-12-07", "2014-12-07", "2015-12-07", "2016-12-07", "2017-12-07" )),
                    x1 = c(1,  2,  3,  4,  5),
                    x2 = c(10, 20, 30, 40, 50))

    Browse[2]> dfA
      id Answer_Date x1 x2
1  Apple  2013-12-07  1 10
2 Banana  2014-12-07  2 20
3 Carrot  2015-12-07  3 30
4  Dates  2016-12-07  4 40
5    Egg  2017-12-07  5 50

dfB = data.frame(id=c("Apple", "Apple", "Banana", "Banana", "Banana"),
                    Answer_Date=as.Date(c("2013-12-05", "2014-12-07", "2015-12-10", "2018-11-07", "2019-11-07" )),
                    x3 = c(5,  4,  3,  2,  1),
                    x4 = c(50, 40, 30, 20, 10))
Browse[2]> dfB
      id Answer_Date x3 x4
1  Apple  2013-12-05  5 50
2  Apple  2014-12-07  4 40
3 Banana  2014-12-10  3 30
4 Banana  2018-11-07  2 20
5 Banana  2019-11-07  1 10

I'd like to merge them by the closest date so that I get the items that exist in both dfA and dfB matched exactly by id and as closely as possible by Answer_Date (i.e minimum absolute value of date difference between the two dates). In this case I'd like to get

dfC
      id Answer_Date.x Answer_Date.y x1 x2 x3 x4
1  Apple    2013-12-07    2013-12-05  1 10  5 50
2 Banana    2014-12-07    2014-12-10  2 20  3 30

Unfortunately struggling with merge() and trying out various solutions that I have found on StackOverflow hasn't solved my problem and has only got me confused. Would someone kindly point me to the right solution, ideally with a simple explanation as to why it works?

Sincerely and with many thanks in advance

Thomas Philips

Thomas Philips
  • 935
  • 2
  • 11
  • 22
  • Have you tried these examples https://stackoverflow.com/questions/28072542/merge-nearest-date-and-related-variables-from-a-another-dataframe-by-group and https://stackoverflow.com/questions/49156597/merge-data-based-on-nearest-date-r – Ronak Shah Sep 05 '20 at 07:26
  • I did indeed, and got nowhere. The first example just generates an error message: `Error in `$<-.data.frame`(`*tmp*`, indices, value = c(1L, 3L, 4L, 5L, : replacement has 5 rows, data has 1`, while the second gives me 5 rows, Only Apple and Banana are taken from dfA, but all the values from dfB show up. – Thomas Philips Sep 05 '20 at 08:11
  • Got the first to work - thanks for the pointer! I had made a subtle error when copying the code. The second still does not work. I can't say that I understand it, but I'm going to go over line by line as I execute it – Thomas Philips Sep 05 '20 at 08:52
  • I think this is an excellent example from the OP; but I'm inclined to think this is a duplicate; the introduction of `dplyr::join_by()` in this answer makes me think it is a good one to keep: https://dplyr.tidyverse.org/dev/reference/join_by.html – Michael Roswell Apr 18 '23 at 14:22

1 Answers1

2

Left join dfB to dfA, take the difference between dates per row and choose the smallest diff per id.

left_join(dfA, dfB, by = "id") %>%
  mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
  group_by(id) %>%
  filter(date_diff == min(date_diff)) %>%
  select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)

Then output is:

# A tibble: 2 x 8
# Groups:   id [2]
  id     Answer_Date.x Answer_Date.y    x1    x2    x3    x4 date_diff
  <fct>  <date>        <date>        <dbl> <dbl> <dbl> <dbl> <drtn>   
1 Apple  2013-12-07    2013-12-05        1    10     5    50 2 days   
2 Banana 2014-12-07    2014-12-10        2    20     3    30 3 days 

By the way, in your sample code the third Answer_Date in the definition of dfB should be "2014-12-10" instead of "2015-12-10".

teru
  • 308
  • 2
  • 7