merge by nearest neighbour in group - R

Question

I have two country*year level datasets that cover the same countries but in different years. I would like to merge the two in a way that year is matched with its nearest neighbor, always within country (iso2code).

The first (dat1) looks like this (showing here only the head for AT, but iso2code has multiple different values):

  iso2code  year elect_polar_lrecon
  <chr>    <dbl>              <dbl>
1 AT        1999               2.48
2 AT        2002               4.18
3 AT        2006               3.66
4 AT        2010               3.91
5 AT        2014               4.01
6 AT        2019               3.55

The second (dat2) looks like this:

  iso2code  year affpol
  <chr>    <dbl>  <dbl>
1 AT        2008   2.47
2 AT        2013   2.49
3 DE        1998   2.63
4 DE        2002   2.83
5 DE        2005   2.89
6 DE        2009   2.09

In the end I would like to have something like (note that the value of affpol for 2008 could be matched both with 2010 and with 2006 as it is equally distant from both. If possible, I would go for the most recent date, as it is below):

  iso2code  year.1 elect_polar_lrecon year.2 affpol
  <chr>    <dbl>                <dbl>  <dbl>  <dbl>
1 AT        1999                 2.48  
2 AT        2002                 4.18  
3 AT        2006                 3.66  
4 AT        2010                 3.91  2008    2.47
5 AT        2014                 4.01  2013    2.49
6 AT        2019                 3.55

Not sure about how to do this... I am happy for a tidyverse solution, but really, all help is much appreciated!

Should be straightforward with a rolling join in `data.table`. See e.g. [Matching timestamped data to closest time in another dataset. Properly vectorized? Faster way?](https://stackoverflow.com/questions/31103897/matching-timestamped-data-to-closest-time-in-another-dataset-properly-vectorize/31104160#31104160). Just search the R tag for "data.table roll nearest" for several other nice posts. — Henrik, Jul 28 '21 at 16:19

score 1 · Accepted Answer · answered Jul 29 '21 at 07:22

As mentioned by Henrik, this can be solved by updating in a rolling join to the nearest which is available in the data.table package. Additionally, the OP has requested to go for the most recent date if matches are equally distant.

library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"), 
     `:=`(year.2 = i.year, affpol = i.affpol)]

dat1

   iso2code year elect_polar_lrecon year.2 affpol
1:       AT 1999               2.48     NA     NA
2:       AT 2002               4.18     NA     NA
3:       AT 2006               3.66   2008   2.47
4:       AT 2010               3.91     NA     NA
5:       AT 2014               4.01   2013   2.49
6:       AT 2019               3.55     NA     NA

This operation has updated dat1 by reference, i.e., without copying the whole data object by adding two additional columns.

Now, the OP has requested to go for the most recent date if matches are equally distant but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.

The workaround is to create a helper variable nyear which holds the negative year and to join on this:

setDT(dat1)[, nyear := -year][setDT(dat2)[, nyear := -year], 
                              roll = "nearest", on = c("iso2code", "nyear"), 
                             `:=`(year.2 = i.year, affpol = i.affpol)][
                               , nyear := NULL]
dat1

   iso2code year elect_polar_lrecon year.2 affpol
1:       AT 1999               2.48     NA     NA
2:       AT 2002               4.18     NA     NA
3:       AT 2006               3.66     NA     NA
4:       AT 2010               3.91   2008   2.47
5:       AT 2014               4.01   2013   2.49
6:       AT 2019               3.55     NA     NA

score 0 · Answer 2 · answered Jul 28 '21 at 17:36

I figured it out with the help of a friend. I leave it here in case anyone else is looking for a solution. Assuming that the first dataset is to_plot and the second is called to_plot2. Then:

find_nearest_year <- function(p_year, p_code){

years <- to_plot$year[to_plot$iso2code==p_code] 

nearest_year <- years[1]

for (i in sort(years, decreasing = TRUE)) {
  if (abs(i - p_year) < abs(nearest_year-p_year)) {
    nearest_year <- i
  }
}

return(nearest_year)
  
}

to_plot2 <- to_plot2 %>%
  group_by(iso2code, year) %>%
  mutate(matching_year=find_nearest_year(year, iso2code))

merged <- left_join(to_plot, to_plot2, by=c("iso2code", "year"="matching_year"))

merge by nearest neighbour in group - R

2 Answers2