Joining data frames that share columns, but differ in their observation units

Question

Unfortunetely, I don't know how to produce a hypothetical dataset to illustrate my question. So I'll just describe what I'm trying to do and hope someone understands.

I have two datasets.

df1: it records which political party is in power for each country in each year. For example: the Republican Party was in power from 2017 to 2021 in US.
df2: it records how Political Parties changes its profile through time. For example: it records that party A is ideologically leftist in 1970 and center-leftist in 1980. This df does not have observations for all years. For example, one does not have data about party A's ideology in 1971, it jumps from 1970 directly into 1980.

What I'm trying to do is extracting party's ideology from df2 to df1.

For example, df1 looks like this:

Country | Year | Government's Political Party ID
X       | 1990 | 340
X       | 1991 | 340
X       | 1992 | 340
X       | 1993 | 340

df2 looks like this:

Country | Year | Political Party ID | Ideology
X       | 1970 | 340                | center
X       | 1985 | 340                | center
X       | 1992 | 340                | center-left
X       | 1999 | 340                | center-left

df1 and df2 use the same codification of Political Parties. So you may be thinking: well, why not just perform dplyr::left_join() ?

If I do this, my data will look like this:

Country | Year | Government's Political Party ID | Ideology
X       | 1990 | 340                             | NA
X       | 1991 | 340                             | NA
X       | 1992 | 340                             | center-left
X       | 1993 | 340                             | NA

It would match country, year and ID but produce an unsatisfactory result.

Instead, I want my result to look like this:

Country | Year | Government's Political Party ID | Ideology
X       | 1990 | 340                             | center
X       | 1991 | 340                             | center-left
X       | 1992 | 340                             | center-left
X       | 1993 | 340                             | center-left

df2 does not have an X-1990 observation, but it can be inferred that 340's ideology has been center since 1985.

How do I perform this?

It seems like you are looking for joining based on ranges of years right? [Join specifications](https://dplyr.tidyverse.org/reference/join_by.html) or these answers might help you [1](https://stackoverflow.com/questions/51899352/r-dplyr-join-on-range-of-dates?noredirect=1&lq=1) ; [2](https://stackoverflow.com/questions/12030932/rolling-joins-data-table-in-r). It might be helpful to make the example data you show as a `data.table` and provide the output of `dput()` to help someone quickly load it up to help answer this — Prashant Bharadwaj, Aug 08 '23 at 22:43
Just to clarify "This df does not have observations for all years, because it only documents changes" - in your example `df2` Ideology does not change in 1985 - is this intentional? And the column names for political party ID differ between `df1` and `df2` as in your examples? — neilfws, Aug 08 '23 at 22:47
Yes, It is intentional, sorry, mu explanation was a bit confusing. The columns differ in my example and in my datasets. But this is not a major problem because I can rename the columns before performing any join. — Pedro Cardoso, Aug 08 '23 at 23:02
OK. It will help to make the example data consistent. Currently years 1985-1991 in `df2` are "center", but 1991 in the example desired output is "center-left". — neilfws, Aug 08 '23 at 23:09

Prashant Bharadwaj · Answer 1 · 2023-08-08T23:19:23.053

Try this as the starting point - I used rolling join within the by = option of left_join to match the closest year in data 1 that is higher than data 2's year.

And do try to provide data for people to reproduce your question in the future. I copy pasted your data table, and hand-made to R format.

library(tidyverse)

df1 <- 
  tribble(
  ~Country, ~Year1, ~`Government_Political Party ID`,
'x'      , 1990, 340,
'x'      , 1991, 340,
'x'      , 1992, 340,
'x'      , 1993, 340)


df2 <- 
  tribble(
    ~Country, ~Year2, ~`Political Party ID`, ~Ideology,
    'x'      , 1970, 340               , 'center',
    'x'      , 1985, 340               , 'center',
    'x'      , 1992, 340               , 'center-left',
    'x'      , 1999, 340               , 'center-left'
  )

left_join(df1, df2,
          by = join_by(Country, 
                       'Government_Political Party ID' == 'Political Party ID',
                       closest(Year1 >= Year2)))
#> # A tibble: 4 × 5
#>   Country Year1 `Government_Political Party ID` Year2 Ideology   
#>   <chr>   <dbl>                           <dbl> <dbl> <chr>      
#> 1 x        1990                             340  1985 center     
#> 2 x        1991                             340  1985 center     
#> 3 x        1992                             340  1992 center-left
#> 4 x        1993                             340  1992 center-left

^{Created on 2023-08-08 with reprex v2.0.2}

This is great. Thank you very much! I cant believe there was an answear so simple as that. I love you, bro. I love you. — Pedro Cardoso, Aug 09 '23 at 00:28
Glad I could help! Do take a look at the other answers up here and choose one as the correct answer so this question is marked as answered to help future people stumbling upon it (_by clicking on the tick/check mark_) — Prashant Bharadwaj, Aug 09 '23 at 02:38

score 0 · Answer 2 · answered Aug 08 '23 at 23:07

Here is one possible solution using data.table's roll argument.

This takes into consideration only the times where the ideology has actually changed. In your expected result where Year = 1991, I believe it should still be "center" until there was a change to "center-left" in 1992.

The example below may need more fine tuning depending on your actual data.

library(data.table)
df1 = data.table(Country = rep("X", 4),
                 Year = 1990:1993,
                 PP_ID = rep(340, 4))

df2 <- data.table(Country = rep("X", 4),
                 Year = c(1970, 1985, 1992, 1999),
                 PP_ID = rep(340, 4),
                 Ideology = c("center","center", "center-left", "center-left"))

df2[df1, .(Country, Year, PP_ID = i.PP_ID, Ideology), on = .(Country,Year), roll=TRUE]

   Country Year PP_ID    Ideology
1:       X 1990   340      center
2:       X 1991   340      center
3:       X 1992   340 center-left
4:       X 1993   340 center-left

score 0 · Answer 3 · answered Aug 08 '23 at 23:16

One way is to expand the range of years in df2 using tidyr::complete(), then use tidyr::fill() to fill the other variables, and then do the join.

library(dplyr)
library(tidyr)

df2 %>% 
  complete(Year = seq(min(Year), max(Year), 1)) %>% 
  fill(Country, `Political Party ID`, Ideology) %>% 
  right_join(df1, 
             by = c("Year", 
                    "Country", 
                    "Political Party ID" = "Government's Political Party ID"))

Result (using your original examples where 1991 would be "center"):

# A tibble: 4 × 4
   Year Country `Political Party ID` Ideology   
  <dbl> <chr>                  <dbl> <chr>      
1  1990 X                        340 center     
2  1991 X                        340 center     
3  1992 X                        340 center-left
4  1993 X                        340 center-left

Data:

df1 <- structure(list(Country = c("X", "X", "X", "X"), 
                      Year = 1990:1993, 
                      `Government's Political Party ID` = c(340, 340, 340, 340)),
                 class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(Country = c("X", "X", "X", "X"), 
                      Year = c(1970, 1985, 1992, 1999), 
                      `Political Party ID` = c(340, 340, 340, 340), 
                      Ideology = c("center", "center", "center-left", "center-left")), 
                 class = "data.frame", row.names = c(NA, -4L))

Joining data frames that share columns, but differ in their observation units

3 Answers3