0

I'm using R plyr's join() function to join two dataframes by columns, "country" and "year," over certain tax treaties. However, the "country" and "year" columns of the two dataframes carry different length with different number of unique values, something like the following. df1 =

country year    treaty_id
A   1995    1
B   1995    1
C   1995    1
D   1995    1
A   2001    2
B   2001    2
D   2001    2
C   2015    3
D   2015    3

while df2 has the complete lists of "country" and "year" values

country year
A   1992
B   1992
C   1992
D   1992
E   1992
F   1992
G   1992
H   1992
A   1993
B   1993
C   1993
…   …
G   2015
H   2015

for "country" A to H and "year" 1992 to 2015. Is there anyway to match df1 and df2 by the "year" information given in df1 (when a particular tax treaty was signed) and expand the list of country from A to H grouped by treaty_id. (Since there may be more than one treaties signed in the same year, so it needs to be grouped by treaty_id) Such that I can generate another dummy variable member to indicate each country's membership in a tax treaty. The output should look something like this

country year    treaty_id   member
A   1995    1   1
B   1995    1   1
C   1995    1   1
D   1995    1   1
E   1995    1   0
F   1995    1   0
G   1995    1   0
H   1995    1   0
A   2001    2   1
B   2001    2   1
C   2001    2   0
D   2001    2   1
E   2001    2   0
F   2001    2   0
G   2001    2   0
H   2001    2   0
…   …   …   
H   2015    n   1

It will really appreciated if anyone could enlighten me on this.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Chris T.
  • 1,699
  • 7
  • 23
  • 45
  • 2
    `merge(df2, df1, all.x=TRUE)` ? – jogo Mar 12 '18 at 10:14
  • The question I asked is entirely different from the one you flagged. Also, the `merge()` function did not address what I asked for. – Chris T. Mar 12 '18 at 12:47
  • `df2 <- expand.grid(country=LETTERS[1:8], year=1992:2015); df3 <- expand.grid(country=LETTERS[1:8], year=unique(df1$year)); df4 <- merge(df3, df1, all.x=TRUE); df4 <- df4[order(df4$year, df4$country),]; df4$member <- !is.na(df4$treaty_id)` and then: https://stackoverflow.com/questions/23583739/replace-na-value-with-the-group-value – jogo Mar 12 '18 at 13:20
  • `library("data.table"); setDT(df1); dtyear <- df1[, .(treaty_id=min(treaty_id)), year]; dt2 <- setorderv(CJ(country=LETTERS[1:8], year=df1[, year], unique = TRUE), c("year", "country")); dt2[df1, on=c("country", "year"), treaty_id:=i.treaty_id]; dt2[, member:=!is.na(treaty_id)][dtyear, on="year", treaty_id:=i.treaty_id][]` – jogo Mar 13 '18 at 07:43
  • Thanks for your reply and sorry for belated response. I applied `unique()` to get unique `treaty_id` and `year` combination (such that each treaty and the year it occurred is uniquely identified), and then using `merge(x, y, by = "treaty_id")` to fill in the corresponding year information " for each `treaty_id`. Could you cut-and-paste your earlier suggestion as a formal response under my question, so I can mark it as correct answer which might be useful for future readers? – Chris T. Mar 13 '18 at 17:10

0 Answers0