2

I have two datasets which have both the same row combinations Country & Year and I would like to add some columns from one dataset to the other one in a way that the row combinations match.

Dataset 1:

+----------+------+---------+---------+-----+
| Country  | Year | exports | imports | ... |
+----------+------+---------+---------+-----+
| Germany  | 2000 | 0.70    | 0.40    | ... |
| Germany  | 2001 | 0.68    | 0.41    | ... |
| Germany  | 2002 | 0.71    | 0.48    | ... |
| Germany  | 2003 | ...     | ...     | ... |
| Spain    | 2000 | 0.51    | 0.56    | ... |
| Spain    | 2001 | 0.48    | 0.50    | ... |
| Spain    | 2002 | 0.50    | 0.53    | ... |
| Spain    | 2003 | ...     | ...     | ... |
| ...      | ...  | ...     | ...     | ... |
+----------+------+---------+---------+-----+

Dataset 2:

+----------+-----+------+--------------+-------+-----+
| Country  | CC  | Year | unemployment | Pop   | ... |
+----------+-----+------+--------------+-------+-----+
| Germany  | GER | 2000 | 0.03         | 79.50 | ... |
| Germany  | GER | 2001 | 0.05         | 79.53 | ... |
| Germany  | GER | 2002 | 0.04         | 79.80 | ... |
| Germany  | GER | 2003 | ...          | ...   | ... |
| Hungary  | HUN | 2000 | ...          | ...   | ... |
| Hungary  | HUN | 2001 | ...          | ...   | ... |
| Hungary  | HUN | 2002 | ...          | ...   | ... |
| Hungary  | HUN | 2003 | ...          | ...   | ... |
| Spain    | ESP | 2000 | 0.08         | 40.2  | ... |
| Spain    | ESP | 2001 | 0.11         | 40.5  | ... |
| Spain    | ESP | 2002 | 0.10         | 40.55 | ... |
| Spain    | ESP | 2003 | ...          | ...   | ... |
| ...      | ... | ...  | ...          | ...   | ... |
+----------+-----+------+--------------+-------+-----+

I want the merged data to look like this:


+----------+-----+------+---------+---------+--------------+-------+-----+
| Country  | CC  | Year | exports | imports | unemployment | Pop   | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+
| Germany  | GER | 2000 | 0.70    | 0.40    | 0.03         | 79.50 | ... |
| Germany  | GER | 2001 | 0.68    | 0.41    | 0.05         | 79.53 | ... |
| Germany  | GER | 2002 | 0.71    | 0.48    | 0.04         | 79.80 | ... |
| Germany  | GER | 2003 | ...     | ...     | ...          | ...   | ... |
| Spain    | ESP | 2000 | 0.51    | 0.56    | 0.08         | 40.2  | ... |
| Spain    | ESP | 2001 | 0.48    | 0.50    | 0.11         | 40.5  | ... |
| Spain    | ESP | 2002 | 0.50    | 0.53    | 0.10         | 40.55 | ... |
| Spain    | ESP | 2003 | ...     | ...     | ...          | ...   | ... |
| ...      | ... | ...  | ...     | ...     | ...          | ...   | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+

So, the countries which are not in dataset 1 (like Hungary in this case) are not in the merged dataset and the country code is also in the new dataset. Could someone tell me how I can achieve this? I have 28 years for about 100 countries each. So using a function in which I have to specify every combination would not be handy...

I tried to merge it with merge()but did not succeed since it just created hundreds of rows with the same country and year combination.

Lila
  • 59
  • 6
  • 2
    "I tried to merge it with `merge()` but did not succeed..." What did you try exactly? `merge` is for sure a sensible option for this task, provided that it is used correctly. Share your attempts and describe why the result is not what you intended. – nicola May 06 '19 at 13:15

3 Answers3

1

You can do this with inner_join() from dplyr package

dplyr::inner_join(df1, df2, by=c("Country", "Year"))
MrNetherlands
  • 920
  • 7
  • 14
1

merge absolutely should work for this. You should specify that you are merging on two columns.

merge( df1 , df2 , by=c( "Country", "Year") )

Also confirm that the class of the merging vars is the same

sapply( df1[, c( "Country", "Year")] , class )
sapply( df2[, c( "Country", "Year")] , class )

confirm that the variables are spelled the same way in both data frames

intersect( names( df1 ) , names( df2 ))

Finally confirm that year and country are unique in both data.frames

sum( duplicated( df1[ ,c( "Country", "Year") ] ))
sum( duplicated( df2[ ,c( "Country", "Year") ] ))
MatthewR
  • 2,660
  • 5
  • 26
  • 37
0

The answer with merge() worked! Now I am facing the problem that e.g. Spain does not have any unemployment data for the year 2000. However, I still want to add all years of Spain and would like to have a NA in the unemployment column for Spain in 2000 in the merged dataset. How can I achieve this?

I tried to use merge(df1, df2, all.x = TRUE) but sometimes it just creates NA's for some reason...

Lila
  • 59
  • 6
  • try `merge(df1, df2, all=TRUE)`. Also see this [post](https://stackoverflow.com/questions/5965698/merge-unequal-dataframes-and-replace-missing-rows-with-0) – mnm May 08 '19 at 00:19