R dplyr full_join - no common key, need common columns to blend together

Question

I have these two dataframes for example:

dates = c('2020-11-19', '2020-11-20', '2020-11-21')
df1 <- data.frame(dates, area = c('paris', 'london', 'newyork'), 
                  rating = c(10, 5, 6),
                  rating2 = c(5, 6, 7))

df2 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'), 
                  rating = c(1, 2, 1))

> df1
       dates    area rating rating2
1 2020-11-19   paris     10       5
2 2020-11-20  london      5       6
3 2020-11-21 newyork      6       7
> df2
       dates     area rating
1 2020-11-19 budapest      1
2 2020-11-20   moscow      2
3 2020-11-21 valencia      1

When performing an outer join using dplyr:

df <- df1 %>%
  full_join(df2, by = c('dates', 'area'))

the result is like this:

       dates     area rating.x rating2 rating.y
1 2020-11-19    paris       10       5       NA
2 2020-11-20   london        5       6       NA
3 2020-11-21  newyork        6       7       NA
4 2020-11-19 budapest       NA      NA        1
5 2020-11-20   moscow       NA      NA        2
6 2020-11-21 valencia       NA      NA        1

i.e. the rating columns from the two dataframes are not blended together but two separate columns are created.

How do I get a result like this?

       dates     area rating   rating2 
1 2020-11-19    paris       10       5       
2 2020-11-20   london        5       6       
3 2020-11-21  newyork        6       7       
4 2020-11-19 budapest        1      NA        
5 2020-11-20   moscow        2      NA        
6 2020-11-21 valencia        1      NA

Thanks to the solutions provided by @kybazzi, the desired result was obtained.

df <- df1 %>%
  bind_rows(df2)

FOLLOW-UP

As a follow-up question, I would like to join the following to the joined dataframe:

df3 <- data.frame(dates, area = c('budapest', 'moscow', 'valencia'), 
                  rating2 = c(3, 2, 5))

Using the same method, the result is this:

> df_final <- df %>%
+     bind_rows(df3)
> df_final
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA
7 2020-11-19 budapest     NA       3
8 2020-11-20   moscow     NA       2
9 2020-11-21 valencia     NA       5

How do i get a result like this:

       dates     area rating   rating2 
1 2020-11-19    paris       10       5       
2 2020-11-20   london        5       6       
3 2020-11-21  newyork        6       7       
4 2020-11-19 budapest        1       3        
5 2020-11-20   moscow        2       2        
6 2020-11-21 valencia        1       5

Are there any keys (e.g. city names) that *are* shared between the two data frames? This looks like you could specify rating2 = NA for df2 and use `rbind` to call it a day rather than fiddling with merges — Dubukay, Dec 17 '21 at 18:18
potential duplicate https://stackoverflow.com/questions/27167151/merge-combine-columns-with-same-name-but-incomplete-data — Ben Bolker, Dec 17 '21 at 19:42

kybazzi · Accepted Answer · 2021-12-17T18:53:41.930

What you're looking for is dplyr::bind_rows(), which will preserve common columns and fill NA for columns that only exist in one of the data frames:

> bind_rows(df1, df2)
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

Note that you could also continue using full_join() - but you must ensure that all common columns between the data frames are included as keys if you don't want columns to be split:

> full_join(
+   df1, df2,
+   by = c("dates", "area", "rating")
+ )
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

The documentation for dplyr joins mentions:

Output columns include all x columns and all y columns. If columns in x and y have the same name (and aren't included in by), suffixes are added to disambiguate.

You could also avoid this issue by not specifying by, in which case dplyr will use all common columns.

> full_join(df1, df2)
Joining, by = c("dates", "area", "rating")
       dates     area rating rating2
1 2020-11-19    paris     10       5
2 2020-11-20   london      5       6
3 2020-11-21  newyork      6       7
4 2020-11-19 budapest      1      NA
5 2020-11-20   moscow      2      NA
6 2020-11-21 valencia      1      NA

As far as I know, both methods are good for your use case. In fact, I believe that the practical advantage full_join() has over bind_rows() is precisely this behaviour you wanted to avoid here, i.e. splitting columns that aren't keys.

For anyone unable to find it, the `bind_rows` function comes from the `dplyr` package! — Dubukay, Dec 17 '21 at 18:33
@kybazzi Always be sure to include all necessary libraries in your answer. — AndrewGB, Dec 17 '21 at 18:35

R dplyr full_join - no common key, need common columns to blend together

1 Answers1