3

When you use a dplyr join function like full_join, columns with identical names are duplicated and given suffixes like "col.x", "col.y", "col.x.x", etc. when they are not used to join the tables.

library(dplyr)
data1<-data.frame(
          Code=c(2,1,18,5), 
          Country=c("Canada", "USA", "Brazil", "Iran"), 
          x=c(50,29,40,29))
data2<-data.frame(
          Code=c(2,40,18),
          Country=c("Canada","Japan","Brazil"), 
          y=c(22,30,94))
data3<-data.frame(
          Code=c(25,14,52),
          Country=c("China","Japan","Australia"), 
          z=c(22,30,94))

data4<-Reduce(function(...) full_join(..., by="Code"), list(data1,data2,data3)) 

This results in "Country", "Country.x", and "Country.y" columns.

Is there a way to combine the three columns into one, such that if a row has NA for a "Country", it takes the value from "Country.x" or "Country.y"?

I attempted a solution based on this similar question, but it gives me a warning and returns only values from the top three rows.

 data4<-Reduce(function(...) full_join(..., by="Code"), list(data1,data2,data3)) %>%
        mutate(Country=coalesce(Country.x,Country.y,Country)) %>%
        select(-Country.x, -Country.y)

This returns the warning invalid factor level, NA generated.

Any ideas?

Community
  • 1
  • 1
la_leche
  • 423
  • 1
  • 7
  • 14
  • 1
    You can remove the `by=Code` to get your desired output. Then, it will join base on `Code` and `Country`. The warning is because the class of `Country` is `factor`. You can change it to `character` to prevent the error. – JasonWang Mar 23 '17 at 23:18
  • Add `stringsAsFactors = FALSE` when you create the data frame using `data.frame`, or you can use `data_frame` from the `dplyr` package to prevent columns to be `factor`. – www Mar 23 '17 at 23:21

1 Answers1

1

You could use my package safejoin, make a full join and deal with the conflicts using dplyr::coalesce.

First we'll have to rename the tables to have value columns named the same.

library(dplyr)
data1 <- rename_at(data1,3, ~"value")
data2 <- rename_at(data2,3, ~"value")
data3 <- rename_at(data3,3, ~"value")

Then we can join

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
data1 %>%
  safe_full_join(data2, by = c("Code","Country"), conflict = coalesce) %>%
  safe_full_join(data3, by = c("Code","Country"), conflict = coalesce)
#   Code   Country value
# 1    2    Canada    50
# 2    1       USA    29
# 3   18    Brazil    40
# 4    5      Iran    29
# 5   40     Japan    30
# 6   25     China    22
# 7   14     Japan    30
# 8   52 Australia    94

You get some warnings because you're joining factor columns with different levels, add parameter check="" to remove them.

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167