I have a recurring problem with coalescing the non-NA values of duplicate columns after joins and deleting the duplicates. It is similar to what is described in this question, or this one. I wanted to create a small function around coalesce
(and potentially including left_join
) to deal with this in one line whenever I encounter it (the function itself can of course be as long as needed).
In doing so, I've run into the lack of quo_names
equivalent of quos
described here.
For a reprex, take a dataframe with identifying information to be joined with others that contain correct values but often misspelled IDs.
library(dplyr)
library(rlang)
iris_identifiers <- iris %>%
select(contains("Petal"), Species)
iris_alt_name1 <- iris %>%
mutate(Species = recode(Species, "setosa" = "stosa"))
iris_alt_name2 <- iris %>%
mutate(Species = recode(Species, "versicolor" = "verscolor"))
This simpler function works:
replace_xy <- function(df, var) {
x_var <- paste0(var, ".x")
y_var <- paste0(var, ".y")
df %>%
mutate(!! quo_name(var) := coalesce(!! sym(x_var), !! sym(y_var))) %>%
select(-(!! sym(x_var)), -(!! sym(y_var)))
}
iris_full <- iris_identifiers %>%
left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
replace_xy("Sepal.Length") %>%
replace_xy("Sepal.Width")
head(iris_full)
#> Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#> 1 1.4 0.2 setosa 5.1 3.5
#> 2 1.4 0.2 setosa 4.9 3.0
#> 3 1.4 0.2 setosa 5.0 3.6
#> 4 1.4 0.2 setosa 4.4 2.9
#> 5 1.4 0.2 setosa 5.2 3.4
#> 6 1.4 0.2 setosa 5.5 4.2
But I am a bit lost as to how to implement the generalisation for several variables, which I thought would be the easier part. The snippet below is just a desperate attempt --after trying a number of variations-- which roughly captures what I'm trying to achieve.
replace_many_xy <- function(df, vars) {
x_var <- paste0(vars, ".x")
y_var <- paste0(vars, ".y")
df %>%
mutate_at(vars(vars), funs(replace_xy(.data, .))) %>%
select(-(!!! syms(x_var)), -(!!! syms(y_var)))
}
new_cols <- colnames(iris_alt_name1)
diff_cols <- new_cols [!(new_cols %in% colnames(iris_identifiers))]
iris_full <- iris_identifiers %>%
left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
replace_many_xy(diff_cols)
#> Warning: Column `Species` joining factors with different levels, coercing
#> to character vector
#> Warning: Column `Species` joining character vector and factor, coercing
#> into character vector
#> Error: Unknown columns `Sepal.Length` and `Sepal.Width`
Any help would be much appreciated!!