3

I have a recurring problem with coalescing the non-NA values of duplicate columns after joins and deleting the duplicates. It is similar to what is described in this question, or this one. I wanted to create a small function around coalesce (and potentially including left_join) to deal with this in one line whenever I encounter it (the function itself can of course be as long as needed).

In doing so, I've run into the lack of quo_names equivalent of quos described here.

For a reprex, take a dataframe with identifying information to be joined with others that contain correct values but often misspelled IDs.

library(dplyr)
library(rlang)

iris_identifiers <- iris %>% 
  select(contains("Petal"), Species)

iris_alt_name1 <- iris %>% 
  mutate(Species = recode(Species, "setosa" = "stosa")) 

iris_alt_name2 <- iris %>%
  mutate(Species = recode(Species, "versicolor" = "verscolor"))

This simpler function works:

replace_xy <- function(df, var) {

  x_var <- paste0(var, ".x")
  y_var <- paste0(var, ".y")

  df %>% 
    mutate(!! quo_name(var) := coalesce(!! sym(x_var), !! sym(y_var))) %>% 
    select(-(!! sym(x_var)), -(!! sym(y_var)))

}


iris_full <- iris_identifiers %>% 
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  replace_xy("Sepal.Length") %>% 
  replace_xy("Sepal.Width")


head(iris_full)
#>   Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#> 1          1.4         0.2  setosa          5.1         3.5
#> 2          1.4         0.2  setosa          4.9         3.0
#> 3          1.4         0.2  setosa          5.0         3.6
#> 4          1.4         0.2  setosa          4.4         2.9
#> 5          1.4         0.2  setosa          5.2         3.4
#> 6          1.4         0.2  setosa          5.5         4.2

But I am a bit lost as to how to implement the generalisation for several variables, which I thought would be the easier part. The snippet below is just a desperate attempt --after trying a number of variations-- which roughly captures what I'm trying to achieve.

replace_many_xy <- function(df, vars) {

  x_var <- paste0(vars, ".x")
  y_var <- paste0(vars, ".y")

  df %>% 
    mutate_at(vars(vars), funs(replace_xy(.data, .))) %>% 
    select(-(!!! syms(x_var)), -(!!! syms(y_var)))

}

new_cols <- colnames(iris_alt_name1)
diff_cols <- new_cols [!(new_cols %in% colnames(iris_identifiers))]

iris_full <- iris_identifiers %>% 
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>% 
  replace_many_xy(diff_cols)
#> Warning: Column `Species` joining factors with different levels, coercing
#> to character vector

#> Warning: Column `Species` joining character vector and factor, coercing
#> into character vector
#> Error: Unknown columns `Sepal.Length` and `Sepal.Width`

Any help would be much appreciated!!

Fons MA
  • 1,142
  • 1
  • 12
  • 21

1 Answers1

5

We can use {powerjoin} :

library(powerjoin)
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  power_left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width"), conflict  = coalesce_xy) %>%
  head()

iris_full
#   Petal.Length Petal.Width Species Sepal.Length Sepal.Width
# 1          1.4         0.2  setosa          5.1         3.5
# 2          1.4         0.2  setosa          4.9         3.0
# 3          1.4         0.2  setosa          5.0         3.6
# 4          1.4         0.2  setosa          4.4         2.9
# 5          1.4         0.2  setosa          5.2         3.4
# 6          1.4         0.2  setosa          5.5         4.2

power_left_join is an improved left_join that allows among other things some ways to deal with column conflicts through the conflict argument as we did here.

The conflict argument is a function that takes pairs of conflicted columns one after another, to coalesce from the right can use need conflict = coalesce_yx


And here is a way to make your function work:

replace_many_xy <- function(tbl, vars){
  for(var in vars){
    x <- paste0(var,".x")
    y <-  paste0(var,".y")
    tbl <- mutate(tbl, !!sym(var) := coalesce(!!sym(x) , !!sym(y) )) %>%
     select(-one_of(x,y))
  }
  tbl
}
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_many_xy(diff_cols) %>% as_tibble()
# # A tibble: 372 x 5
#    Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#           <dbl>       <dbl> <chr>          <dbl>       <dbl>
#  1          1.4         0.2 setosa           5.1         3.5
#  2          1.4         0.2 setosa           4.9         3  
#  3          1.4         0.2 setosa           5           3.6
#  4          1.4         0.2 setosa           4.4         2.9
#  5          1.4         0.2 setosa           5.2         3.4
#  6          1.4         0.2 setosa           5.5         4.2
#  7          1.4         0.2 setosa           4.6         3.2
#  8          1.4         0.2 setosa           5           3.3
#  9          1.4         0.2 setosa           5.1         3.5
# 10          1.4         0.2 setosa           4.9         3  
# # ... with 362 more rows
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • Thanks Moody! Your package sounds quite interesting, I've downloaded it and will be testing it soon! Sorry I was away from this project for a couple of days... I'm not sure I understand the last `mutate_at`... and I get the following: "Error in mutate_impl(.data, dots) : Column `Sepal.Length.x` is of unsupported type NULL". What am I missing? – Fons MA Feb 27 '19 at 06:38
  • 1
    Thanks @Fons-MA , I'm still working on the features and interface and feedback would be extremely precious. As for the issue with `mutate_at` I've experienced it later but forgot I had used it here, it doesn't work with all versions of dplyr and is supposed to drop the suffixed colums. I'll update it with something more robust asap – moodymudskipper Feb 27 '19 at 07:33
  • 1
    There you go, I replaced it with `select(-one_of(x,y))` which I think is stable among versions – moodymudskipper Feb 27 '19 at 12:55
  • Ah, of course... I was not thinking properly... I updated `dplyr` from 0.7 to 0.8 and it all works but I think I'd use a `select` drop just for clarity. Any reasons to prefer the `mutate` option? On your package, I only wish I'd seen it a few months ago! Most of the mergers are done now (with horrible checking functions I concocted as I went), so I may only use it in the near future if I revise some parts of the code. I'll try to keep you updated. – Fons MA Feb 27 '19 at 23:31
  • I'm just used to set columns to `NULL` to remove them, and the least embedded parentheses the better, but this example shows that it's bad practice to do it with `mutate_at` as it's not robust to not so old versions. – moodymudskipper Feb 28 '19 at 00:04