Overwrite left_join dplyr to update data

Question

My question is similar to this one however I have additional columns in the LHS that should be kept https://stackoverflow.com/a/35642948/9285732

y is a subset of x with updated values for val1. In x I want to overwrite the relevant values but keep the rest.

Sample data:

library(tidyverse)

x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"), 
            location = c(1,1,1,2,2,3), 
            val1 = 1:6, val2 = 1:6, val3 = 1:6)
y <- tibble(name = c("hans", "dieter", "hans"), 
            location = c(2,2,1), 
            val1 = 10)

> x
# A tibble: 6 x 5
  name   location  val1  val2  val3
  <chr>     <dbl> <int> <int> <int>
1 hans          1     1     1     1
2 dieter        1     2     2     2
3 bohlen        1     3     3     3
4 hans          2     4     4     4
5 dieter        2     5     5     5
6 alf           3     6     6     6

> y
# A tibble: 3 x 3
  name   location  val1
  <chr>     <dbl> <dbl>
1 hans          2    10
2 dieter        2    10
3 hans          1    10

> # desired output
> out
# A tibble: 6 x 5
  name   location  val1  val2  val3
  <chr>     <dbl> <dbl> <int> <int>
1 hans          1    10     1     1
2 dieter        1     2     2     2
3 bohlen        1     3     3     3
4 hans          2    10     4     4
5 dieter        2    10     5     5
6 alf           3     6     6     6

I wrote a function that is doing what I want, however it's quite cumbersome. I wonder if there's a more elegant way or even a dplyr function that I'm unaware of.

overwrite_join <- function(x, y, by = NULL){

  bycols     <- which(colnames(x) %in% by) 
  commoncols <- which(colnames(x) %in% colnames(y))
  extracols  <- which(!(colnames(x) %in% colnames(y)))

  x1 <- anti_join(x, y, by = by) %>% 
    bind_rows(y) %>%
    select(commoncols) %>% 
    left_join(x %>% select(bycols, extracols), by = by)

  out <- x %>% select(by) %>% 
    left_join(x1, by = by)

  return(out)
}

overwrite_join(t1, t2, by = c("name", "location"))

warnbergg · Accepted Answer · 2019-12-02T13:11:26.250

You could do something along the lines of

> x %>%
    left_join(y = y, by = c("name", "location")) %>%
    within(., val1.x <- ifelse(!is.na(val1.y), val1.y, val1.x)) %>%
    select(-val1.y)
# # A tibble: 6 x 5
#   name   location val1.x  val2  val3
#   <chr>     <dbl>  <dbl> <int> <int>
# 1 hans          1     10     1     1
# 2 dieter        1      2     2     2
# 3 bohlen        1      3     3     3
# 4 hans          2     10     4     4
# 5 dieter        2     10     5     5
# 6 alf           3      6     6     6

and then rename val1.x.

moodymudskipper · Answer 2 · 2019-12-02T13:01:09.297

My package safejoin might help. Only available on github so far but has a feature designed just for that.

The conflict argument below must be fed a function or lambda to deal with conflicting columns when joining, here we want in priority a value from the y data frame so we can use dplyr::coalesce() there. Note that we must first coerce y$val1 as in your example it's double while x$val1 is integer. Your real case might not need this step.

# remotes::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
y$val1 <- as.integer(y$val1)
safe_left_join(x, y, by = c("name", "location"), conflict = ~coalesce(.y, .x))
#> # A tibble: 6 x 5
#>   name   location  val1  val2  val3
#>   <chr>     <dbl> <int> <int> <int>
#> 1 hans          1    10     1     1
#> 2 dieter        1     2     2     2
#> 3 bohlen        1     3     3     3
#> 4 hans          2    10     4     4
#> 5 dieter        2    10     5     5
#> 6 alf           3     6     6     6

Edit : inspired by your own solution here's a 100% dplyr option that you might like better, just like your option though it's not a proper join!

bind_rows(y, x) %>%
  group_by(name, location) %>%
  summarize_all(~na.omit(.x)[[1]]) %>%
  ungroup()
#> # A tibble: 6 x 5
#>   name   location  val1  val2  val3
#>   <chr>     <dbl> <dbl> <int> <int>
#> 1 alf           3     6     6     6
#> 2 bohlen        1     3     3     3
#> 3 dieter        1     2     2     2
#> 4 dieter        2    10     5     5
#> 5 hans          1    10     1     1
#> 6 hans          2    10     4     4

score 2 · Answer 3 · answered Dec 02 '19 at 12:22

Try dplyr::coalesce

x %>%
  left_join(y, by = c("name", "location")) %>%
  mutate(val1 = coalesce(val1.y, val1.x)) %>%
  select(-val1.x, -val1.y)


# A tibble: 6 x 5
  name   location  val2  val3  val1
  <chr>     <dbl> <int> <int> <int>
1 hans          1     1     1    10
2 dieter        1     2     2     2
3 bohlen        1     3     3     3
4 hans          2     4     4    10
5 dieter        2     5     5    10
6 alf           3     6     6     6

mpettis · Answer 4 · 2019-12-12T19:50:12.307

This is the idiom I now use. It does not preserve the row or column order in x, if that is important.

I like it because I can evaluate the values to just before the bind_rows(), do a visual inspection, and if I like it, put the fixed rows back onto the base dataframe.

library(dplyr)

x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"), 
            location = c(1,1,1,2,2,3), 
            val1 = 1:6, val2 = 1:6, val3 = 1:6)
y <- tibble(name = c("hans", "dieter", "hans"), 
            location = c(2,2,1), 
            val1 = 10)

keys <- c("name", "location")

out <- x %>%
    semi_join(y, keys) %>%
    select(-matches(setdiff(names(y), keys))) %>%
    left_join(y, keys) %>%
    bind_rows(x %>% anti_join(y, keys))

out %>%
    print()
#> # A tibble: 6 x 5
#>   name   location  val2  val3  val1
#>   <chr>     <dbl> <int> <int> <dbl>
#> 1 hans          1     1     1    10
#> 2 hans          2     4     4    10
#> 3 dieter        2     5     5    10
#> 4 dieter        1     2     2     2
#> 5 bohlen        1     3     3     3
#> 6 alf           3     6     6     6

^{Created on 2019-12-12 by the reprex package (v0.3.0)}

Overwrite left_join dplyr to update data

4 Answers4

Linked