
I have a join problem: the IDs I want to join two data frames on are spread across three possible ID columns, and I'd like rows to join if at least one ID matches. I know the _join and merge functions accept a vector of column names, but is it possible to make the match conditional?

For example, if I have the following two data frames:

df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)


df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)


> df_A
  dta id1 id2 id3
1 FOO abc     def
2 BAR         fgh
3 GOO bcd xyz   

> df_B
  dta id1 id2 id3
1 FUU abc        
2 PAR     xyz    
3 KOO     zzz  

I hope to end up with something like this:

 dta.x dta.y id1  id2  id3  
1 FOO  FUU   abc  ""   def    [matched on id1]
2 BAR  ""    ""   ""   fgh      [unmatched]
3 GOO  PAR   bcd  xyz  ""    [matched on id2]
4 KOO  ""    ""   zzz  ""      [unmatched]

So unmatched dta.x and dta.y values are retained, but where there is a match (rows 1 and 3 above) both are joined in the new table. I have a sense that none of _join, merge, or match will work as is and that I'd need to write a function, but I'm not sure where to start. Any help or ideas appreciated. Thank you.

  • @akrun I already saw the post https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right?noredirect=1&lq=1 but it doesn't address my question of how to join on at least one of several join IDs. I had tried SQL-style joins but wasn't getting where I needed to. If someone can point out which answer in that post is appropriate here, I'll accept this as a dup. Thanks – pocketprotector Oct 10 '19 at 18:53

2 Answers


Basically, you want to join on matching IDs. One way is to reshape the original ID columns into two columns, id_column and id_value. Because you don't want to join on "", I dropped those rows.

library(tidyverse)
df_A_long <- df_A %>%
    pivot_longer(
        cols = -dta,
        names_to = "id_column",
        values_to = "id_value"
    ) %>%
    dplyr::filter(id_value != "")


df_B_long <- df_B %>%
    pivot_longer(
        cols = -dta,
        names_to = "id_column",
        values_to = "id_value"
    ) %>%
    dplyr::filter(id_value != "")

Both long tables are then joined on id_column and id_value. For reference, df_B in long form:

> df_B_long
# A tibble: 3 x 3
  dta   id_column id_value
  <chr> <chr>     <chr>   
1 FUU   id1       abc     
2 PAR   id2       xyz     
3 KOO   id2       zzz 
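
To check which rows pair up, and on which ID column, before any reshaping back to wide form, an inner join on the two long tables may be enough. This is a minimal sketch, assuming the df_A and df_B from the question (rebuilt here so the block runs on its own). The helper to_long, the result name matches, and the .A/.B suffixes are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)

# Hypothetical helper: one row per non-empty (id_column, id_value) pair
to_long <- function(df) {
  df %>%
    pivot_longer(cols = -dta,
                 names_to = "id_column",
                 values_to = "id_value") %>%
    filter(id_value != "")
}

# Keep only the pairs that share an ID value in the same ID column
matches <- inner_join(to_long(df_A), to_long(df_B),
                      by = c("id_column", "id_value"),
                      suffix = c(".A", ".B"))
matches
```

With the example data this should leave two rows: FOO/FUU matched on id1 and GOO/PAR matched on id2, which corresponds to the "[matched on ...]" notes in the question.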

The join itself is straightforward, but getting to your desired output takes some further data wrangling.

df_joined <- df_A_long %>%
    # join using id_column and id_value
    full_join(df_B_long, by = c("id_column","id_value"),suffix = c("1","2")) %>%
    # pivot back to wide format
    pivot_wider(
        id_cols = c(dta1,dta2),
        names_from = id_column,
        values_from = id_value
    ) %>%
    # if dta1 is missing, then in the same row, move value from dta2 to dta1
    mutate(
        dta1_has_value = !is.na(dta1), # helper column
        dta1 = ifelse(dta1_has_value,dta1,dta2),
        dta2 = ifelse(!dta1_has_value & !is.na(dta2),NA,dta2)
    ) %>%
    select(-dta1_has_value) %>%
    group_by(dta1) %>%
    # condense multiple rows into one row
    summarise_all(
        ~ifelse(all(is.na(.x)),"",.x[!is.na(.x)])
    ) %>%
    # reorder columns
    {
        .[sort(colnames(.))]
    }

Result:

> df_joined
# A tibble: 4 x 5
  dta1  dta2  id1   id2   id3  
  <chr> <chr> <chr> <chr> <chr>
1 BAR   ""    ""    ""    fgh  
2 FOO   FUU   abc   ""    def  
3 GOO   PAR   bcd   xyz   ""   
4 KOO   ""    ""    zzz   ""   
yusuzech
library(sqldf)
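# match on any of the three ID columns, never on empty IDs
one <- 
  sqldf("
    select  a.*
            , b.dta as dta_b
    from    df_A a
            left join df_B b
              on  (a.id1 <> '' and a.id1 = b.id1)
                  or (a.id2 <> '' and a.id2 = b.id2)
                  or (a.id3 <> '' and a.id3 = b.id3)
  ")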

two <- 
  sqldf('
    select  b.*
    from    df_B b
            left join one
              on  b.dta = one.dta
                  or b.dta = one.dta_b
    where   one.dta is null
  ')

dplyr::bind_rows(one, two)
#   dta id1 id2 id3 dta_b
# 1 FOO abc     def   FUU
# 2 BAR         fgh  <NA>
# 3 GOO bcd xyz       PAR
# 4 KOO     zzz      <NA>
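
With more ID columns, the join condition could be generated rather than typed out. This is a rough sketch, assuming the ID columns have the same names in both data frames and that an empty string means "no ID". The names id_cols, on_clause and query are illustrative only, not part of sqldf.

```r
library(sqldf)

# Example data from the question
df_A <- data.frame(dta = c("FOO", "BAR", "GOO"),
                   id1 = c("abc", "", "bcd"),
                   id2 = c("", "", "xyz"),
                   id3 = c("def", "fgh", ""), stringsAsFactors = FALSE)
df_B <- data.frame(dta = c("FUU", "PAR", "KOO"),
                   id1 = c("abc", "", ""),
                   id2 = c("", "xyz", "zzz"),
                   id3 = c("", "", ""), stringsAsFactors = FALSE)

# Assumption: these columns exist in both data frames
id_cols <- c("id1", "id2", "id3")

# One "(non-empty and equal)" term per ID column, combined with "or"
on_clause <- paste(
  sprintf("(a.%1$s <> '' and a.%1$s = b.%1$s)", id_cols),
  collapse = " or "
)

query <- paste0(
  "select a.*, b.dta as dta_b ",
  "from df_A a left join df_B b on ", on_clause
)

one <- sqldf(query)
one
```

The second query and the bind_rows step can then be applied to this result unchanged.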
IceCreamToucan