Joining two tables by cross checking two by two columns in R tidyverse

Question

Question:

How do I efficiently join two tables by cross checking 2x2 columns using R tidyverse and dplyr? I'm relatively new to R, but I cannot find this question addressed in any previous questions or discussions.

I have two tables with different numbers of rows and columns. Each table contains column A and B. These columns contain strings that can be either identical or unique, and they may also overlap or be missing from one or the other column. Basically, I need to check column A1 against both A2 and B2, and then check B1 against both A2 and B2.

Example to explain consept:

df1
ID          pISSN       eISSN       Level
437097                  1530-9932   1
489309      2366-004X   2366-0058   1
437103      0025-5858               1
437109      1042-9670   1545-7230   1
449363      1093-1139               0
437127                  0949-1775   1
437124      0361-3682   1873-6289   2
481203      0103-846X   0103-846X   1
479825      2153-2184   2153-2192   0
437136      0734-2071   1557-7333   2


df2
ID          pISSN       eISSN       Format
41120                   2364-9534   E OA S C
12249                   1530-9932   E OF S
261                     2366-0058   E OF S
12188       0025-5858   1865-8784   PE OF S
40596       1042-9670   1545-7230   PE OF S
12129       0895-4852   1936-4709   PE OF
769         0949-1775   1432-0517   PE OF S


result
ID          pISSN       eISSN       Level   Format
437097                  1530-9932   1       E OF S
489309      2366-004X   2366-0058   1       E OF S
437103      0025-5858   1865-8784   1       PE OF S
437109      1042-9670   1545-7230   1       PE OF S
437127                  0949-1775   1       PE OF S

Example tables for input:

dput(df1, file = "")
structure(list(ID = c(437097, 489309, 437103, 437109, 449363, 437127, 437124, 481203, 479825, 437136), pISSN = c(NA, "2366-004X", "0025-5858", "1042-9670", "1093-1139", NA, "0361-3682", "0103-846X", "2153-2184", "0734-2071"), eISSN = c("1530-9932", "2366-0058", NA, "1545-7230", NA, "0949-1775", "1873-6289", "0103-846X", "2153-2192", "1557-7333"), Level = c(1, 1, 1, 1, 0, 1, 2, 1, 0, 2)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

dput(df2, file = "")
structure(list(ID = c(41120, 12249, 261, 12188, 40596, 12129, 769), pISSN = c(NA, NA, NA, "0025-5858", "1042-9670", "0895-4852", "0949-1775"), eISSN = c("2364-9534", "1530-9932", "2366-0058", "1865-8784", "1545-7230", "1936-4709", "1432-0517"), Format = c("E OA S C", "E OF S", "E OF S", "PE OF S", "PE OF S", "PE OF", "PE OF S")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

Interesting. Are you able to provide a reproducible example? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — william3031, Feb 01 '19 at 03:52
Hi and thanks for the response. I’m quite new to R, coding and Stack Overflow. So thanks for the info link, I’ve tried to set up a sample using dput. I've added it to the post. Hope this works. — jehaa, Feb 01 '19 at 11:40

Ramiro Magno · Accepted Answer · 2019-02-06T07:49:19.917

I think I now understand what you are trying to achieve.

Code

# Step 1
library(magrittr)
suppressMessages(library(dplyr))
library(fuzzyjoin)

# Step 2
df1 <- structure(list(ID = c(437097, 489309, 437103, 437109, 449363, 437127, 437124, 481203, 479825, 437136), pISSN = c(NA, "2366-004X", "0025-5858", "1042-9670", "1093-1139", NA, "0361-3682", "0103-846X", "2153-2184", "0734-2071"), eISSN = c("1530-9932", "2366-0058", NA, "1545-7230", NA, "0949-1775", "1873-6289", "0103-846X", "2153-2192", "1557-7333"), Level = c(1, 1, 1, 1, 0, 1, 2, 1, 0, 2)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(ID = c(41120, 12249, 261, 12188, 40596, 12129, 769), pISSN = c(NA, NA, NA, "0025-5858", "1042-9670", "0895-4852", "0949-1775"), eISSN = c("2364-9534", "1530-9932", "2366-0058", "1865-8784", "1545-7230", "1936-4709", "1432-0517"), Format = c("E OA S C", "E OF S", "E OF S", "PE OF S", "PE OF S", "PE OF", "PE OF S")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

# Step 3  
my_match <- function(key1, key2) {
  match <- key1 == key2
  match[is.na(match)] <- FALSE
  return(match)
}

# Step 4 
bind_rows(
fuzzy_inner_join(df1, df2, 
                 by = c("pISSN" = "pISSN"), 
                 match_fun = list(my_match)),
fuzzy_inner_join(df1, df2, 
                 by = c("eISSN" = "pISSN"), 
                 match_fun = list(my_match)),
fuzzy_inner_join(df1, df2, 
                 by = c("pISSN" = "eISSN"), 
                 match_fun = list(my_match)),
fuzzy_inner_join(df1, df2, 
                 by = c("eISSN" = "eISSN"), 
                 match_fun = list(my_match))
) %>% # Step 5
  mutate(pISSN = coalesce(pISSN.x, pISSN.y),
         eISSN = coalesce(eISSN.x, eISSN.y)) %>%
  select(-c("pISSN.x", "pISSN.y", "eISSN.x", "eISSN.y")) %>%
  select("ID.x", "ID.y", "pISSN", "eISSN", "Level", "Format") -> result

result
#> # A tibble: 6 x 6
#>     ID.x  ID.y pISSN     eISSN     Level Format 
#>    <dbl> <dbl> <chr>     <chr>     <dbl> <chr>  
#> 1 437103 12188 0025-5858 1865-8784     1 PE OF S
#> 2 437109 40596 1042-9670 1545-7230     1 PE OF S
#> 3 437127   769 0949-1775 0949-1775     1 PE OF S
#> 4 437097 12249 <NA>      1530-9932     1 E OF S 
#> 5 489309   261 2366-004X 2366-0058     1 E OF S 
#> 6 437109 40596 1042-9670 1545-7230     1 PE OF S

Explanation of each step

I load some needed packages, namely: magrittr, for the pipe operator %>%; dplyr, for bind_rows (analogous to rbind), mutate and select; and, fuzzyjoin for fuzzy_inner_join.
Then I create your two example data frames: df1 and df2.
We define function my_match. This function does a match based on equality but when NAs are involved it returns FALSE (no match) instead of NA.
Then we use fuzzy_inner_join to do four joins by the following keys: (i) df1 "pISSN" and df2 "pISSN"; (ii) df1 "eISSN" and df2 "pISSN"; (iii) df1 "pISSN" and df2 "eISSN"; and (iv) df1 "eISSN" and df2 "eISSN". This is the part where we do what you called the cross checking 2x2 columns. We then wrap these four resulting data frames in bind_rows to have all those observations (rows) in one single data frame.
Finally we do some data wrangling to get the data frame into your desired shape: (i) we use mutate to create two new columns pISSN and eISSN from the columns pISSN.x (originally from df1) and pISSN.y (originally from df2), and from eISSN.x (from df1) and eISSN.y (from df2), respectively; (ii) and we use select to keep/discard the columns.

NB: Contrary to your intended result, I output two ID columns, one from the df1 and another from df2. In your post, you were keeping only the ID from data frame df1. But it is ambiguous which one to keep, so I kept the two. You can always discard one of the them with select(-ID.x) or select(-ID.y).

Thanks for this solution and thank you for the explanation of each step. It seems to work well, in that it catches all possible combinations. I do however get duplicates. I'm looking into whats happening, but they are easily managed. — jehaa, Mar 08 '19 at 07:32

score 1 · Answer 2 · answered Feb 01 '19 at 12:17

I am a bit confused by your example code and then the code you shared with dput because I am not sure how they are related... but here's my take on your question:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)

df1 <- structure(list(ID = c(437097, 489309, 437103, 437109, 449363, 437127, 437124, 481203, 479825, 437136), pISSN = c(NA, "2366-004X", "0025-5858", "1042-9670", "1093-1139", NA, "0361-3682", "0103-846X", "2153-2184", "0734-2071"), eISSN = c("1530-9932", "2366-0058", NA, "1545-7230", NA, "0949-1775", "1873-6289", "0103-846X", "2153-2192", "1557-7333"), Level = c(1, 1, 1, 1, 0, 1, 2, 1, 0, 2)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))


df2 <- structure(list(ID = c(41120, 12249, 261, 12188, 40596, 12129, 769), pISSN = c(NA, NA, NA, "0025-5858", "1042-9670", "0895-4852", "0949-1775"), eISSN = c("2364-9534", "1530-9932", "2366-0058", "1865-8784", "1545-7230", "1936-4709", "1432-0517"), Format = c("E OA S C", "E OF S", "E OF S", "PE OF S", "PE OF S", "PE OF", "PE OF S")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

surrogate_key <- Vectorize(function(x, y) {
  str_c(sort(c(x, y)), collapse = "")
})

df1 %>% mutate(join_key = surrogate_key(pISSN, eISSN)) -> df3
df2 %>% mutate(join_key = surrogate_key(pISSN, eISSN)) -> df4

result <- full_join(df3, df4, "join_key") %>%
  select(-join_key)
#> Warning: Column `join_key` has different attributes on LHS and RHS of join

result
#> # A tibble: 15 x 8
#>      ID.x pISSN.x   eISSN.x   Level  ID.y pISSN.y   eISSN.y   Format  
#>     <dbl> <chr>     <chr>     <dbl> <dbl> <chr>     <chr>     <chr>   
#>  1 437097 <NA>      1530-9932     1 12249 <NA>      1530-9932 E OF S  
#>  2 489309 2366-004X 2366-0058     1    NA <NA>      <NA>      <NA>    
#>  3 437103 0025-5858 <NA>          1    NA <NA>      <NA>      <NA>    
#>  4 437109 1042-9670 1545-7230     1 40596 1042-9670 1545-7230 PE OF S 
#>  5 449363 1093-1139 <NA>          0    NA <NA>      <NA>      <NA>    
#>  6 437127 <NA>      0949-1775     1    NA <NA>      <NA>      <NA>    
#>  7 437124 0361-3682 1873-6289     2    NA <NA>      <NA>      <NA>    
#>  8 481203 0103-846X 0103-846X     1    NA <NA>      <NA>      <NA>    
#>  9 479825 2153-2184 2153-2192     0    NA <NA>      <NA>      <NA>    
#> 10 437136 0734-2071 1557-7333     2    NA <NA>      <NA>      <NA>    
#> 11     NA <NA>      <NA>         NA 41120 <NA>      2364-9534 E OA S C
#> 12     NA <NA>      <NA>         NA   261 <NA>      2366-0058 E OF S  
#> 13     NA <NA>      <NA>         NA 12188 0025-5858 1865-8784 PE OF S 
#> 14     NA <NA>      <NA>         NA 12129 0895-4852 1936-4709 PE OF   
#> 15     NA <NA>      <NA>         NA   769 0949-1775 1432-0517 PE OF S

@jehaa: I am guessing you find the `full_join` the difficult part. Take a look at Hadley's book on this topic: https://r4ds.had.co.nz/relational-data.html#mutating-joins. — Ramiro Magno, Feb 04 '19 at 14:44
Hi rmagno and thanks for your answer. I've tried to explain my examples more, thanks for the feedback. I'm still new to R, so I'm trying to understand how your suggestion works. Is it similar to the gather function? In that it merges two columns before the join? From what I can see in your result, it seems it does not catch every combination. E.g for df1 ID entry 489309. This should be connected to df2 ID 261 by eISSN. — jehaa, Feb 05 '19 at 14:36
@jehaa, can you make your examples consistent with your `dput` example code? — Ramiro Magno, Feb 05 '19 at 14:59
Thanks for the link. I've setup the examples so that they are now consistent. Exactly how the result is formatted is not as important, what matters is the matching. — jehaa, Feb 05 '19 at 21:14

Joining two tables by cross checking two by two columns in R tidyverse

2 Answers2

Code

Explanation of each step