Inner join exactly on one column and fuzzy on another

Question

I have two dataframes I want to join. They share two fields: group_id and person_name. I want to join exactly on group_id and fuzzy on person_name. How can I do this?

Constraints:

It should be an inner join. So group_id exactly and person_name fuzzy must appear in both the left and right frames.
The real dataframes are large. I have tried the answer suggested by David Robinson using his package fuzzyjoin, but there is too much data to create a Cartesian product before filtering.
I'd love an answer in the tidyverse but it's not strictly necessary.

Here is a small example:

a = data.frame(
    group_id=c(1,2,2,3,3,3),
    person_name=c('Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'),
    eye_color=c('brown', 'green', 'blue', 'brown', 'green', 'blue')
)
b = data.frame(
    group_id=c(2,2,2,3,3,3,3),
    person_name=c('Alie', 'Bobo', 'Charles', 'Charlie', 'Davis', 'Eva', 'Zed' ),
    hair_color=c('brown', 'brown', 'black', 'grey', 'brown', 'black', 'blond')
)
expected = data.frame(
    group_id=c(2,2,3,3),
    person_name_x=c('Bob', 'Charlie', 'David', 'Eve'),
    person_name_y=c('Bobo', 'Charles', 'Davis', 'Eva'),
    eye_color=c('green', 'blue', 'brown', 'green'),
    hair_color=c('brown', 'black', 'brown', 'black')
)

You might want to check out the packages `RecordLinkage` (see answer) or [`fastLink`](https://github.com/kosukeimai/fastLink), which allow blocking and fuzzy matching. — lukeA, Feb 11 '18 at 12:41

lukeA · Answer 1 · 2018-02-11T12:34:28.730

You could try

library(RecordLinkage)
library(tidyverse)
compare.linkage(a, b, strcmp = 2, exclude=3, blockfld = 1) %>% 
  epiWeights %>% 
  epiClassify(.8) %>% 
  getPairs(show="links", single.rows=T) %>% 
  .[(c(2,3,7,4,8))]
# group_id.1 person_name.1 person_name.2 eye_color.1 hair_color.2
# 3          2       Charlie       Charles        blue        black
# 2          2           Bob          Bobo       green        brown
# 4          3         David         Davis       brown        brown
# 5          3           Eve           Eva       green        black

score 1 · Answer 2 · answered Feb 11 '18 at 06:02

In this example, we basically need a hybrid join. For one column (group_id), we need an exact match of column names whereas for the other column (person_name) we need a fuzzy join.

One way to do this :

library(fuzzyjoin)
common_id <- intersect(a$group_id, b$group_id)
stringdist_inner_join(a[a$group_id %in% common_id, ], b[b$group_id %in% common_id, ], 
                                                      by = "person_name")

# group_id.x person_name.x eye_color group_id.y person_name.y hair_color
#        <dbl>        <fctr>    <fctr>      <dbl>        <fctr>     <fctr>
#1          2           Bob     green          2          Bobo      Brown
#2          2       Charlie      blue          2       Charles      Black
#3          3         David     brown          3         Davis      Brown
#4          3           Eve     green          3           Eva      Black

Here, we first find those common group_id's using intersect which are present in both the dataframes and filter them accordingly from a and b and then use stringdist_inner_join function on only person_name column. We can later remove the extra group_id column which has been generated.

This does not check the exact match on `group_id`. I edited my test case to demonstrate. Doing this and then filtering afterwards takes too much memory. — Hatshepsut, Feb 11 '18 at 08:52
@Hatshepsut so you tried `df <- stringdist_inner_join(a, b, by = "person_name"); df[df$group_id.x == df$group_id.y, ]` right? — Ronak Shah, Feb 11 '18 at 11:02

Inner join exactly on one column and fuzzy on another

2 Answers2