I have two dataframes I want to join. They share two fields: group_id and person_name. I want to join exactly on group_id and fuzzy on person_name. How can I do this?
Constraints:
- It should be an inner join. That is, a row is kept only if group_id matches exactly and person_name matches fuzzily between the left and right frames.
- The real dataframes are large. I have tried the answer suggested by David Robinson using his package fuzzyjoin, but there is too much data to create a Cartesian product before filtering.
- I'd love an answer in the tidyverse but it's not strictly necessary.
Here is a small example:
a = data.frame(
  group_id = c(1, 2, 2, 3, 3, 3),
  person_name = c('Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'),
  eye_color = c('brown', 'green', 'blue', 'brown', 'green', 'blue')
)

b = data.frame(
  group_id = c(2, 2, 2, 3, 3, 3, 3),
  person_name = c('Alie', 'Bobo', 'Charles', 'Charlie', 'Davis', 'Eva', 'Zed'),
  hair_color = c('brown', 'brown', 'black', 'grey', 'brown', 'black', 'blond')
)

expected = data.frame(
  group_id = c(2, 2, 3, 3),
  person_name_x = c('Bob', 'Charlie', 'David', 'Eve'),
  person_name_y = c('Bobo', 'Charles', 'Davis', 'Eva'),
  eye_color = c('green', 'blue', 'brown', 'green'),
  hair_color = c('brown', 'black', 'brown', 'black')
)
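
To pin down the intended semantics, here is a rough base-R sketch that reproduces the expected rows on the small example. It assumes that "fuzzy" means a Levenshtein distance of at most 2 (the threshold and the names max_dist, pairs and result are illustrative choices, not part of the question). It joins exactly on group_id first, so candidate pairs are only formed within a group rather than across the whole data. Within-group pairs are still fully materialized, so this may not be enough for very large groups.

```r
a <- data.frame(
  group_id = c(1, 2, 2, 3, 3, 3),
  person_name = c('Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'),
  eye_color = c('brown', 'green', 'blue', 'brown', 'green', 'blue'),
  stringsAsFactors = FALSE
)
b <- data.frame(
  group_id = c(2, 2, 2, 3, 3, 3, 3),
  person_name = c('Alie', 'Bobo', 'Charles', 'Charlie', 'Davis', 'Eva', 'Zed'),
  hair_color = c('brown', 'brown', 'black', 'grey', 'brown', 'black', 'blond'),
  stringsAsFactors = FALSE
)

# Assumed threshold: names within 2 edits count as a fuzzy match.
max_dist <- 2

# Step 1: exact inner join on group_id. Rows are paired only within a group;
# the clashing person_name columns get the suffixes _x and _y.
pairs <- merge(a, b, by = "group_id", suffixes = c("_x", "_y"))

# Step 2: Levenshtein distance for each candidate pair (utils::adist).
d <- mapply(
  function(x, y) adist(x, y)[1, 1],
  pairs$person_name_x,
  pairs$person_name_y
)

# Step 3: keep only the pairs that are close enough.
result <- pairs[d <= max_dist, ]
print(result)
```

On the toy data this keeps the four pairs listed in expected. The column order differs from expected because merge places the columns of a before those of b.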