-1

I was initially helped on doing a fuzzy match in a previous post

I would like to match two datasets based on arbitrary address fields relative to each other using R

credit to: @Ronak Shah, @r2evans, and @akrun for previous help

That was helpful and i got the fuzzy match I wanted based on these two datasets

structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73", 
"Mulberry street", "Front street and Fulton", "62nd street ", 
"wythe street", "vanderbilt avenue", "South Beach avenue")), class = "data.frame", row.names = c(NA, 
-8L))

and

structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street", 
"Mulberry street", "Front street and close to Fulton", "south beach avenue", 
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
)), class = "data.frame", row.names = c(NA, -8L))

running

fuzzyjoin::stringdist_left_join(df1, df2, by = 'Address', max_dist = 5)

gave me

structure(list(ID = 1:8, Address.x = c("Canal and Broadway", 
"55 water street room number 73", "Mulberry street", "Front street and Fulton", 
"62nd street ", "wythe street", "vanderbilt avenue", "South Beach avenue"
), ID2 = c(1L, NA, 3L, NA, 8L, 8L, 7L, 5L), Address.y = c("Canal & Broadway", 
NA, "Mulberry street", NA, "62nd street", "62nd street", "vanderbilt ave", 
"south beach avenue")), row.names = c(NA, -8L), class = "data.frame")

The match did as good as it can and I accept that. What i want to do next is match df1_new and df2_new

df1

structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73", 
"Mulberry street", "Front street and Fulton", "62nd street ", 
"wythe street", "vanderbilt avenue", "South Beach avenue"), Age = c(32L, 
33L, 37L, 39L, 38L, 50L, 60L, 42L), Name = c("John ", "Adam", 
"Alan", "Greg", "Phil", "Anthony", "Mike", "Mark")), class = "data.frame", row.names = c(NA, 
 -8L))

and df 2

structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street", 
"Mulberry street", "Front street and close to Fulton", "south beach avenue", 
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
), Age = c(32L, 33L, 37L, 39L, 42L, 50L, 60L, 35L), Name = c("John", 
"Adam", "Ryan", "Greg", "Mark", "Anthony", "Mike", "Phil")), class = "data.frame", row.names = c(NA,-8L))

Normally I would run

df3<-df1 %>% left_join(df2, by=c("Address","Age","Name")

However Address variable needs to go through a fuzzy match while the others can go through standard. I expect to put the left_join and fuzzyjoin::stringdist_left_join function together

ID   Address.x                       D2   Address.y        Age        Name
1   Canal and Broadway                1   Canal & Broadway  32        John
2   55 water street room number 73    
3   Mulberry street                   
4   Front street and Fulton           
5   62nd street                       8 62nd street
6   wythe street                      
7   vanderbilt avenue                 7 vanderbilt ave      60        Mike
8   South Beach avenue                5 south beach avenue  42        Mark

Note that though 62nd street and Mulberry street matched on fuzzy match they did not have the same corresponding Age and Name.

user35131
  • 1,105
  • 6
  • 18
  • Likely duplicate: https://stackoverflow.com/questions/56009863/how-to-fuzzy-join-based-on-multiple-columns-and-conditions – Ian Campbell May 05 '21 at 15:38

1 Answers1

0
fuzzyjoin::stringdist_left_join(df1_new, df2_new ['Address'], by = 'Address', max_dist 
= 5) %>%
mutate(Address.z=Address.y) %>% left_join(df2_new %>% 
mutate(Address.z=Address),by=c("Age","Name", "Address.z"))

This got me the result I was looking for.

user35131
  • 1,105
  • 6
  • 18