I was initially helped on doing a fuzzy match in a previous post
I would like to match two datasets based on arbitrary address fields relative to each other using R
credit to: @Ronak Shah, @r2evans, and @akrun for previous help
That was helpful and i got the fuzzy match I wanted based on these two datasets
structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73",
"Mulberry street", "Front street and Fulton", "62nd street ",
"wythe street", "vanderbilt avenue", "South Beach avenue")), class = "data.frame", row.names = c(NA,
-8L))
and
structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street",
"Mulberry street", "Front street and close to Fulton", "south beach avenue",
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
)), class = "data.frame", row.names = c(NA, -8L))
running
fuzzyjoin::stringdist_left_join(df1, df2, by = 'Address', max_dist = 5)
gave me
structure(list(ID = 1:8, Address.x = c("Canal and Broadway",
"55 water street room number 73", "Mulberry street", "Front street and Fulton",
"62nd street ", "wythe street", "vanderbilt avenue", "South Beach avenue"
), ID2 = c(1L, NA, 3L, NA, 8L, 8L, 7L, 5L), Address.y = c("Canal & Broadway",
NA, "Mulberry street", NA, "62nd street", "62nd street", "vanderbilt ave",
"south beach avenue")), row.names = c(NA, -8L), class = "data.frame")
The match did as good as it can and I accept that. What i want to do next is match df1_new and df2_new
df1
structure(list(ID = 1:8, Address = c("Canal and Broadway", "55 water street room number 73",
"Mulberry street", "Front street and Fulton", "62nd street ",
"wythe street", "vanderbilt avenue", "South Beach avenue"), Age = c(32L,
33L, 37L, 39L, 38L, 50L, 60L, 42L), Name = c("John ", "Adam",
"Alan", "Greg", "Phil", "Anthony", "Mike", "Mark")), class = "data.frame", row.names = c(NA,
-8L))
and df 2
structure(list(ID2 = 1:8, Address = c("Canal & Broadway", "Somewhere around 55 water street",
"Mulberry street", "Front street and close to Fulton", "south beach avenue",
"along wythe street on the southwest ", "vanderbilt ave", "62nd street"
), Age = c(32L, 33L, 37L, 39L, 42L, 50L, 60L, 35L), Name = c("John",
"Adam", "Ryan", "Greg", "Mark", "Anthony", "Mike", "Phil")), class = "data.frame", row.names = c(NA,-8L))
Normally I would run
df3<-df1 %>% left_join(df2, by=c("Address","Age","Name")
However Address variable needs to go through a fuzzy match while the others can go through standard. I expect to put the left_join and fuzzyjoin::stringdist_left_join function together
ID Address.x D2 Address.y Age Name
1 Canal and Broadway 1 Canal & Broadway 32 John
2 55 water street room number 73
3 Mulberry street
4 Front street and Fulton
5 62nd street 8 62nd street
6 wythe street
7 vanderbilt avenue 7 vanderbilt ave 60 Mike
8 South Beach avenue 5 south beach avenue 42 Mark
Note that though 62nd street and Mulberry street matched on fuzzy match they did not have the same corresponding Age and Name.