I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values of these), I would like to do fuzzy matching on Zipcode, but only for those cases which have exact same StrCity value. I have started with first selecting only addresses that are matching the StrCity from dictionary and then fuzzy matching, but there are two problems:
1) if matched by Zipcode, it doesn't take into account the street and city 2) if matched by Address (containing all of Zipcode, Street and City), it returns also possible values, where on the same zipcode there is another street that is close enough in terms of distance.
Probably I need something like doing two different matches at the same time (one fuzzy and one exact), but I am not sure how to implement it, while not killing my computer performance-wise.
Here is data sample of TableAd:
StrCity ID Zipcode Street City Address
BiałowiejskaWarszawa 5148676 01-459 Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa 6423687 02-730 Bukowińska Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa 6425093 03-536 Kanałowa Warszawa 03-536KanałowaWarszawa
And the dictionary sample:
Zipcode Street City Address StrCity
02-882 Agaty Warszawa 02-882AgatyWarszawa AgatyWarszawa
03-663 Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536 Kołowa Warszawa 03-536KołowaWarszawa KołowaWarszawa
Here is my current code:
TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"
ResultTable <- TableMatch %>%
stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
select(ID, Zipcode.x, Address.x, Address.y, dist) %>%
group_by(Address.x) %>%
# select best fit record
top_n(-1, dist)
The problem I found specifically with an example provided above - the script verifies that strCity KanałowaWarszawa is present in dictionary, but Levenshtein distance of combined Address string is the same when changing the zipcode as when changing the street to Kołowa, which has the same zipcode as the one inspected. Here it returns both changes, but if there would be just 2 or 1 digits difference in zipcode, then it might incorrectly suggest replacing the street while zipcode should be changed.
Note: I am using packages purrr
, dplyr
and fuzzyjoin
.