
I have two tables containing addresses (street, city, zipcode, and two fields containing concatenated values of these). I would like to do fuzzy matching on Zipcode, but only for those cases that have exactly the same StrCity value. I started by selecting only the addresses whose StrCity matches the dictionary and then fuzzy matching, but there are two problems:

1. If matched by Zipcode, it doesn't take the street and city into account.
2. If matched by Address (the concatenation of Zipcode, Street and City), it also returns candidates where, for the same zipcode, there is another street that is close enough in terms of distance.

I probably need to run two different matches at the same time (one fuzzy and one exact), but I am not sure how to implement that without killing performance.

Here is a data sample of TableAd:

StrCity              ID      Zipcode Street       City     Address
BiałowiejskaWarszawa 5148676 01-459  Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa   6423687 02-730  Bukowińska   Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa     6425093 03-536  Kanałowa     Warszawa 03-536KanałowaWarszawa

And the dictionary sample:

Zipcode Street   City     Address                StrCity
02-882  Agaty    Warszawa 02-882AgatyWarszawa    AgatyWarszawa
03-663  Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536  Kołowa   Warszawa 03-536KołowaWarszawa   KołowaWarszawa

Here is my current code:

library(dplyr)
library(fuzzyjoin)

# exact match on StrCity first
TableMatch <- merge(TableAd, TableDict, by="StrCity")
# drop the dictionary columns (suffix ".y") added by the merge
TableMatch <- TableMatch[, -grep("\\.y$", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"

# then fuzzy match on the concatenated Address
ResultTable <- TableMatch %>% 
  stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
  select(ID, Zipcode.x, Address.x, Address.y, dist) %>% 
  group_by(Address.x) %>% 
  # select best fit record
  top_n(-1, dist)

The problem I found, specifically with the example provided above: the script verifies that StrCity KanałowaWarszawa is present in the dictionary, but the Levenshtein distance of the combined Address string is the same whether the zipcode is changed or the street is changed to Kołowa, which has the same zipcode as the one inspected. Here it returns both candidates, but if the zipcodes differed by only 1 or 2 digits, it might incorrectly suggest replacing the street when it is the zipcode that should be changed.
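To illustrate the tie (this check is not part of the original question, just a quick verification with the stringdist package): both dictionary candidates sit at the same Levenshtein distance from the address being inspected.

library(stringdist)
# candidate with the same zipcode but a different street (Kołowa)
stringdist("03-536KanałowaWarszawa", "03-536KołowaWarszawa", method = "lv")
# [1] 3
# candidate with the same street but a different zipcode (03-663)
stringdist("03-536KanałowaWarszawa", "03-663KanałowaWarszawa", method = "lv")
# [1] 3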

Note: I am using packages purrr, dplyr and fuzzyjoin.


2 Answers


Here is a way to make it work, using the general fuzzyjoin functions, which are more flexible:

data

TableAd <- read.table(header=TRUE, stringsAsFactors=FALSE, text="StrCity ID Zipcode Street City Address
BiałowiejskaWarszawa 5148676 01-459  Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa   6423687 02-730  Bukowińska   Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa     6425093 03-536  Kanałowa     Warszawa 03-536KanałowaWarszawa")

TableDict <- read.table(header=TRUE, stringsAsFactors=FALSE, text="Zipcode Street City Address StrCity
02-882  Agaty    Warszawa 02-882AgatyWarszawa    AgatyWarszawa
03-663  Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536  Kołowa   Warszawa 03-536KołowaWarszawa   KołowaWarszawa")

solution

library(fuzzyjoin)
library(stringdist)
res <- fuzzy_left_join(
  TableAd,
  TableDict,
  by = c("StrCity", "Zipcode"),
  match_fun = list(
    `==`,                                                                   # exact match on StrCity
    function(x, y) stringdist(tolower(x), tolower(y), method = "lv") <= 5   # fuzzy match on Zipcode
  )
)
res %>% 
  select(StrCity = StrCity.x, everything(), - StrCity.y)

#                StrCity      ID Zipcode.x     Street.x   City.x                  Address.x Zipcode.y Street.y   City.y              Address.y
# 1 BialowiejskaWarszawa 5148676    01-459 Bialowiejska Warszawa 01-459BialowiejskaWarszawa      <NA>     <NA>     <NA>                   <NA>
# 2   BukowinskaWarszawa 6423687    02-730   Bukowinska Warszawa   02-730BukowinskaWarszawa      <NA>     <NA>     <NA>                   <NA>
# 3     KanalowaWarszawa 6425093    03-536     Kanalowa Warszawa     03-536KanalowaWarszawa    03-663 Kanalowa Warszawa 03-663KanalowaWarszawa

The problem with the above solution is that it computes a Cartesian product internally, which might be problematic if you have a lot of data. The impact is reduced by the fact that you're joining on concatenated strings, but it still feels like a hack that would be better avoided.

A way to sort this out is to apply the fuzzy join only on pairs of subsets determined by the exact matches. We define a function below to do that, along with extended sample data.

data

TableAd2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text="ID Zipcode Street City
5148676 01-459  Białowiejska Warszawa
6423687 02-730  Bukowińska   Warszawa
6423687 99-999  Agaty        Warszawa
6423687 02-883  Agaty        Warszawa
6425093 03-536  Kanałowa     Warszawa")

TableDict2 <- read.table(header=TRUE, stringsAsFactors=FALSE, text="Zipcode Street City
02-882  Agaty    Warszawa
03-663  Kanałowa Warszawa
03-536  Kołowa   Warszawa
02-730  Bukowińska Warszawa")
  • Bukowińska should be matched, as its zipcode matches perfectly
  • Kanałowa should be matched, as only 3 digits of its zipcode differ
  • Agaty should be matched for one item only, as the other differs by 5 characters and we'll allow at most 3

function

fuzzy_inner_join2 <- function(x, y, by, match_fun, ...){
  match_fun_equal_lgl <- sapply(match_fun, identical, `==`)
  # columns to use for exact join equivalent
  by_exact = by[match_fun_equal_lgl]
  # columns to use for fuzzy join on relevant subsets of data (for efficiency)
  by_fuzzy = by[!match_fun_equal_lgl]
  # update match_fun
  match_fun <- match_fun[!match_fun_equal_lgl]
  # trim inputs of irrelevant data
  x <- dplyr::semi_join(x, y, by = by_exact)
  y <- dplyr::semi_join(y, x, by = by_exact)
  # make lists so we have pairs of data frames to fuzzy join together
  # (after the semi joins both sides contain the same exact-match groups, in the same order)
  x_list <- dplyr::group_split(dplyr::group_by_at(x, by_exact))
  y_list <- dplyr::group_split(dplyr::group_by_at(y, by_exact), keep = FALSE)
  # apply fuzzy join on pairs and bind the results (requires purrr)
  purrr::map2_dfr(x_list, y_list, fuzzyjoin::fuzzy_inner_join, match_fun = match_fun,
                  by = by_fuzzy, ...)
}

solution

fuzzy_inner_join2(
  TableAd2,
  TableDict2,
  by=c("City","Street","Zipcode"),
  match_fun = list(
    `==`, `==`,
    function(x,y) stringdist(tolower(x), tolower(y), method="lv") <= 3)
)

# # A tibble: 3 x 5
#        ID Zipcode.x Street     City     Zipcode.y
#     <int> <chr>     <chr>      <chr>    <chr>    
# 1 6423687 02-883    Agaty      Warszawa 02-882   
# 2 6423687 02-730    Bukowinska Warszawa 02-730   
# 3 6425093 03-536    Kanalowa   Warszawa 03-663

To do partly fuzzy and partly exact matching with fuzzyjoin, you can supply multiple match_funs and customize your own. Here I use an exact match (==) for StrCity and a stringdist-based match for Zipcode and Address. To do this, I took the match_fun code used by stringdist_join and customized it.

For more accuracy on matching zipcode, I imagine you may want to break out the digits and then use a match_fun for numerical closeness rather than stringdist.
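As a rough sketch of that idea (mine, not part of the original answer; the helper name and the tolerance are made up for illustration), such a match_fun could strip the dash from the "NN-NNN" zipcodes and compare them as integers:

# hypothetical helper: compare zipcodes as numbers instead of strings
match_fun_zip_numeric <- function(v1, v2) {
  n1 <- as.integer(gsub("-", "", v1))   # e.g. "02-883" -> 2883
  n2 <- as.integer(gsub("-", "", v2))
  abs(n1 - n2) <= 100                   # tolerance chosen arbitrarily for illustration
}

It could then be slotted into the match_fun list below in place of match_fun_stringdist for the Zipcode column.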

library(fuzzyjoin); library(dplyr)

# First, need to define match_fun_stringdist 
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {

  ignore_case = TRUE
  method = "lv"
  max_dist = 99
  distance_col = "dist"

  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }

  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large

  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist

    dists <- rep(NA, length(v1))

    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }
  ret <- dplyr::tibble(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}


# Now, call fuzzy_join with multiple match_fun
fuzzy_join(TableAd, TableDict,   # the question's two tables
           by = list(x = c("Address", "Zipcode", "StrCity"), y = c("Address", "Zipcode", "StrCity")), 
           match_fun = list(match_fun_stringdist, match_fun_stringdist, `==`),
           mode = "left"
) %>%
  # shared columns come back suffixed with .x/.y, so group on the .x versions
  group_by(StrCity.x, Zipcode.x, Address.x) %>%
  top_n(-1, Address.dist) %>%
  select(Address.dist, everything())