
My SQL Server database contains 2 tables, each with a similar set of fields for a mailing (physical) address. NB: these tables are populated before the data gets to my database (I can't change that). The sets of fields in the two tables are similar though not identical: most fields exist in both tables, some only in one, some only in the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.

Example fields:

  1. Street Number
  2. Predirection
  3. Street Name
  4. Street Suffix
  5. Postdirection (in one table and not the other)
  6. Unit name (one table) vs. Address 2 (other table) -- adds complexity
  7. Zip code (length varies between the tables: 5 vs. 5+ digits)
  8. Legal description

Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source is highly preferred by the client.

Among options such as SOUNDEX, DIFFERENCE, and Levenshtein distance (all SQL), and usaddress and dedupe (Python), none stands out as a good-fit solution.
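For context, here is roughly what parsing with usaddress looks like in Python. The sample address is made up, and the label names come from usaddress's documented tag set, so they're worth verifying against the installed version:

```python
# Rough sketch of standardizing a free-form address into labeled components
# with the usaddress package (pip install usaddress), one of the Python
# options mentioned above. Comparing these components field by field is one
# way to line the two tables' schemas up before scoring a match.
import usaddress

raw = "123 N Main St Apt 4 Springfield 62704"   # made-up example address

try:
    components, addr_type = usaddress.tag(raw)  # OrderedDict of labeled parts
except usaddress.RepeatedLabelError:
    components, addr_type = {}, "Ambiguous"     # handle ambiguous input as needed

print(addr_type)                                    # e.g. 'Street Address'
print(components.get("AddressNumber"))              # street number
print(components.get("StreetNamePreDirectional"))   # predirection
print(components.get("StreetName"))                 # street name
print(components.get("StreetNamePostType"))         # street suffix
print(components.get("OccupancyIdentifier"))        # unit / address 2
print(components.get("ZipCode"))                    # zip code
```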

Kevin R. M.
    "high confidence" is the problem...you'l have to standardize the addresses in both tables before you can try and match them. You could look at something like Accuzip.com...but like I said you'll first need to standardize the addresses before you try to compare them. – BillRuhl Nov 22 '19 at 18:10
  • I agree with @BillRuhl about standardizing both addresses and then comparing them. Here is one [option](https://smartystreets.com/articles/usps-api), but it is going to cost something. – Isaac Nov 22 '19 at 18:14
  • Address standardization is trickier than one may think and will require constant monitoring. Take a peek at https://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 – John Cappelletti Nov 22 '19 at 18:33
  • Can you expand on "The fields in the tables are similar though not identical"? Your answer to this question is important for determining the best course of action. – Alan Burstein Nov 22 '19 at 18:51
  • Thanks all for responding! @JohnCappelletti your approach using Google Maps API for standardizing seems straightforward and cost effective for us. Please explain needing "constant monitoring." After better understanding standardization I will continue with comparison/similarity metrics. Thanks again. – Kevin R. M. Nov 22 '19 at 21:58
  • @KevinR.M. If you create a homegrown method, there will ALWAYS be new exceptions and considerations. I'd suspect you would require an ALERT process for outliers. Years ago, I created a rather sophisticated linking methodology which outperformed Dun & Bradstreet and even Experian. Though I was very proud of the process, it would never be 100% or completely defensible. – John Cappelletti Nov 22 '19 at 22:08

1 Answer


Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0).

A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.

Given two strings, where M is the length of the shorter, N is the length of the longer, and D is your distance metric, you can measure string similarity as (M-D)/N. You can also use the Longest Common Subsequence or Longest Common Substring (LCS) to measure similarity as LCS/N.
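To make that concrete, here is a minimal pure-Python sketch (the names are mine, not from any library): a textbook dynamic-programming Levenshtein distance, normalized with the (M-D)/N formula above and clamped at zero since D can exceed M.

```python
# Minimal sketch: classic two-row dynamic-programming Levenshtein distance,
# then the (M - D) / N normalization described above.
def levenshtein(s1: str, s2: str) -> int:
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                current[j - 1] + 1,             # insertion
                previous[j] + 1,                # deletion
                previous[j - 1] + (c1 != c2),   # substitution (free if chars match)
            ))
        previous = current
    return previous[-1]

def similarity(s1: str, s2: str) -> float:
    m, n = sorted((len(s1), len(s2)))           # m = shorter length, n = longer
    if n == 0:
        return 1.0                              # two empty strings: treat as identical
    return max(0.0, (m - levenshtein(s1, s2)) / n)

print(similarity("123 N MAIN ST", "123 NORTH MAIN ST"))
```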

If you can use CLRs, I HIGHLY recommend mdq.similarity, which you can get from here. It will give a similarity metric using these algorithms:

  1. The Damerau-Levenshtein distance (the documentation only says "Levenshtein", but they are mistaken)
  2. The Jaccard Similarity coefficient algorithm (sketched in Python after this list)
  3. A form of the Jaro-Winkler distance algorithm
  4. A longest common subsequence algorithm (which grows by one when transpositions are involved)
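Here is a rough Python illustration of the Jaccard idea from that list, applied to whitespace-delimited tokens (mdq.similarity's actual tokenization may differ):

```python
# Jaccard similarity coefficient over word tokens:
# |intersection| / |union| of the two token sets.
def jaccard(s1: str, s2: str) -> float:
    a, b = set(s1.upper().split()), set(s2.upper().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard("123 N MAIN ST", "123 N MAIN STREET"))  # 3 shared of 5 distinct tokens -> 0.6
```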

If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.) Ditto my N-Grams function.

You can easily create your own metric using NGrams8K.
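The same roll-your-own idea expressed in Python (this is not NGrams8K itself, only an illustration; the bigram size and scoring choices are arbitrary):

```python
# Split each string into character bigrams, then score by the share of
# bigrams the two strings have in common, relative to the longer string.
from collections import Counter

def ngrams(s: str, n: int = 2) -> Counter:
    s = s.upper()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_similarity(s1: str, s2: str, n: int = 2) -> float:
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    if not g1 or not g2:
        return 0.0
    shared = sum((g1 & g2).values())            # multiset intersection of bigrams
    return shared / max(sum(g1.values()), sum(g2.values()))

print(ngram_similarity("123 N MAIN ST", "123 NORTH MAIN ST"))
```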

For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).

I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

Alan Burstein
  • I clarified my question above. There are 2 data sets (up to 9 eventually). You've given me more to run with regarding similarity metrics. Not sure which one I'll use yet but this sped up my investigation markedly. – Kevin R. M. Nov 22 '19 at 22:18