
My SQL Server database contains 2 tables, each with a similar set of fields for a mailing (physical) address. NB: these tables are populated before the data gets to my database (I can't change that). The sets of fields in the two tables are similar though not identical: most fields exist in both tables, some only in one, some only in the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.

Example fields:

  1. Street Number
  2. Predirection
  3. Street Name
  4. Street Suffix
  5. Postdirection (in one table and not the other)
  6. Unit name (one table) vs. Address 2 (other table) -- adds complexity
  7. Zip code (length varies between the tables: 5 vs. 5+ digits)
  8. Legal description

Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source is highly preferred by the client.

Among options such as SOUNDEX, DIFFERENCE, and Levenshtein distance (all SQL), and usaddress and dedupe (Python), none stands out as a good-fit solution.
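For context, here is roughly what parsing with usaddress looks like in Python. The sample address is made up, and the label names come from usaddress's documented tag set, so they're worth verifying against the installed version:

```python
# Rough sketch of standardizing a free-form address into labeled components
# with the usaddress package (pip install usaddress), one of the Python
# options mentioned above. Comparing these components field by field is one
# way to line the two tables' schemas up before scoring a match.
import usaddress

raw = "123 N Main St Apt 4 Springfield 62704"   # made-up example address

try:
    components, addr_type = usaddress.tag(raw)  # OrderedDict of labeled parts
except usaddress.RepeatedLabelError:
    components, addr_type = {}, "Ambiguous"     # handle ambiguous input as needed

print(addr_type)                                    # e.g. 'Street Address'
print(components.get("AddressNumber"))              # street number
print(components.get("StreetNamePreDirectional"))   # predirection
print(components.get("StreetName"))                 # street name
print(components.get("StreetNamePostType"))         # street suffix
print(components.get("OccupancyIdentifier"))        # unit / address 2
print(components.get("ZipCode"))                    # zip code
```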

Kevin R. M.
    "high confidence" is the problem...you'l have to standardize the addresses in both tables before you can try and match them. You could look at something like Accuzip.com...but like I said you'll first need to standardize the addresses before you try to compare them. – BillRuhl Nov 22 '19 at 18:10
  • I agree with @BillRuhl about standardizing both addresses and then comparing them. Here is one [option](https://smartystreets.com/articles/usps-api), but it is going to cost something. – Isaac Nov 22 '19 at 18:14
  • Address standardization is trickier than one may think and will require constant monitoring. Take a peek at https://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 – John Cappelletti Nov 22 '19 at 18:33
  • Can you expand on "The fields in the tables are similar though not identical"? Your answer to this question is important for determining the best course of action. – Alan Burstein Nov 22 '19 at 18:51
  • Thanks all for responding! @JohnCappelletti your approach using Google Maps API for standardizing seems straightforward and cost effective for us. Please explain needing "constant monitoring." After better understanding standardization I will continue with comparison/similarity metrics. Thanks again. – Kevin R. M. Nov 22 '19 at 21:58
  • @KevinR.M. If you create a homegrown method, there will ALWAYS be new exceptions and considerations. I'd suspect you would require an ALERT process for outliers. Years ago, I created a rather sophisticated linking methodology which outperformed Dun & Bradstreet and even Experian. Though I was very proud of the process, it would never be 100% or completely defensible. – John Cappelletti Nov 22 '19 at 22:08

1 Answer


Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0).

A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.

Given two strings, where M is the length of the shorter, N is the length of the longer, and D is your distance metric, you can measure string similarity as (M-D)/N. You can also use the Longest Common Subsequence or Longest Common Substring (LCS) to measure similarity as LCS/N.
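To make that concrete, here is a minimal pure-Python sketch (the names are mine, not from any library): a textbook dynamic-programming Levenshtein distance, normalized with the (M-D)/N formula above and clamped at zero since D can exceed M.

```python
# Minimal sketch: classic two-row dynamic-programming Levenshtein distance,
# then the (M - D) / N normalization described above.
def levenshtein(s1: str, s2: str) -> int:
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                current[j - 1] + 1,             # insertion
                previous[j] + 1,                # deletion
                previous[j - 1] + (c1 != c2),   # substitution (free if chars match)
            ))
        previous = current
    return previous[-1]

def similarity(s1: str, s2: str) -> float:
    m, n = sorted((len(s1), len(s2)))           # m = shorter length, n = longer
    if n == 0:
        return 1.0                              # two empty strings: treat as identical
    return max(0.0, (m - levenshtein(s1, s2)) / n)

print(similarity("123 N MAIN ST", "123 NORTH MAIN ST"))
```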

If you can use CLRs, I HIGHLY recommend mdq.similarity, which you can get from here. It will give a similarity metric using these algorithms:

  1. The Damerau-Levenshtein distance (the documentation only says "Levenshtein", but they are mistaken)
  2. The Jaccard Similarity coefficient algorithm (sketched in Python after this list)
  3. A form of the Jaro-Winkler distance algorithm
  4. A longest common subsequence algorithm (which grows by one when transpositions are involved)
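Here is a rough Python illustration of the Jaccard idea from that list, applied to whitespace-delimited tokens (mdq.similarity's actual tokenization may differ):

```python
# Jaccard similarity coefficient over word tokens:
# |intersection| / |union| of the two token sets.
def jaccard(s1: str, s2: str) -> float:
    a, b = set(s1.upper().split()), set(s2.upper().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard("123 N MAIN ST", "123 N MAIN STREET"))  # 3 shared of 5 distinct tokens -> 0.6
```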

If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.) Ditto my N-Grams function.

You can easily create your own metric using NGrams8K.
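The same roll-your-own idea expressed in Python (this is not NGrams8K itself, only an illustration; the bigram size and scoring choices are arbitrary):

```python
# Split each string into character bigrams, then score by the share of
# bigrams the two strings have in common, relative to the longer string.
from collections import Counter

def ngrams(s: str, n: int = 2) -> Counter:
    s = s.upper()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_similarity(s1: str, s2: str, n: int = 2) -> float:
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    if not g1 or not g2:
        return 0.0
    shared = sum((g1 & g2).values())            # multiset intersection of bigrams
    return shared / max(sum(g1.values()), sum(g2.values()))

print(ngram_similarity("123 N MAIN ST", "123 NORTH MAIN ST"))
```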

For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).

I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

Alan Burstein
  • I clarified my question above. There are 2 data sets (up to 9 eventually). You've given me more to run with regarding similarity metrics. Not sure which one I'll use yet but this sped up my investigation markedly. – Kevin R. M. Nov 22 '19 at 22:18