There are a lot of algorithms for doing fuzzy matching. One of the easier ones to implement in excel is N-Gram.
To perform an n-gram match, we have to break each address up into a list of sets of smaller character lengths. A 2-gram list of your address 246 N High street
would look like 24,46,6 , N,N , H,Hi,ig,gh,h , s,st,tr,re,ee,et
. We could do the same with a 3-gram: 246,46 ,6 N, N ,N H, Hi,Hig,igh,gh ,h s, st,str,tre,ree,eet
We do this with both addresses, then we can check each item in the first address's list to see if it appears in the second address's list; count the matches and divide that by the number of items in the first list. That will give you a percentage of how close they are.
You could get fancy with cell formulas mid()
and countif()
to do this with sheet formulas, but I think it's easier to just write it out in VBA and make it a UDF.
Function NGramCompare(string1 As String, string2 As String, intGram As Integer) As Double
'Take in two strings and the N-gram
Dim intChar As Integer, intGramMatch As Integer
Dim ngramList1 As String, ngramList2 As String, nGram As Variant
Dim nGramArr1 As Variant
'split the first string into a list of ngrams
For intChar = 1 To Len(string1) - (intGram-1)
If ngramList1 <> "" Then ngramList1 = ngramList1 & ","
ngramList1 = ngramList1 & Mid(string1, intChar, intGram)
Next intChar
'split the secong string into a list of ngrams
For intChar = 1 To Len(string2) - (intGram-1)
If ngramList2 <> "" Then ngramList2 = ngramList2 & ","
ngramList2 = ngramList2 & Mid(string2, intChar, intGram)
Next intChar
'Split the ngramlist1 into an array through which we can iterate
nGramArr1 = Split(ngramList1, ",")
'Iterate through array and compare values to ngramlist2
For Each nGram In nGramArr1
If InStr(1, ngramList2, nGram) Then
'we found a match, add to the counter
intGramMatch = intGramMatch + 1
End If
Next nGram
'output the percentage of grams matching.
NGramCompare = intGramMatch / (UBound(nGramArr1) + 1)
End Function
If you've never used a UDF:
- Go to visual basic editor (VBE) with Alt+F11
- In the VBA Project window, find your workbook and right click on the name
- Choose: Insert>>Module
- Double click the new module in the list to bring up it's code window
- Paste this function in and save your workbook
Then, assuming address1 is in A1
and address2 is in B1
you can put, in C1
:
=NGramCompare(A1, B1, 2)
Which, for your first address, will spit out 56%. Which seems like a reasonably good match. If you find you are getting too many positive hits, you can change your 2-gram to be a 3-gram by changing that last parameter.
To take it a step further so it will say "Update" or "Omit" you could do:
=If(NGramCompare(A1, B1, 2)>.30, "Update", "Omit")
I just set that so that it will consider a match anything above 30%, but you can adjust as necessary. No matter where you set it, you will probably end up with a percentage of compares that are false positives or false negatives, but that's the way fuzzy matching goes.