
I'm really just looking for some kind of tool that will check for close approximations of duplicates in a column of data. For instance, say I have a column of data with addresses as such:

  • 113 James Way
  • 3448 Harlon Circle
  • 5888 Murray Rd
  • 3448 Harlon Cr.

In this case entries 2 and 4 are very close to being duplicates, and I would like some kind of tool, either in Excel or standalone, that would notify me when rows are duplicated or approximately duplicated. I have no idea how to even search for something like this. I tried searching for fuzzy match tools and the like, but nothing is quite what I need. Thanks,

2 Answers


There are several ways to approach this.

One simple method is to write a Levenshtein function, compare the addresses with each other, and highlight the low values.

Assume the data is set up with the addresses in column B, starting at cell B3. The macro writes its results starting at C3.

Raw example

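Sub FindClosestMatch()
    ' Assumes at least two addresses in column B starting at B3, with no blank
    ' cells in between, and a Levenshtein(string1, string2) function in a module.
    Dim mystrings As Variant
    Dim i As Long, j As Long
    Dim string1 As String, string2 As String

    ' Read the contiguous block of addresses into a 2-D, 1-based array
    mystrings = Range(Range("B3"), Range("B3").End(xlDown)).Value

    ' Fill the distance matrix starting at C3: row i, column j holds the
    ' distance between address i and address j
    For i = LBound(mystrings, 1) To UBound(mystrings, 1)
        string1 = mystrings(i, 1)
        For j = LBound(mystrings, 1) To UBound(mystrings, 1)
            string2 = mystrings(j, 1)
            Range("C3").Offset(i - 1, j - 1).Value = Levenshtein(string1, string2)
        Next j
    Next i
End Sub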

How to read values

For example, the row 113 James Way 0 15 13 12 means the string has a score of

  • 0 (exact match) with itself
  • 15 with 3448 Harlon Circle
  • 13 with 5888 Murray Rd
  • 12 with 3448 Harlon Cr.

and so on for the other rows.

The macro just compares every address with every other address and finds the Levenshtein distance between them.

The lower the number, the closer the match; 0 is an exact match, which is what you get when a string is compared to itself.

This macro assumes you have copied a Levenshtein function into your VBA module.
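If a standalone tool is acceptable, the same idea can be sketched outside Excel. The following is a minimal Python sketch, not the function the macro above relies on: `levenshtein` is a plain dynamic-programming edit distance, and `close_pairs` and its `max_distance` cutoff are hypothetical names introduced here for illustration. The cutoff of 4 is an assumption you would need to tune for your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def close_pairs(addresses, max_distance):
    """Return (i, j, distance) for every pair of rows whose distance
    is at most max_distance (a hypothetical cutoff, tune as needed)."""
    pairs = []
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            d = levenshtein(addresses[i], addresses[j])
            if d <= max_distance:
                pairs.append((i, j, d))
    return pairs


addresses = [
    "113 James Way",
    "3448 Harlon Circle",
    "5888 Murray Rd",
    "3448 Harlon Cr.",
]
for i, j, d in close_pairs(addresses, max_distance=4):
    print(f"{addresses[i]!r} ~ {addresses[j]!r} (distance {d})")
```

Only each pair is compared once here (the matrix is symmetric), so the work is roughly half of what the macro does, but it still grows with the square of the number of rows.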

Ravi Yenugu

It really depends on how accurate you need it to be and what kind of close matches you want to catch. Catching typos would be a lot harder. But if you're mainly looking to catch things like St vs Street, you could do a VLOOKUP on LEFT(address, #). You might have to toy with the # to get good results: it needs to be higher than the number of digits in the street numbers (4 or 5?) but small enough to still catch short addresses like 1 dry ct. I'd guess 7-8.

Basically, your addresses are in column A (assuming they start in A2, with headers in row 1). Column B holds =LEFT(A2,8), filled down. A2 is obviously unique because it's first, so start in C3 with =VLOOKUP(LEFT(A3,8),$B$2:B2,1,0) and fill down.

It'll print an error for all the unique entries and the matching prefix for the duplicates. To make it cleaner, you can wrap it in IF(ISERROR()): =IF(ISERROR(VLOOKUP(LEFT(A3,8),$B$2:B2,1,0)),"",VLOOKUP(LEFT(A3,8),$B$2:B2,1,0))
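For comparison, here is a rough Python sketch of the same prefix trick, in case you'd rather run it outside Excel. `prefix_duplicates` and `prefix_len` are hypothetical names used only for this illustration. Lowercasing the prefix is an assumption made to mimic VLOOKUP's case-insensitive matching. Unlike the formula, it reports the full earlier address rather than just the prefix.

```python
def prefix_duplicates(addresses, prefix_len=8):
    """Flag each address whose first prefix_len characters match an
    earlier address. Returns (address, earlier_address) pairs."""
    first_seen = {}  # lowercased prefix -> first address that had it
    duplicates = []
    for address in addresses:
        key = address[:prefix_len].lower()  # like LEFT(A3,8), ignoring case
        if key in first_seen:
            duplicates.append((address, first_seen[key]))
        else:
            first_seen[key] = address
    return duplicates


addresses = [
    "113 James Way",
    "3448 Harlon Circle",
    "5888 Murray Rd",
    "3448 Harlon Cr.",
]
for address, earlier in prefix_duplicates(addresses):
    print(f"{address!r} looks like a duplicate of {earlier!r}")
```

As with the formula, this only catches differences that come after the prefix, so a typo in the street number or the start of the street name will slip through.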