1

I have two tables filled with facility names and zip codes (and a bunch of irrelevant other data) each. The tables come from two separate sources and I'd like to link them. The problem is, the facility names are many times written slightly differently (abbreviations, apostrophes, etc.). The Zip code happens to be a good limiter, cutting down several thousand potential matches to only a handful. My problem is,

How can I match the two tables given a zip code match and an approximate name match:

|facility|zip  |         |facility|zip  |
|azyt    |10000| No Match|aaaa    |10000|
|abba    |10000| Match   |abb'    |10000|

One method I thought of was for each row in table A, comb through table B's zip codes and find all rows that have a matching zip code. Then somehow test which name is MOST similar, likely by comparing the left most characters (?). There is always a match, just not perfect.

Here's my incomplete code:

 Dim facilityName, facilityZip, otherName

  For i = 1 To Worksheets("Facility").UsedRange.Rows.Count
    facilityName = Worksheets("Facility").Cells(i, 2)
    facilityZip = Worksheets("Facility").Cells(i, 4)

    'Grab all rows in OTHER that have same ZIP
    For j = 1 To Worksheets("Other").UsedRange.Rows.Count
      otherName = Array()
      ub = UBound(otherName) + 1
      If Worksheets("Other").Cells(j, 3).Value = facilityZip Then
        ReDim Preserve otherName(0 To ub)
        otherName(ub) = j
      End If
    Next j

    'Compare names
     For Each rw In otherName
       'here I would compare each result to the current facilityName. There's likely a better way to do this...
     Next rw

    Next i

Either help on this code, a suggestion or even a different approach would be very helpful!

Thank you!

Community
  • 1
  • 1
ZAR
  • 2,550
  • 4
  • 36
  • 66
  • You don't really need the `otherName` array - you can do the comparison directly in the inner loop where you're scanning the "other" sheet. – Tim Williams Oct 20 '14 at 23:15
  • You're probably right. My thought is, since I'm looking for the CLOSEST match, I wanted to first store all possible matches and then find which is closest. Some matches will be exact while others may be process of elimination. Furthermore, I thought there may be a way to optimize the code by excluding any possible 'otherSheet' candidates that have already been selected. But that isn't necessary. – ZAR Oct 20 '14 at 23:22
  • 1
    Good point: if you need to rank possible matches then your existing code is a good approach. Exactly how you rank matches is the question i suppose: you could start off with a substring match and maybe something like checking the Levenshtein difference: http://stackoverflow.com/questions/21276721/find-near-duplicates-of-comma-separated-lists-using-levenshtein-distance/21375233#21375233 – Tim Williams Oct 20 '14 at 23:33
  • Hmm, i wonder how the levenshtein method would work here. For instance, one sheet might say 'garden meadow covalence' and match with simply 'garden meadow'. But adding that whole word might be more 'distant' than something like 'garden mall'. But maybe if we limit the size of the string to the smaller, hmmm. What do you think? – ZAR Oct 20 '14 at 23:42
  • I'm no expert in this area - Levenshtein Distance is the only "formal" method I know of, but I'm sure there are plenty of others. If your variation mostly occurs at the end of the strings then limiting the sizes might be a good idea. Basically you need to try out some different algorithms and see what gives you the "best" results. – Tim Williams Oct 21 '14 at 00:00
  • Yes, the answer is similar. In the end, the only difference is keeping the two string lengths equal. Thank you @Tim for your help and thank you Jean for the duplicate notification. Cheers guys! – ZAR Oct 21 '14 at 16:09

1 Answers1

0

Using the Levenshtein Distance method from here:

Dim facilityName, facilityZip, hudName, hudZip, provNum, lowerLen, levNum, tmpLev, resultHudNum, resultHudName


For i = 1 To Worksheets("Facility").UsedRange.Rows.Count
  With Worksheets("Facility")
    facilityName = .Cells(i, 2)
    facilityZip = .Cells(i, 3)
  End With
  levNum = 5
  resultHudNum = 0
  For j = 1 To Worksheets("Sheet1").UsedRange.Rows.Count
    With Worksheets("Sheet1")
      hudName = .Cells(j, 2)
      hudZip = .Cells(j, 4)
      hudNum = .Cells(j, 1)
    End With
    If hudZip = facilityZip Then
      If Len(facilityName) < Len(hudName) Then
        lowerLen = Len(facilityName)
      Else
        lowerLen = Len(hudName)
      End If
      tmpLev = Levenshtein(Mid(facilityName, 1, lowerLen), Mid(hudName, 1, lowerLen))
      If tmpLev < levNum Then
          levNum = tmpLev
          resultHudNum = hudNum
          resultHudName = hudName
      End If
    End If
  Next j
  If resultHudNum <> 0 Then
    Worksheets("Facility").Cells(i, 4).Value = resultHudNum
    Worksheets("Facility").Cells(i, 5).Value = resultHudName
  Else
    Worksheets("Facility").Cells(i, 4).Value = "000000"
    Worksheets("Facility").Cells(i, 5).Value = ""
  End If
Next i

Required level of similarity between stings can be modified by changing the value in levNum.

Thank you @Tim for the help!

Community
  • 1
  • 1
ZAR
  • 2,550
  • 4
  • 36
  • 66