Fuzzy string matching optimization (not checking certain words) - Excel VBA function

Question

I have a function in Excel that calculates the Levenshtein Distance between two strings (the number of insertions, deletions, and/or substitutions needed to transform one string into another). I am using this as part of a project I'm working on that involves "fuzzy string matching."

Below you will see the code for the LevenshteinDistance function and a valuePhrase function. The latter exists for the purposes of executing the function in my spreadsheet. I have taken this from what I read in this thread.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)`

Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, cost As Long 'loop counters and cost of 
        'substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and 
        Substitution

    L1 = Len(S1): L2 = Len(S2)
    ReDim D(0 To L1, 0 To L2)
    For i = 0 To L1: D(i, 0) = i: Next i
    For j = 0 To L2: D(0, j) = j: Next j

    For j = 1 To L2
        For i = 1 To L1
            cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
            cI = D(i - 1, j) + 1
            cD = D(i, j - 1) + 1
            cS = D(i - 1, j - 1) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
            End If
        Next i
    Next j
    LevenshteinDistance = D(L1, L2)

End Function

Public Function valuePhrase#(ByRef S1$, ByRef S2$)

    valuePhrase = LevenshteinDistance(S1, S2)

End Function

I am executing this valuePhrase function in a table in one of my sheets where the column and row headers are names of insurance companies. Ideally, the smallest number in any given row (the shortest Levenshtein distance) should correspond to a column header with the name of the insurance company in the table that most closely matches the name of that insurance company in the row header.

My problem is that I am trying to calculate this in a case where the strings in question are names of insurance companies. With that in mind, the code above strictly calculates the Levenshtein distance and is not tailored specifically to this case. To illustrate, a simple example of why this can be an issue is because the Levenshtein distance between two insurance company names can be quite small if they both share the words "insurance" and "company" (which, as you might expect, is common), even if the insurance companies have totally different names with respect to their unique words. So, I may want the function to ignore those words when comparing two strings.

I am new to VBA. Is there a way I can implement this fix in the code? As a secondary question, are there other unique issues that could arise from comparing the names of insurance companies? Thank you for the help!

replace "insurance" and "company" in `s1` and `s2` by an empty string before calling `LevenshteinDistance` — Florent B., Oct 24 '17 at 15:39
@FlorentB. I'm not sure that I understand your answer completely. Where am I replacing the words "insurance" and "company" with an empty string? In the names of the insurance companies in my sheet? I still want to be able to see the full name of the company (for example, if it says "All America Insurance Company," I don't want my column header to be changed to "All America"). I want the function to ignore those words without having to remove those words from the strings in the table itself. — larryltj, Oct 24 '17 at 15:43
Note that your Levenshtein Distance function can be optimised further, by using Byte Arrays instead of Strings. Give this a read:https://sysmod.wordpress.com/2012/07/27/string-fuzzy-matching-in-vba-and-vb-net/ — jeffreyweir, Nov 08 '17 at 03:38
Question: how many strings are you comparing, and how long is this currently taking? Levenshtein is computationally expensive, and you only want to use it as a last resort. I use a cascading triage logic in my fuzzy matching algorithm, that exits when it finds a suitable candidate based on direct matching, alias matching, partial substring matching, and lastly, Levenshtein. This means that I only need to pull out the 'big guns' of Levenshtein when I absolutely need it. — jeffreyweir, Nov 08 '17 at 03:41

Vityata · Accepted Answer · 2017-10-24T16:45:23.470

1

Your whole question can be replaced by "How do I use the replace function in VBA?". In general, the algorithm in the question looked interesting, thus I have done this for you. Simply add anything in the Array() of the function, it will work (Just write in lower case the values in the array):

Public Function removeSpecificWords(s As String) As String

 Dim arr     As Variant
 Dim cnt     As Long

 arr = Array("insurance", "company", "firma", "firm", "holding")
 removeSpecificWords = s

 For cnt = LBound(arr) To UBound(arr)
  removeSpecificWords = Replace(LCase(removeSpecificWords), LCase(arr(cnt)), vbNullString)
 Next cnt

End Function

Public Sub TestMe()

    Debug.Print removeSpecificWords("InsHolding")
    Debug.Print removeSpecificWords("InsuranceInsHoldingStar")

End Sub

In your case:

    S1 = removeSpecificWords(S1)
    S2 = removeSpecificWords(S2)
    valuePhrase = LevenshteinDistance(S1, S2)

edited Oct 24 '17 at 16:45

answered Oct 24 '17 at 15:42

Vityata

42,633
8
55
100

This helps so much, thank you! Follow-up question: I noticed that, once I implemented the code you've written here, I started getting more accurate data. However, there are also instances where "Company" is abbreviated to "Co", "Co.", etc. I tried adding those particular strings to the array, but this did not change the Levenshtein distance for strings which contained these particular words. Could you tell me the reason for this? – larryltj Oct 24 '17 at 16:16
@larryltj - it depends at which position you have added "Co" and "Co.". If you have added them here `Array("insurance", "company", "firma", "firm", "holding", "co")` it should not have changed. Furthermore, add them only in lowercase. Like this `co` and `co.`. – Vityata Oct 24 '17 at 16:43
Actually ignore the case, I have rewritten it, so it works. – Vityata Oct 24 '17 at 16:46

score 0 · Answer 2 · answered Oct 24 '17 at 15:46

When I had a similar issue in trying to remove duplicate addresses, I approached the problem the other way and used the Longest Common Substring.

Function DetermineLCS(source As String, target As String) As Double
    Dim results() As Long
    Dim sourceLen As Long
    Dim targetLen As Long
    Dim counter1 As Long
    Dim counter2 As Long

    sourceLen = Len(source)
    targetLen = Len(target)

    ReDim results(0 To sourceLen, 0 To targetLen)

    For counter1 = 1 To sourceLen
        For counter2 = 1 To targetLen
            If Mid$(source, counter1, 1) = Mid$(target, counter2, 1) Then
                results(counter1, counter2) = results(counter1 - 1, counter2 - 1) + 1
            Else
                results(counter1, counter2) = WorksheetFunction.Max(results(counter1, _
                        counter2 - 1), results(counter1 - 1, counter2))
            End If
        Next counter2
    Next counter1

    'return the percentage of the LCS to the length of the source string
    DetermineLCS = results(sourceLen, targetLen) / sourceLen
End Function

For addresses, I've found that about an 80% match gets me close to a hundred percent matches. with insurance agency names (and I used to work in the industry, so I know the problem you face), I might suggest a 90% target or even a mix of the Levenshtein Distance and LCS, minimizing the former while maximizing the latter.

Fuzzy string matching optimization (not checking certain words) - Excel VBA function

2 Answers2

Linked