61

I have excel sheet with data which I want to get Levenshtein Distance between them. I already tried to export as text, read in from script (php), run Levenshtein (calculate Levenshtein Distance), save it to excel again.

But I am looking for a way to programatically calculate a Levenshtein Distance in VBA. How would I go about doing so?

Jsleshem
  • 715
  • 1
  • 10
  • 31
Yousf
  • 3,957
  • 3
  • 27
  • 37

4 Answers4

68

Translated from Wikipedia :

Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer

l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function

?Levenshtein("saturday","sunday")

3

smirkingman
  • 6,167
  • 4
  • 34
  • 47
  • 1
    This code works drag and drop for Access VBA too. :) – HelloW Jan 24 '14 at 14:26
  • 3
    Quick note for future users, VBA `Integer` declares *should* use less memory and be faster, but they are now automatically converted to `Long` type behind the scenes (source: [MSDN](https://msdn.microsoft.com/en-us/library/office/aa164506(v=office.10).aspx), see [this](http://stackoverflow.com/a/26409520/6609896) too). So for marginal performance increase, declaring them all as `Long` saves on the internal conversion time (some other answers I see have made use of this). OR, if your strings are under 255 characters in length, declare as `Bytes` as this requires even less memory than `Integer`. – Greedo May 21 '17 at 12:22
30

Thanks to smirkingman for the nice code post. Here is an optimized version.

1) Use Asc(Mid$(s1, i, 1) instead. Numerical comparision is generally faster than text.

2) Use Mid$ istead of Mid since the later is the variant ver. and adding $ is string ver.

3) Use application function for min. (personal preference only)

4) Use Long instead of Integers since it's what excel natively uses.

Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long

Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long

string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)

For i = 0 To string1_length
    distance(i, 0) = i
Next

For j = 0 To string2_length
    distance(0, j) = j
Next

For i = 1 To string1_length
    For j = 1 To string2_length
        If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
            distance(i, j) = distance(i - 1, j - 1)
        Else
            distance(i, j) = Application.WorksheetFunction.Min _
            (distance(i - 1, j) + 1, _
             distance(i, j - 1) + 1, _
             distance(i - 1, j - 1) + 1)
        End If
    Next
Next

Levenshtein = distance(string1_length, string2_length)

End Function

UPDATE:

For those who want it: I think it's safe to say that most people use Levenshtein distance to calculate fuzzy match percentages. Here's a way to do that, and I have added an optimization that you can specify the min. match % to return (default is 70%+. You enter percentags like "50" or "80", or "0" to run the formula regardless).

The speed boost comes from the fact that the function will check if it's even possible that it's within the percentage you give it by checking the length of the 2 strings. Please note there are some areas where this function can be optimized, but I have kept it at this for the sake of readability. I concatenated the distance in result for proof of functionality, but you can fiddle with it :)

Function FuzzyMatch(ByVal string1 As String, _
                    ByVal string2 As String, _
                    Optional min_percentage As Long = 70) As String

Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long, result As Long

string1_length = Len(string1)
string2_length = Len(string2)

' Check if not too long
If string1_length >= string2_length * (min_percentage / 100) Then
    ' Check if not too short
    If string1_length <= string2_length * ((200 - min_percentage) / 100) Then

        ReDim distance(string1_length, string2_length)
        For i = 0 To string1_length: distance(i, 0) = i: Next
        For j = 0 To string2_length: distance(0, j) = j: Next

        For i = 1 To string1_length
            For j = 1 To string2_length
                If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
                    distance(i, j) = distance(i - 1, j - 1)
                Else
                    distance(i, j) = Application.WorksheetFunction.Min _
                    (distance(i - 1, j) + 1, _
                     distance(i, j - 1) + 1, _
                     distance(i - 1, j - 1) + 1)
                End If
            Next
        Next
        result = distance(string1_length, string2_length) 'The distance
    End If
End If

If result <> 0 Then
    FuzzyMatch = (CLng((100 - ((result / string1_length) * 100)))) & _
                 "% (" & result & ")" 'Convert to percentage
Else
    FuzzyMatch = "Not a match"
End If

End Function
Gaijinhunter
  • 14,587
  • 4
  • 51
  • 57
  • 3
    +1 for great optimization, but you may also want to declare the function's return type (I assume String?). – JimmyPena Nov 09 '11 at 13:28
  • Good catch - should definitely declare the return type. I'll have to try but I recall having some issues when I tried to declare it (seemed to want a variant). – Gaijinhunter Nov 09 '11 at 14:01
  • Actually, "distance" is a Long type so the return type should be Long? – JimmyPena Nov 09 '11 at 14:42
  • 11
    My version takes ~0.032 milliseconds per call. Your 'optimised' version takes ~7.937, which is about 250 times slower. Removing (the useless) Application.Screenupdating brings your time down to 0.422, only 14 times slower. Replacing your (useless) call to Worksheetfunction.min with my MIN code brings your time down to 0.032; back to where we started (ASC actually is marginally slower). – smirkingman Nov 16 '11 at 11:36
  • Thanks for the advice smirking man. I do agree the screen updating is a bad choice so I deleted it. I'll look into the others as well. – Gaijinhunter Nov 16 '11 at 11:55
  • Publish a solution that is 10% faster and I'll remove the -1 >;-) – smirkingman Nov 16 '11 at 15:56
  • Perhaps I'll try :) I've learned a lot since this answer was posted so I can definitely see where I could have improved it. :) cheers – Gaijinhunter Nov 16 '11 at 16:57
  • I've found the best optimization is deciding a min. % (like 70% matches or above) and doing a length check before running the core code since the length difference will tell you if it's impossible that it's a 70%+ match or not. – Gaijinhunter Nov 22 '11 at 00:06
  • @smirkingman Well now I'm confused....is this optimized version faster, or a lot slower? It's completely not clear, but my gut feel is that smirkingman is more correct because he includes numbers. – tbone Jun 21 '18 at 23:07
  • 4
    @tbone My comment addressed Aevenko's initial version, years ago. It seems he has updated the answer correspondingly. Best bet: test it yourself >;-) – smirkingman Jun 26 '18 at 11:33
  • 1
    Anyone still wondering, I have tested all of the functions on this post, and the winner goes to @Patrick OBeirne. – CK1 Nov 26 '21 at 04:07
  • Using the min and max worksheet functions is very slow, as noted above by smirkingman, but not changed. See here: https://www.soa.org/news-and-publications/newsletters/compact/2012/january/com-2012-iss42/excel-vba-speed-and-efficiency/ – Mark E. Apr 24 '22 at 00:01
26

Use a byte array for 17x speed gain

  Option Explicit

  Public Declare Function GetTickCount Lib "kernel32" () As Long

  Sub test()
  Dim s1 As String, s2 As String, lTime As Long, i As Long
  s1 = Space(100)
  s2 = String(100, "a")
  lTime = GetTickCount
  For i = 1 To 100
     LevenshteinStrings s1, s2  ' the original fn from Wikibooks and Stackoverflow
  Next
  Debug.Print GetTickCount - lTime; " ms" '  3900  ms for all diff

  lTime = GetTickCount
  For i = 1 To 100
     Levenshtein s1, s2
  Next
  Debug.Print GetTickCount - lTime; " ms" ' 234  ms

  End Sub

  'Option Base 0 assumed

  'POB: fn with byte array is 17 times faster
  Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long

  Dim i As Long, j As Long, bs1() As Byte, bs2() As Byte
  Dim string1_length As Long
  Dim string2_length As Long
  Dim distance() As Long
  Dim min1 As Long, min2 As Long, min3 As Long

  string1_length = Len(string1)
  string2_length = Len(string2)
  ReDim distance(string1_length, string2_length)
  bs1 = string1
  bs2 = string2

  For i = 0 To string1_length
      distance(i, 0) = i
  Next

  For j = 0 To string2_length
      distance(0, j) = j
  Next

  For i = 1 To string1_length
      For j = 1 To string2_length
          'slow way: If Mid$(string1, i, 1) = Mid$(string2, j, 1) Then
          If bs1((i - 1) * 2) = bs2((j - 1) * 2) Then   ' *2 because Unicode every 2nd byte is 0
              distance(i, j) = distance(i - 1, j - 1)
          Else
              'distance(i, j) = Application.WorksheetFunction.Min _
              (distance(i - 1, j) + 1, _
               distance(i, j - 1) + 1, _
               distance(i - 1, j - 1) + 1)
              ' spell it out, 50 times faster than worksheetfunction.min
              min1 = distance(i - 1, j) + 1
              min2 = distance(i, j - 1) + 1
              min3 = distance(i - 1, j - 1) + 1
              If min1 <= min2 And min1 <= min3 Then
                  distance(i, j) = min1
              ElseIf min2 <= min1 And min2 <= min3 Then
                  distance(i, j) = min2
              Else
                  distance(i, j) = min3
              End If

          End If
      Next
  Next

  Levenshtein = distance(string1_length, string2_length)

  End Function
Patrick OBeirne
  • 277
  • 3
  • 2
19

I think it got even faster... Didn't do much other than improve previous code for speed and results as %

' Levenshtein3 tweaked for UTLIMATE speed and CORRECT results
' Solution based on Longs
' Intermediate arrays holding Asc()make difference
' even Fixed length Arrays have impact on speed (small indeed)
' Levenshtein version 3 will return correct percentage
'
Function Levenshtein3(ByVal string1 As String, ByVal string2 As String) As Long

Dim i As Long, j As Long, string1_length As Long, string2_length As Long
Dim distance(0 To 60, 0 To 50) As Long, smStr1(1 To 60) As Long, smStr2(1 To 50) As Long
Dim min1 As Long, min2 As Long, min3 As Long, minmin As Long, MaxL As Long

string1_length = Len(string1):  string2_length = Len(string2)

distance(0, 0) = 0
For i = 1 To string1_length:    distance(i, 0) = i: smStr1(i) = Asc(LCase(Mid$(string1, i, 1))): Next
For j = 1 To string2_length:    distance(0, j) = j: smStr2(j) = Asc(LCase(Mid$(string2, j, 1))): Next
For i = 1 To string1_length
    For j = 1 To string2_length
        If smStr1(i) = smStr2(j) Then
            distance(i, j) = distance(i - 1, j - 1)
        Else
            min1 = distance(i - 1, j) + 1
            min2 = distance(i, j - 1) + 1
            min3 = distance(i - 1, j - 1) + 1
            If min2 < min1 Then
                If min2 < min3 Then minmin = min2 Else minmin = min3
            Else
                If min1 < min3 Then minmin = min1 Else minmin = min3
            End If
            distance(i, j) = minmin
        End If
    Next
Next

' Levenshtein3 will properly return a percent match (100%=exact) based on similarities and Lengths etc...
MaxL = string1_length: If string2_length > MaxL Then MaxL = string2_length
Levenshtein3 = 100 - CLng((distance(string1_length, string2_length) * 100) / MaxL)

End Function
Apostolos55
  • 574
  • 3
  • 8