0

Trying to figure out a way to calculate the minimum percentage match when comparing a string to a column.

Example:

Column A        Column B
Key             Keylime
Key Chain       Status
                Serious
                Extreme
                Key

Where

Column A        Column B     Column C                 Column D
Key             Temp         100%                     Key
Key Chain       Status       66.7%                    Key Ch
Ten             Key Ch       100%                     Tenure
                Extreme       
                Key
                Tenure 

To expand on this:

  • Column A is the column with strings to individually match
  • Column B is the reference column
  • Column C provides the highest percent match the column A string has with any string in column B.
  • Column D provides the word from column B associated with the highest percent match

To expand on Column C - when looking at Key Chain - the highest match to any word it has in column B is for Key Ch where 6 out of the 9 characters (including space) of Key Chain match to give a percentage match of (6/9) = 66.7%

  • That being said, this isn't a deal breaker but it is something that sticks out. The logic above fails when there's no way to penalize for matches where you see an example like Ten occur. Where Ten has 3 out of 3 characters that match against Tenure giving it an inflated 100% match that I still can't think of a way to correct against.
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
S31
  • 904
  • 2
  • 9
  • 21
  • 2
    This may be of interest [Levenshtein_distance on Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance). There are examples on SO for VBA – chris neilsen Jul 31 '18 at 04:39
  • 1
    do you need Levenshtein distance or just left-of-string comparisons? – iamanigeeit Aug 03 '18 at 07:35
  • @iamanigeeit left of string comparisons! – S31 Aug 03 '18 at 08:13
  • also, should Ten / Tenure give you 50% or 100%? – iamanigeeit Aug 03 '18 at 08:19
  • @iamanigeeit given the logic in the post above, it would be 100% but I did add the last bullet point in the post note to say - that yes, the correct answer would be 50%, I'm just not sure how to take that correction into account when approaching this – S31 Aug 03 '18 at 08:24
  • If you are willing to add some `vba` (or an addon), check out Levenshtein (as per @chrisneilsen's comment). Also here (https://stackoverflow.com/questions/4243036/levenshtein-distance-in-vba) – 4lackof Aug 05 '18 at 18:48
  • @S31 just divide by the length of the longer string, as i did in my answer below. – iamanigeeit Aug 08 '18 at 08:13

1 Answers1

1

This should work (i haven't tested it, currently on Linux). Call getStrMatch for each string.

Type StrMatch
    Percent As Double
    Word As String
End Type

Function getStrMatch(s As String, RefRange As Range) As StrMatch
    Dim i As Long, ref_str As String
    Dim BestMatch As StrMatch: BestMatch.Percent = -1
    Dim match_pc As Double
    With RefRange
        For i = 1 to .Cells.Count
            ref_str = .Cells(i).Value2
            match_pc = getMatchPc(s, ref_str)
            If match_pc > BestMatch.Percent Then
                BestMatch.Percent = match_pc
                BestMatch.Word = ref_str
            End If
        Next i
    End With
    getStrMatch = BestMatch
End Function

Function getMatchPc(s As String, ref_str As String) As Double
    Dim s_len As Long: s_len = Len(s)
    Dim ref_len As Long: ref_len = Len(ref_str) 
    Dim longer_len as Long
    If s_len > ref_len Then longer_len = s_len Else longer_len = ref_len
    Dim m As Long: m = 1
    While m <= longer_len
        If Mid(s, m, 1) <> Mid(ref_str, m, 1) Then Exit While
        m = m + 1
    Wend
    getMatchPc = (m - 1.0) / longer_len
End Function

Note that you have to put this in a module or else declare Private Type and Private Function.

Also, if you're matching a lot of strings you probably should create a trie instead, as this is only doing naive string compares and each getStrMatch costs O(mn) where m is the size of the RefRange and n is the average ref_str length.

iamanigeeit
  • 784
  • 1
  • 6
  • 11