3

I have a list of items and I want to identify their similarity in relation to the other items in this list.

My desired output would be something along the lines of: enter image description here

The percentage shown in the similarity column is purely illustrative. I'm thinking that a test for similarity would be something along the lines of:

number of concurrent letters / by the total number of letters in the matched item

But would be keen to get opinions on that one.

Is this something which is reasonably doable on Excel? I'ts a small data set (140kb) containing only alphanumeric values.

Am also open to alternative ways of approaching this, as I haven't tackled anything like this before!

P.s. I've been learning Python for a few months now, so suggestions using Python would also be good!

Community
  • 1
  • 1
Maverick
  • 789
  • 4
  • 24
  • 45

3 Answers3

2

Here is a solution using a VBA UDF:

EDIT: Added a new optional argument named arg_lMinConsecutive which is used to determine the minimum number of consecutive characters that must match. Note the extra argument 2 in the below formulas which indicates that at least 2 consecutive characters must match.

Public Function FuzzyMatch(ByVal arg_sText As String, _
                           ByVal arg_vList As Variant, _
                           ByVal arg_lOutput As Long, _
                           Optional ByVal arg_lMinConsecutive As Long = 1, _
                           Optional ByVal arg_bMatchCase As Boolean = True, _
                           Optional ByVal arg_bExactCount As Boolean = True) _
                As Variant

    Dim dExactCounts As Object
    Dim aResults() As Variant
    Dim vList As Variant
    Dim vListItem As Variant
    Dim sLetter As String
    Dim dMaxMatch As Double
    Dim lMaxIndex As Long
    Dim lResultIndex As Long
    Dim lLastMatch As Long
    Dim i As Long
    Dim bMatch As Boolean

    If arg_lMinConsecutive <= 0 Then
        FuzzyMatch = CVErr(xlErrNum)
        Exit Function
    End If

    If arg_bExactCount = True Then Set dExactCounts = CreateObject("Scripting.Dictionary")

    If TypeName(arg_vList) = "Collection" Or TypeName(arg_vList) = "Range" Then
        ReDim aResults(1 To arg_vList.Count, 1 To 3)
        Set vList = arg_vList
    ElseIf IsArray(arg_vList) Then
        ReDim aResults(1 To UBound(arg_vList) - LBound(arg_vList) + 1, 1 To 3)
        vList = arg_vList
    Else
        ReDim vList(1 To 1)
        vList(1) = arg_vList
        ReDim aResults(1 To 1, 1 To 3)
    End If

    dMaxMatch = 0#
    lMaxIndex = 0
    lResultIndex = 0

    For Each vListItem In vList
        If vListItem <> arg_sText Then
            lLastMatch = -arg_lMinConsecutive
            lResultIndex = lResultIndex + 1
            aResults(lResultIndex, 3) = vListItem
            If arg_bExactCount Then dExactCounts.RemoveAll
            For i = 1 To Len(arg_sText) - arg_lMinConsecutive + 1
                bMatch = False
                sLetter = Mid(arg_sText, i, arg_lMinConsecutive)
                If Not arg_bMatchCase Then sLetter = LCase(sLetter)
                If arg_bExactCount Then dExactCounts(sLetter) = dExactCounts(sLetter) + 1

                Select Case Abs(arg_bMatchCase) + Abs(arg_bExactCount) * 2
                    Case 0
                        'MatchCase is false and ExactCount is false
                        If InStr(1, vListItem, sLetter, vbTextCompare) > 0 Then bMatch = True

                    Case 1
                        'MatchCase is true and ExactCount is false
                        If InStr(1, vListItem, sLetter) > 0 Then bMatch = True

                    Case 2
                        'MatchCase is false and ExactCount is true
                        If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString, Compare:=vbTextCompare)) >= dExactCounts(sLetter) Then bMatch = True

                    Case 3
                        'MatchCase is true and ExactCount is true
                        If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString)) >= dExactCounts(sLetter) Then bMatch = True

                End Select

                If bMatch Then
                    aResults(lResultIndex, 1) = aResults(lResultIndex, 1) + WorksheetFunction.Min(arg_lMinConsecutive, i - lLastMatch)
                    lLastMatch = i
                End If
            Next i
            If Len(vListItem) > 0 Then
                aResults(lResultIndex, 2) = aResults(lResultIndex, 1) / Len(vListItem)
                If aResults(lResultIndex, 2) > dMaxMatch Then
                    dMaxMatch = aResults(lResultIndex, 2)
                    lMaxIndex = lResultIndex
                End If
            Else
                aResults(lResultIndex, 2) = 0
            End If
        End If
    Next vListItem

    If dMaxMatch = 0# Then
        Select Case arg_lOutput
            Case 1:     FuzzyMatch = 0
            Case 2:     FuzzyMatch = vbNullString
            Case Else:  FuzzyMatch = CVErr(xlErrNum)
        End Select
    Else
        Select Case arg_lOutput
            Case 1:     FuzzyMatch = Application.Min(1, aResults(lMaxIndex, 2))
            Case 2:     FuzzyMatch = aResults(lMaxIndex, 3)
            Case Else:  FuzzyMatch = CVErr(xlErrNum)
        End Select
    End If

End Function

Using only the original data in columns A and B, you can use this UDF to get the desired results in columns C and D:

enter image description here

In cell C2 and copied down is this formula:

=FuzzyMatch($B2,$B$2:$B$6,COLUMN(A2),2)

In cell D2 and copied down is this formula:

=IFERROR(INDEX(A:A,MATCH(FuzzyMatch($B2,$B$2:$B$6,COLUMN(B2),2),B:B,0)),"-")

Note that they both use the FuzzyMatch UDF.

tigeravatar
  • 26,199
  • 5
  • 30
  • 38
  • Thanks for this, really appreciate it! This actually fits perfectly for another project I'm working. However, not quite exactly what I'm trying to do for this job. I'm trying to match concurrent letters. Rather than just matching occurrences. So in the example above Lemon should equal 0%. Is this code adaptable to do that? – Maverick May 10 '17 at 13:08
  • 1
    @Maverick Even for concurrent letters, the "e" and "n" will give at least a single match. Do you mean that at minimum two consecutive letters have to match? – tigeravatar May 10 '17 at 13:11
  • yeah I just thought that. I'd say at least 2 and probably no more than 5-10, but it would good to be able to adjust that if possible? – Maverick May 10 '17 at 13:14
  • @Maverick I added a new optional argument in the UDF: `arg_lMinConsecutive` and have updated the answer. Please note the extra argument in the formulas, the `2` at the end, which indicates that a minimum of two consecutive characters must match. You can customize this to whatever number you want, as long as it is a positive whole number. If this argument is omitted, the formula will assume a MinConsecutive of 1 (the original behavior). – tigeravatar May 10 '17 at 13:40
  • Slight update to UDF to reduce duplicate code using the new `bMatch` variable (no change in end results, this was just for code optimization) – tigeravatar May 10 '17 at 13:52
  • Thank you for that, works perfectly! Checked out your RPG as well, for both! – Maverick May 10 '17 at 13:54
1

In python you can use Levenshtein distance to get the results. Check out this answer:

Fuzzy string comparison in Python, confused with which library to use

Community
  • 1
  • 1
Chris Rouffer
  • 743
  • 5
  • 14
1

I really did not get the whole logic, but if you need the logic for the 100% here is it:

Option Explicit

Sub TestMe()

    Dim rngCell         As Range
    Dim rngCell2        As Range
    Dim lngTotal        As Long
    Dim lngTotal2       As Long
    Dim lngCount        As Long

    For Each rngCell In Sheets(1).Range("A1:A5")
        For Each rngCell2 In Sheets(1).Range("A1:A5")
            If rngCell.Address <> rngCell2.Address Then
                If InStr(1, rngCell, rngCell2) Then
                    rngCell.Offset(0, 1) = 1
                Else
                    If InStr(1, rngCell2, rngCell) Then
                        rngCell.Offset(0, 2) = Round(CDbl(Len(rngCell) / Len(rngCell2)), 2)
                    End If
                End If
            End If
        Next rngCell2
    Next rngCell

End Sub

Here you go with the pic:

enter image description here

Vityata
  • 42,633
  • 8
  • 55
  • 100
  • Thank you for that, really appreciate your help! I'm trying to match words which have concurrent letters. So If I had Lemon, Lemons and Yellow Lemons in 3 separate rows, I would like to quickly identify which have contain the word Lemon. So in that example each one would match 100%, then I will quickly convert them all to Lemon in order to remove duplicates which are the same just entered in differently. Does that make sense? – Maverick May 09 '17 at 15:20
  • Thanks @Vityata, really appreciate that! Just to confirm, the first column returns 100% matches and the second returns partial matches. Is that right? – Maverick May 09 '17 at 15:45
  • I also have the text I want to test in column D and my references in column A if that makes a difference to you? – Maverick May 09 '17 at 15:50