1

I don't know how to search for this or how to explain without an example.

I'm looking for an excel function that compares cell strings and identifies the portion they have in common.

Conditions

  • Compares 2 or more cells.
  • The common string is identified as long as 2 cells share it. *(ie: if comparing more than 2, it's enough to have 2 cells with that string. Not all the compared cells need to have it.) *
  • The string has at least 3 or more chars to avoid single characters and pairs being flagged.

Example

----------------------------------------------------------------------
|  Pattern  | Page URL 1           | Page URL 2     | Page URL 3     |
----------------------------------------------------------------------
|    test   | example.net/test/    | www.test.com   | www.notest.com |  
----------------------------------------------------------------------
|   q=age   | another.com?q=age    | test.com/q=age | test.com/q=lol |
----------------------------------------------------------------------

Probably obvious by now, but what I'm trying to achieve/analyze is if there are string patterns that are common to large sets of URLs.

(forgive my poor attempt trying to draw a table)

Community
  • 1
  • 1
  • 2
    https://stackoverflow.com/questions/3576211/string-similarity-algorithms – Vityata Jun 06 '18 at 12:41
  • 1
    Unclear..... do you want **all** matching sub-strings or only **one** ?? – Gary's Student Jun 06 '18 at 12:48
  • What do you mean _identifies the portion they have in common_? Do you mean the pattern column are the answers returned by the function? – Brian Jun 06 '18 at 12:56
  • The attempt at the table was not bad. There are tools to help you like [this](https://www.tablesgenerator.com/markdown_tables) – QHarr Jun 06 '18 at 13:01
  • Do you want the longest string common between at least two URLs, or do you want the string (of length at least 3) that can be found in the most URLs, or do you want a balance between the two (e.g., determined by some formula, like the product of the string length and number of occurrences)? – jblood94 Jun 06 '18 at 13:51
  • @Gary'sStudent apologies, I want at least one matching sub-string. – pedroteixeira07 Jun 06 '18 at 22:31
  • @Brian yes, the pattern is the answer returned by the function, in better words as Gary notes: the matching sub-substring. – pedroteixeira07 Jun 06 '18 at 22:32
  • @QHarr many thanks for the tip - precious ;) – pedroteixeira07 Jun 06 '18 at 22:32
  • @jblood94 the second, the string (of length at least 3) that can be found in the most URLs. – pedroteixeira07 Jun 06 '18 at 22:33
  • Apologies guys, I knew it wouldn't be straightforward to explain my goal but now just realized it's really challenging. I think @jblood94 interpretation is the closest: The sub-string (at least 3 characters) that can be found in most URLs. The goal ultimatly is to find sub-string patterns between several URLs but bearing in mind there might be expections when there's no similarity, therefore "most" URLs. Hope this clears a bit a further. – pedroteixeira07 Jun 06 '18 at 22:39
  • 1
    @Vityata thanks for sharing! I had a look and it's not exactly what I need. Rather than "similarity" I want exact character matches. – pedroteixeira07 Jun 06 '18 at 22:41

2 Answers2

1

This doesn't fully answer the question but I think it will give you what you need to get it. Give it a try. Place the following code in a new moudule:

Public Sub FindStrings()
    Dim rng1 As Excel.Range
    Dim rng2 As Excel.Range

    Set rng1 = ActiveSheet.Range("A1")
    Set rng2 = ActiveSheet.Range("A2")

    Dim i As Integer
    Dim j As Integer
    Dim searchVal As String
    For i = 3 To Len(rng2)
        For j = 1 To Len(rng1)
            searchVal = Mid(rng1, j, i)
            If Len(searchVal) < i Then Exit For
            If InStr(1, rng2, searchVal) Then Debug.Print searchVal
        Next j
    Next i
End Sub

In cell A1 put example.net/test
In cell A2 put www.test.com

Result

tes
est
test

UPDATE

I updated the code to search for a minimum of 4 characters instead of 3 (as you mentioned above). Furthermore, I guessed you wouldn't want strings such as www. and .com returned, nor strings with the / or . character. So the code pulls those out as well. Also, it compares every column combination.

Option Explicit
Public Sub CompareStrings()
    Dim Arr As Variant
    Dim i As Integer
    Dim j As Integer
    Dim StartRange As Excel.Range
    Dim SearchRange As Excel.Range
    Dim Counter As Integer
    Dim ComparableRange As Variant
    Dim Comparable As Integer
    Dim Compared As Integer
    Dim SearchVal As String

    Set StartRange = ActiveSheet.Range("A1")

    Counter = 0
    For Each ComparableRange In ActiveSheet.Range("A1:A2")
    Set SearchRange = Range(StartRange.Offset(Counter), Cells(StartRange.Offset(Counter).Row, Columns.Count).End(xlToLeft))
    Arr = Application.Transpose(Application.Transpose(SearchRange.Value))
    Debug.Print "Row " & SearchRange.Row & ":"
        For j = LBound(Arr) To UBound(Arr)
            For i = j + 1 To UBound(Arr)
                For Comparable = 4 To Len(Arr(j))
                    For Compared = 1 To Len(Arr(i))
                        SearchVal = Mid(Arr(j), Compared, Comparable)
                        If InStr(1, SearchVal, ".") = 0 Then
                            If InStr(1, SearchVal, "/") = 0 Then
                                If Len(SearchVal) < Comparable Then Exit For
                                If InStr(1, Arr(i), SearchVal) > 0 Then Debug.Print vbTab & SearchVal
                            End If
                        End If
                    Next Compared
                Next Comparable
            Next i
        Next j
        Counter = Counter + 1
    Next ComparableRange    
End Sub

When comparing test.com/q=age with another.com?q=age You will still get results such as:

q=ag
=age 
q=age 

...though I suspect you only want the third one. The longer the matching strings are the more results you will get. The last results are the ones you will probably want.

Brian
  • 2,078
  • 1
  • 15
  • 28
  • what I'd be looking to get from your example would be "test" – pedroteixeira07 Jun 06 '18 at 22:36
  • Hey there - sorry to ping you here.. I noticed you [suggested an edit](https://stackoverflow.com/review/suggested-edits/19968206) to add "thank you" and "please help" and add an incorrect tag while removing a perfectly appropriate one - please note the [tag:macros] tag has nothing to do with [tag:excel-vba] or [tag:vba]. Please avoid tagging VBA questions with [tag:macros] (see tag info). Also "thank you" and "please help" is typically noise that gets weeded out, not added in. ...thank you! – Mathieu Guindon Jun 08 '18 at 13:52
  • @MathieuGuindon On this question or a different one? The "thank you" was added because there were too few edits to save (just formatting with few character strokes). I don't remember changing tags (I normally don't). I have only edited twice, I think, over the past two days and someone edited before me. Maybe the tag was part of their edit? – Brian Jun 08 '18 at 13:59
  • IDK, I rejected the edit specifically because the [excel-vba] tag was being removed and the incorrect [macros] tag was being added - just wanted to make sure you weren't going on an edit spree to sprinkle [macros] across the [vba] tag... spent quite a lot of time getting rid of that misplaced tag a few years ago. – Mathieu Guindon Jun 08 '18 at 14:03
  • @MathieuGuindon Thank you Mathieu! I'm somewhat familiar with Rubberduck for VBA IDE. I'll keep that in mind. – Brian Jun 08 '18 at 14:30
1

Copy the following code into a module. Read the comments at the top of CommonString for usage.

Option Explicit

Public Function CommonString(rng As Range, iMinLen As Integer, Optional strDelimiter As String = ",") As String
    'Finds the maximum number of cells (iMax) in "rng" that have a common substring of length at least "iMinLen".
    'The function returns a string with the format "iMax: substring1,substring2,substring3..."
    ' where substring1, substring2, etc. are unique substrings found in exactly iMax cells.
    'The output does not include any substrings of the unique substrings.
    'The delimter between substrings can be specified by the optional parameter "strDelimiter".
    'If no common substrings of length at least "iMinLen" are found, "CommonString" will return an empty string.
    Dim blnRemove() As Boolean
    Dim dicSubStrings As Object 'records the number of times substrings are found in pairwise string comparisons
    Dim iCandidates As Integer
    Dim iCol As Integer
    Dim iCurrCommon As Integer
    Dim iCurrLen As Integer
    Dim iMax As Integer
    Dim iMaxCommon As Integer
    Dim iNumStrings As Integer
    Dim iOutCount As Integer
    Dim iRow As Integer
    Dim iString1 As Integer
    Dim iString2 As Integer
    Dim iSubStr1 As Integer
    Dim iSubStr2 As Integer
    Dim lngSumLen As Long
    Dim str1D() As String
    Dim strCandidates() As String
    Dim strOut() As String
    Dim strSim() As String
    Dim strSub As String
    Dim vKey As Variant
    Dim vStringsIn() As Variant

    Set dicSubStrings = CreateObject("Scripting.Dictionary")
    vStringsIn = rng.Value
    iNumStrings = Application.CountA(rng)
    ReDim str1D(1 To iNumStrings)
    ' pull the strings into a 1-D array
    For iRow = 1 To UBound(vStringsIn, 1)
        For iCol = 1 To UBound(vStringsIn, 2)
            iCurrLen = Len(vStringsIn(iRow, iCol))

            If iCurrLen > 0 Then
                iString1 = iString1 + 1
                str1D(iString1) = vStringsIn(iRow, iCol)
                lngSumLen = lngSumLen + iCurrLen
            End If
        Next iCol
    Next iRow
    'initialize the array that will hold the substrings to output
    ReDim strOut(1 To lngSumLen - iNumStrings * (iMinLen - 1))
    'find common substrings from all pairwise combination of strings
    For iString1 = 1 To iNumStrings - 1
        For iString2 = iString1 + 1 To iNumStrings
            strSim = Sim2Strings(str1D(iString1), str1D(iString2), iMinLen)
            'loop through all common substrings
            For iSubStr1 = 1 To UBound(strSim)
                If dicSubStrings.Exists(strSim(iSubStr1)) Then
                    iCurrCommon = dicSubStrings(strSim(iSubStr1)) + 1
                    dicSubStrings(strSim(iSubStr1)) = iCurrCommon
                    If iCurrCommon > iMaxCommon Then iMaxCommon = iCurrCommon
                Else    'add common substrings to the "dicSubStrings" dictionary
                    dicSubStrings.Add strSim(iSubStr1), 1
                    If iMaxCommon = 0 Then iMaxCommon = 1
                End If
            Next iSubStr1
        Next iString2
    Next iString1

    If dicSubStrings.Count = 0 Then Exit Function
    ReDim strCandidates(1 To dicSubStrings.Count)
    'add the candidate substrings to the "strCandidates" array
    'candidate substrings are those found in exactly "iMaxCommon" pairwise comparisons
    For Each vKey In dicSubStrings.keys
        If dicSubStrings(vKey) = iMaxCommon Then
            iCandidates = iCandidates + 1
            strCandidates(iCandidates) = CStr(vKey)
        End If
    Next vKey

    ReDim blnRemove(1 To iCandidates)
    iOutCount = iCandidates
    'keep only the candidate substrings that are not a substring within another candidate substring
    For iSubStr1 = 1 To iCandidates - 1
        If Not blnRemove(iSubStr1) Then
            For iSubStr2 = 1 To iCandidates - 1
                If Not blnRemove(iSubStr2) Then
                    If Len(strCandidates(iSubStr1)) <> Len(strCandidates(iSubStr2)) Then
                        If Len(strCandidates(iSubStr1)) > Len(strCandidates(iSubStr2)) Then
                            If InStr(strCandidates(iSubStr1), strCandidates(iSubStr2)) > 0 Then
                                blnRemove(iSubStr2) = True
                                iOutCount = iOutCount - 1
                            End If
                        Else
                            If InStr(strCandidates(iSubStr2), strCandidates(iSubStr1)) > 0 Then
                                blnRemove(iSubStr1) = True
                                iOutCount = iOutCount - 1
                            End If
                        End If
                    End If
                End If
            Next iSubStr2
        End If
    Next iSubStr1

    ReDim strOut(1 To iOutCount)
    iOutCount = 0
    'add the successful candidates to "strOut"
    For iSubStr1 = 1 To iCandidates
        If Not blnRemove(iSubStr1) Then
            iOutCount = iOutCount + 1
            strOut(iOutCount) = strCandidates(iSubStr1)
        End If
    Next iSubStr1
    'convert "iMaxCommon" (pairwise counts) to number of cells (iMax) by solving the formula:
    '(iMax ^ 2 - iMax) / 2 = iMaxCommon
    iMax = ((8 * iMaxCommon + 1) ^ 0.5 + 1) / 2
    CommonString = iMax & ": " & Join(strOut, strDelimiter)
End Function

Private Function Sim2Strings(str1 As String, str2 As String, iMinLen As Integer) As String()
    'Returns a list of unique substrings common to both "str1" and "str2" that
    ' have a length of at least "iMinLen".
    Dim dicInList As Object
    Dim iCharFrom As Integer
    Dim iLen1 As Integer
    Dim iSearchLen As Integer
    Dim iSubStr As Integer
    Dim strCurr As String
    Dim strList() As String
    Dim vKey As Variant

    iLen1 = Len(str1)
    Set dicInList = CreateObject("Scripting.Dictionary")
    'add common substrings to the "dicInList" dictionary
    For iCharFrom = 1 To iLen1 - iMinLen + 1
        For iSearchLen = iMinLen To iLen1 - iCharFrom + 1
            strCurr = Mid(str1, iCharFrom, iSearchLen)

            If InStr(str2, strCurr) = 0 Then
                Exit For
            Else
                If Not dicInList.Exists(strCurr) Then
                    dicInList.Add strCurr, 0
                End If
            End If
        Next iSearchLen
    Next iCharFrom

    If dicInList.Count = 0 Then
        ReDim strList(0)
    Else
        ReDim Preserve strList(1 To dicInList.Count)
        'output the keys in the "dicInList" dictionary to the "strList" array
        For Each vKey In dicInList.keys
            iSubStr = iSubStr + 1
            strList(iSubStr) = vKey
        Next vKey
    End If

    Sim2Strings = strList
End Function
jblood94
  • 10,340
  • 1
  • 10
  • 15