3

I have an Excel worksheet with substrings inside each cell, delimited by #. For example:

A B C
F #G #H #I #J #K #L M#N#O#P A#B#C#D #E
F #G #H #I N#O#P A#B#C##E

How can I find the 5 most common substrings over the range of cells?

Expected output for the sample data should look like this:

Substring Count
A 2
B 2
C 2
E 2
F 2
Zev Spitz
  • 13,950
  • 6
  • 64
  • 136
  • @Harun24HR It's not returning a count for the individual cell; it's returning a count across the range. Adn eventually I want to paste it as a new worksheet. – Zev Spitz Mar 01 '21 at 06:44
  • So what would be the desired outcome here? – JvdV Mar 01 '21 at 08:15
  • @JvdV Each cell contains a set of substrings, delimited by `#`; and I want to know the most common substrings. Could you clarify what you're asking? – Zev Spitz Mar 01 '21 at 08:29
  • I'd like to know what exactly your desired output looks like so I can try to replicate it. See [ask] a quesiton with an [mcve]. – JvdV Mar 01 '21 at 08:30
  • @JvdV Could you clarify why you're asking? – Zev Spitz Mar 01 '21 at 08:34
  • Sure, it's an interesting question, but it was very minimal. Now you have added some info helping people see/understand exactly what you are after. Would you be interested in a formula solution? – JvdV Mar 01 '21 at 08:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229336/discussion-between-zev-spitz-and-jvdv). – Zev Spitz Mar 01 '21 at 08:41

2 Answers2

5

If one has access to Microsoft365's LET() function, this could be done using a combination of functions:

enter image description here

Formula in E1:

=LET(X,SORT(TRIM(FILTERXML("<t><s>"&SUBSTITUTE(TEXTJOIN("#",,A1:C2),"#","</s><s>")&"</s></t>","//s[.!='']"))),INDEX(SORT(CHOOSE({1,2},UNIQUE(X),MMULT(--(UNIQUE(X)=TRANSPOSE(X)),SEQUENCE(COUNTA(X),,,0))),2,-1),{1;2;3;4;5},{1,2}))

Explanation:

LET() allows the use of re-usable variables inside a formula. So in the above we assigned an array of values and named it "X". The array is pulled by:

=SORT(TRIM(FILTERXML("<t><s>"&SUBSTITUTE(TEXTJOIN("#",,A1:C2),"#","</s><s>")&"</s></t>","//s[.!='']")))

Where:

  • "<t><s>"&SUBSTITUTE(TEXTJOIN("#",,A1:C2),"#","</s><s>")&"</s></t>" - Is used to create a valid XML-string and;
  • "//s[.!='']" - A valid xpath to retrieve all non-empty strings.
  • TRIM() will delete any leading and trailing spaces. SORT() then sorts the array ascending.

If you want to know more about this mechanic to "split" a string into elements using FILTERXML(), you can have a read here.

Now we have a variable, we can use it in the third parameter inside LET():

=INDEX(SORT(CHOOSE({1,2},UNIQUE(X),MMULT(--(UNIQUE(X)=TRANSPOSE(X)),SEQUENCE(COUNTA(X),,,0))),2,-1),{1,2,3,4,5},{1;2})

Using INDEX() we can "slice" rows/columns from an array, where:

  • SORT(CHOOSE({1,2},UNIQUE(X),MMULT(--(UNIQUE(X)=TRANSPOSE(X)),SEQUENCE(COUNTA(X),,,0))),2,-1) - A bit of a complicated construct, but the idea here is that CHOOSE({1,2} allows for a 2D-array where the 1st column is filled with UNIQUE() values from "X" and the 2nd is the count of each unique element in the total array. The count is done using MMULT() where each unique element is counted against the TRANSPOSE()'d "X". Then the 2d-array is sorted on the 2nd column.

With the above out of the way, we can simply use INDEX() to retrieve the 5 rows we need (and both columns) using:

=INDEX(<TheAbove>,{1;2;3;4;5},{1,2})
JvdV
  • 70,606
  • 8
  • 39
  • 70
  • Could you add some description of the different pieces you're using in your answer? – Zev Spitz Mar 01 '21 at 08:52
  • @ZevSpitz, see the edit. I hope it helps. – JvdV Mar 01 '21 at 09:17
  • Thanks, much better. I would note that for me personally, as long as I have access to a programming environment (VBA or Javascript) I probably wouldn't use this technique, as I think it doesn't clearly represent what it's trying to do. – Zev Spitz Mar 01 '21 at 16:54
3

You could use a disconnected ADODB.Recordset to hold the count of each substring. Then, sort the recordset by the count in descending order, get only the top 5 rows into an array, and paste the array into another Excel worksheet.

The benefits of using a recordset here are:

  • Trivial sorting -- set the Sort property

    If the data didn't need sorting (i.e. if I just wanted the counts of the substrings), I would have gone with a Scripting.Dictionary instead of a recordset.

  • Easy conversion to an array using the GetRows method, which can then be used to populate an Excel Range object using its' Value property.

    Ideally I'd use the Excel Range.CopyFromRecordset method, which wouldn't need transposing; but I didn't find any way to limit the number of records in a recordset, or to create a duplicate with only the specified number of records.

Add a reference (Tools -> References...) to Microsoft ActiveX Data Objects Library; use the latest version (generally 6.1).

Then, you could write code like the following:

Dim rs As New ADODB.Recordset
rs.Fields.Append "Substring", adVarChar, 100
rs.Fields.Append "Count", adInteger, 5
rs.Open

Dim sheet As Worksheet
Set sheet = Worksheets("Input")

Dim cell As Range
For Each cell In sheet.UsedRange.Cells
    Dim substrings() As String
    substrings = Split(cell.Value, "#")
    
    Dim substring As Variant
    For Each substring In substrings
        substring = Trim(substring) ' remove spaces before and after substring
        If Len(substring) > 0 Then  ' ignore empty substrings
            rs.Find "Substring = '" & Replace(substring, "'", "''") & "'"
            If rs.EOF Or rs.BOF Then ' substring might not have been found, or the recordset is empty
                rs.AddNew Array("Substring", "Count"), Array(substring, 1)
                rs.Update
            Else
                rs.Update "Count", rs("Count") + 1
            End If
            rs.MoveFirst
        End If
    Next
Next

rs.Sort = "Count DESC, Substring"

Set sheet = Worksheets("Output")
sheet.Range("A1:B5").Value = Application.WorksheetFunction.Transpose(rs.GetRows(5))

Note that you could also generate a long string from the data, using the recordset's GetString method.


References:

Zev Spitz
  • 13,950
  • 6
  • 64
  • 136