0

This is the third and final remaining problem to a massive data cleaning task I have been working on for over a year. Thank you Stack Overflow community for helping figure out:

Problem 1- Index multiple columns and Match distinct values....
Problem 2- Count unique values that match ID, optimized for 100,000+ cases.

I'm not 100% sure if the following is achievable in excel, but I'll do my best to describe the data cleaning and organization challenge I'm faced with.

I have a series of data markers/attributes that are in a random order across 24 columns, spanning 500,000+ rows. Image 1 below is an example of what the data looks like in raw form, presented across 12 columns and spanning 22 rows for illustrative simplicity. Columns A through L contain the raw data and Columns M through X represent the desired output.

Image 1:

SUMMARY OF THE TASK: What needs to be accomplished is a series of matching functions that search through all indexed columns (in this case columns A through L) to identify unique values (e.g. 1), search for the value in range (in this case A2:L21 range), identify the adjacent values to the unique value (for value 1, adjacent values are 2 and 13-XR), then output them in a descending sequence from most frequently occurring value to least frequently occurring in each row that contains any of the values in question (in this case, 1 occurs 5 times and is placed in M2 through M6; 2 occurs 3 times and is placed in N2 through N6; and 13-XR occurs 2 times and is placed in O2 through O6).

To clarify, below is a step by step description using colours to illustrate the pattern matching in the raw data (columns A through L) and how these patterns should then presented in the output (columns M through X). I've sectioned off each of the following images into the six patterns that are in the raw data.

Image 2:

The above image is the first pattern that would be identified by the VBA solution. It would identify "1" as a unique value and search through the A:L range for number of instances of "1" (highlighted in blue), then identify all the values that can be found adjacent in the same row: "2" in rows 3, 5, and 6 (highlighted in green); and "13-XR" in rows 4 and 5 (highlighted in pink). This would then need to be done for "2", identifying the adjacent values ("1" and "13-XR"), and then for "13-XR", identifying ("1" and "2" as adjacent values). The output would return the unique values with the most frequently occurring in Column M ("1" occurs 5 times), then the second most occurring in Column N ("2" occurs 3 times), and the third most occurring in Column O ("13-XR" occurs 2 times).

Image 3:

The above is little more complex. The VBA would identify "3" as a unique value, search through the A:L range for other instances of "3" and identify all the values that are adjacent to it (in this case, "4", "7", and "9"). It would then do the same for "4", identifying all adjacent values (only "3"); then for "7", identifying adjacent values ("9", "3", and "12"); then for "9" identifying ("7", and "3"); and finally, for "12" identifying adjacent values (only "7"). Then for each row where any of these values are present, the output would return a "3" in column M (occurring three times) and a "7" in column N (also occurring three times); if counts are equal, they could be presented in ascending fashion A to Z or smallest to largest... or just random, the ordering of equal counts is arbitrary for my purposes. "9" would be returned in column O as it occurs two times, then "4" in column P and "12" in column Q, as they both occur once but 12 is greater than 4.

Image 4:

The above image represents what is likely to be a common occurrence, where there is only one unique value. Here, "5" is not identified in any other columns in the range. It is thus returned as "5" in column M for each row where a "5" is present.

Image 5:

This will be another of the more common occurrences, where one value may be present in one row and two values present in another row. In this instance "6" is only identified once in the range and "8" is the only adjacent value found. When "8" is searched for it only returns one instance of an adjacent value "6". Here, "8" occurs twice and "6" only once, thus resulting in "8" imputed in column M and "6" imputed in column N wherever an "8" or a "6" are present in the row.

Image 6:

Here "10", "111", "112", "543", "433", "444", and "42-FG" are identified as unique values associated with one another in the A:L range. All values except "10" occur twice, which are returned in columns M through S in descending order.

Image 7:

This final pattern is identified in the same manner as above, just with more unique values (n=10).

FINAL NOTES: I have no idea how to accomplish this within excel, but I'm hoping someone else has the knowledge to move this problem forward. Here are some additional notes about the data that might help towards a resolution:

  • The first column will always be sorted in ascending order. I can do additional custom sorts if it simplifies things.
  • Out of the ~500,000 rows, 15% only have one attribute value (one value in column A), 30% have two attribute values (1 value in col A & 1 value in col B), 13% have three attribute values (1 value in col A, B, & C).
  • I have presented small numbers in this example. The actual raw data values in each cell will be closer to 20 characters in length.
  • A solution that does everything except present the patterns in descending order would be absolutely cool. The sorting would be great but I can live without it if it causes too much trouble.

If anything in this description needs further clarification, or if I can provide additional information, please let me know and I'll adjust as needed.

Thanks in advance to anyone who can help solve this final challenge of mine.

ADDENDUM:

There was a memory error happening with the full data set. @ambie figured out the source of the error was adjacent chains (results) numbering in the 1000s (trying to return results across 1000s of columns). Seems the problem is not with the solution or the data, just hitting a limitation within excel. A possible solution to this is (see image below) to add two new columns (ATT_COUNT as column M; ATT_ALL as column Z). ATT_COUNT in Column M would return the total number of unique values that would ordinarily be returned across columns. Only up to the top 12 most frequently occurring values would be returned in columns N through Y (ATT_1_CL through ATT_12_CL). To get around the instances where ATT_COUNT is > 12 (& upwards of 1000+), we can return all the unique values in space delimited format in ATT_ALL (column Z). For example, in the image below, rows 17, 18, 19, and 21, have 17 unique values in the chain. Only the first 12 most frequently occurring values are presented in columns N through Y. All 17 values are presented in space delimited format in column Z.

image 8

Here is a link to this mini example test data.

Here is a link to a mid sized sample of test data of ~50k rows.

Here is a link to the full sized sample test data of ~500k rows.

Community
  • 1
  • 1
sociologix
  • 69
  • 1
  • 8

1 Answers1

2

We don't normally provide a 'code for you service' but I know in previous questions you have provided some sample code that you've tried, and I can see how you wouldn't know where to start with this.

For your future coding work, the trick is to break the problem down into individual tasks. For your problem, these would be:

  1. Identify all the unique values and acquire a list of all the adjacent values - fairly simple.
  2. Create a list of 'chains' which link one adjacent value to the next - this is more awkward because, although the list appears sorted, the adjacent values are not, so a value relatively low down in the list might be adjacent to a higher value that is already part of a chain (the 3 in your sample is an example of this). So the simplest thing would be to assign the chains only after all the unique values have been read.
  3. Map of each unique value to its appropriate 'chain' - I've done this by creating an index for the chains and assigning the relevant one to the unique value.

Collection objects are ideal for you because they deal with the issue of duplicates, allow you to populate lists of an unknown size and make value mapping easy with their Key property. To make the coding easy to read, I've created a class containing some fields. So first of all, insert a Class Module and call it cItem. The code behind this class would be:

Option Explicit

Public Element As String
Public Frq As Long
Public AdjIndex As Long
Public Adjs As Collection

Private Sub Class_Initialize()
    Set Adjs = New Collection
End Sub

In your module, the tasks could be coded as follows:

Dim data As Variant, adj As Variant
Dim uniques As Collection, chains As Collection, chain As Collection
Dim oItem As cItem, oAdj As cItem
Dim r As Long, c As Long, n As Long, i As Long, maxChain As Long
Dim output() As Variant

'Read the data.
'Note: Define range as you need.
With Sheet1
    data = .Range(.Cells(2, "A"), _
                  .Cells(.Rows.Count, "A").End(xlUp)) _
           .Resize(, 12) _
           .Value2
End With

'Find the unique values
Set uniques = New Collection
For r = 1 To UBound(data, 1)
    For c = 1 To UBound(data, 2)
        If IsEmpty(data(r, c)) Then Exit For
        Set oItem = Nothing: On Error Resume Next
        Set oItem = uniques(CStr(data(r, c))): On Error GoTo 0
        If oItem Is Nothing Then
            Set oItem = New cItem
            oItem.Element = CStr(data(r, c))
            uniques.Add oItem, oItem.Element
        End If
        oItem.Frq = oItem.Frq + 1
        'Find the left adjacent value
        If c > 1 Then
            On Error Resume Next
            oItem.Adjs.Add uniques(CStr(data(r, c - 1))), CStr(data(r, c - 1))
            On Error GoTo 0
        End If
        'Find the right adjacent value
        If c < UBound(data, 2) Then
            If Not IsEmpty(data(r, c + 1)) Then
                On Error Resume Next
                oItem.Adjs.Add uniques(CStr(data(r, c + 1))), CStr(data(r, c + 1))
                On Error GoTo 0
            End If
        End If
    Next
Next

'Define the adjacent indexes.
For Each oItem In uniques
    'If the item has a chain index, pass it to the adjacents.
    If oItem.AdjIndex <> 0 Then
        For Each oAdj In oItem.Adjs
            oAdj.AdjIndex = oItem.AdjIndex
        Next
    Else
        'If an adjacent has a chain index, pass it to the item.
        i = 0
        For Each oAdj In oItem.Adjs
            If oAdj.AdjIndex <> 0 Then
                i = oAdj.AdjIndex
                Exit For
            End If
        Next
        If i <> 0 Then
            oItem.AdjIndex = i
            For Each oAdj In oItem.Adjs
                oAdj.AdjIndex = i
            Next
        End If
        'If we're still missing a chain index, create a new one.
        If oItem.AdjIndex = 0 Then
            n = n + 1
            oItem.AdjIndex = n
            For Each oAdj In oItem.Adjs
                oAdj.AdjIndex = n
            Next
        End If
    End If
Next

'Populate the chain lists.
Set chains = New Collection
For Each oItem In uniques
    Set chain = Nothing: On Error Resume Next
    Set chain = chains(CStr(oItem.AdjIndex)): On Error GoTo 0
    If chain Is Nothing Then
        'It's a new chain so create a new collection.
        Set chain = New Collection
        chain.Add oItem.Element, CStr(oItem.Element)
        chains.Add chain, CStr(oItem.AdjIndex)
    Else
        'It's an existing chain, so find the frequency position (highest first).
        Set oAdj = uniques(chain(chain.Count))
        If oItem.Frq <= oAdj.Frq Then
            chain.Add oItem.Element, CStr(oItem.Element)
        Else
            For Each adj In chain
                Set oAdj = uniques(adj)
                If oItem.Frq > oAdj.Frq Then
                    chain.Add Item:=oItem.Element, Key:=CStr(oItem.Element), Before:=adj
                    Exit For
                End If
            Next
        End If
    End If
    'Get the column count of output array
    If chain.Count > maxChain Then maxChain = chain.Count
Next

'Populate each row with the relevant chain
ReDim output(1 To UBound(data, 1), 1 To maxChain)
For r = 1 To UBound(data, 1)
    Set oItem = uniques(CStr(data(r, 1)))
    Set chain = chains(CStr(oItem.AdjIndex))
    c = 1
    For Each adj In chain
        output(r, c) = adj
        c = c + 1
    Next
Next

'Write the output to sheet.
'Note: adjust range to suit.
Sheet1.Range("M2").Resize(UBound(output, 1), UBound(output, 2)).Value = output

This isn't the most efficient way of doing it, but it does make each task more obvious to you. I'm not sure I understood the full complexities of your data structure, but the code above does reproduce your sample, so it should give you something to work with.

Update

Okay, now I've seen your comments and the real data, below is some revised code which should be quicker and deals with the fact that the apparently 'empty' cells are actually null strings.

First of all create a class called cItem and add code behind:

Option Explicit

Public Name As String
Public Frq As Long
Public Adj As Collection
Private mChainIndex As Long
Public Property Get ChainIndex() As Long
    ChainIndex = mChainIndex
End Property
Public Property Let ChainIndex(val As Long)
    Dim oItem As cItem
    If mChainIndex = 0 Then
        mChainIndex = val
        For Each oItem In Me.Adj
            oItem.ChainIndex = val
        Next
    End If            
End Property
Public Sub AddAdj(oAdj As cItem)
    Dim t As cItem

    On Error Resume Next
    Set t = Me.Adj(oAdj.Name)
    On Error GoTo 0
    If t Is Nothing Then Me.Adj.Add oAdj, oAdj.Name
End Sub
Private Sub Class_Initialize()
    Set Adj = New Collection
End Sub

Now create another class called cChain with code behind as:

Option Explicit

Public Index As Long
Public Members As Collection
Public Sub AddItem(oItem As cItem)
    Dim oChainItem As cItem
    With Me.Members
        Select Case .Count
            Case 0 'First item so just add it.
                .Add oItem, oItem.Name
            Case Is < 12 'Fewer than 12 items, so add to end or in order.
                Set oChainItem = .item(.Count)
                If oItem.Frq <= oChainItem.Frq Then 'It's last in order so just add it.
                    .Add oItem, oItem.Name
                Else 'Find its place in order.
                    For Each oChainItem In Me.Members
                        If oItem.Frq > oChainItem.Frq Then
                            .Add oItem, oItem.Name, before:=oChainItem.Name
                            Exit For
                        End If
                    Next
                End If
            Case 12 'Full list, so find place and remove last item.
                Set oChainItem = .item(12)
                If oItem.Frq > oChainItem.Frq Then
                    For Each oChainItem In Me.Members
                        If oItem.Frq > oChainItem.Frq Then
                            .Add oItem, oItem.Name, before:=oChainItem.Name
                            .Remove 13
                            Exit For
                        End If
                    Next
                End If
        End Select
    End With
End Sub
Private Sub Class_Initialize()
    Set Members = New Collection
End Sub

Finally, your module code would be:

Option Explicit

Public Sub ProcessSheet()
    Dim data As Variant
    Dim items As Collection, chains As Collection
    Dim oItem As cItem, oAdj As cItem
    Dim oChain As cChain
    Dim txt As String
    Dim r As Long, c As Long, n As Long
    Dim output() As Variant
    Dim pTick As Long, pCount As Long, pTot As Long, pTask As String

    'Read the data.
    pTask = "Reading data..."
    Application.StatusBar = pTask
    With Sheet1
        data = .Range(.Cells(2, "A"), _
                      .Cells(.Rows.Count, "A").End(xlUp)) _
               .Resize(, 12) _
               .Value2
    End With

    'Collect unique and adjacent values.
    pTask = "Finding uniques "
    pCount = 0: pTot = UBound(data, 1): pTick = 0
    Set items = New Collection
    For r = 1 To UBound(data, 1)
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        For c = 1 To UBound(data, 2)
            txt = data(r, c)
            If Len(txt) = 0 Then Exit For
            Set oItem = GetOrCreateItem(items, txt)
            oItem.Frq = oItem.Frq + 1

            'Take adjacent on left.
            If c > 1 Then
                txt = data(r, c - 1)
                If Len(txt) > 0 Then
                    Set oAdj = GetOrCreateItem(items, txt)
                    oItem.AddAdj oAdj
                End If
            End If
            'Take adjacent on right.
            If c < UBound(data, 2) Then
                txt = data(r, c + 1)
                If Len(txt) > 0 Then
                    Set oAdj = GetOrCreateItem(items, txt)
                    oItem.AddAdj oAdj
                End If
            End If

        Next
    Next

    'Now that we have all the items and their frequencies,
    'we can find the adjacent chain indexes by a recursive
    'call of the ChainIndex set property.
    pTask = "Find chain indexes "
    pCount = 0: pTot = items.Count: pTick = 0
    Set chains = New Collection
    n = 1 'Chain index.
    For Each oItem In items
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        If oItem.ChainIndex = 0 Then
            oItem.ChainIndex = n
            Set oChain = New cChain
            oChain.Index = n
            chains.Add oChain, CStr(n)
            n = n + 1
        End If
    Next

    'Build the chains.
    pTask = "Build chains "
    pCount = 0: pTot = items.Count: pTick = 0
    For Each oItem In items
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        Set oChain = chains(CStr(oItem.ChainIndex))
        oChain.AddItem oItem
    Next

    'Write the data to our output array.
    pTask = "Populate output "
    pCount = 0: pTot = UBound(data, 1): pTick = 0
    ReDim output(1 To UBound(data, 1), 1 To 12)
    For r = 1 To UBound(data, 1)
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        Set oItem = items(data(r, 1))
        Set oChain = chains(CStr(oItem.ChainIndex))
        c = 1
        For Each oItem In oChain.Members
            output(r, c) = oItem.Name
            c = c + 1
        Next
    Next

    'Write the output to sheet.
    'Note: adjust range to suit.
    pTask = "Writing data..."
    Application.StatusBar = pTask
    Sheet1.Range("M2").Resize(UBound(output, 1), UBound(output, 2)).Value = output
    Application.StatusBar = "Ready"
End Sub

Private Function GetOrCreateItem(col As Collection, key As String) As cItem
    Dim obj As cItem

    'If the item already exists then return it,
    'otherwise create a new item.
    On Error Resume Next
    Set obj = col(key)
    On Error GoTo 0

    If obj Is Nothing Then
        Set obj = New cItem
        obj.Name = key
        col.Add obj, key
    End If

    Set GetOrCreateItem = obj

End Function
Public Function ProgressTicked(ByVal t As Long, ByRef c As Long, ByRef p As Long) As Boolean
    c = c + 1
    If Int((c / t) * 100) > p Then
        p = p + 1
        ProgressTicked = True
    End If
End Function
Ambie
  • 4,872
  • 2
  • 12
  • 26
  • 1
    I can't thank you enough for this annotated solution and the logic walk through @ambie. It's really helping me understand what the process should look like. It took me 3 days to write this question out and try to formulate the jumble in my head into something somewhat cohesive; thank you for taking the time to work through it. I'm still an infant with VBA, I haven't got it working in my test file yet but I'll figure it out. In the regular module, am I correct to assume that `Set Sheet1 = Worksheets("Sheet1")` would need to be specified prior to `With Sheet1`? – sociologix Jul 13 '17 at 19:04
  • Excellent. I have it running. Added `Sub uniqueOrder()` and removed the `Set Sheet1 = Worksheets("Sheet1")` I had added and it's working perfectly on my test file. I'll report back on functionality with larger test. – sociologix Jul 13 '17 at 19:13
  • Running with (i7-2600K CPU @ 3.40GHz; 16GB RAM; 64 bit WIN 10; Excel 2016), 10,000 rows and 12 columns of data completes in ~5 seconds. At 200,000 rows it runs into a memory issue, identifying `ReDim output(1 To UBound(data, 1), 1 To maxChain)` in the code. I've looked into some solutions to avoid this error and @arne-larsson 's [workaround solution](https://stackoverflow.com/questions/14396998/how-to-clear-memory-to-prevent-out-of-memory-error-in-excel-vba) "After every 10,000 I made the code save the workbook" seems like it might solve things. I'll keep digging. – sociologix Jul 13 '17 at 20:49
  • If you're having memory issues, it'll most likely be the output array. It could either be that you're exceeding the max array size or your system limits. The workaround would be to iterate the data array in chunks of a certain size (10,000, if that works for you) in the final loop and write the array a chunk at a time. You just need to keep a note of which row you've reached each chunk. If you could provide a link to your data, I'll adjust the sample code for you. – Ambie Jul 14 '17 at 05:18
  • I can't thank you enough for the help. I shall pay it forward in kind. Here is a link to the full ~500,000 rows of data: [test data download](https://ln.sync.com/dl/d5efed730/47esdz6u-y77kpq8k-t7si26fs-zzybuwew) – sociologix Jul 17 '17 at 16:33
  • I've been trying to develop an answer for you but either I've misunderstood your rules for developing the adjacent chains or there's a flaw in your logic. The cause of the out of memory exception is that the adjacent chains can be thousands long. Can you have another look at your logic and see if there's a rule that's missing or needs to be refined? – Ambie Jul 30 '17 at 14:31
  • I'm thinking that the problem could be "blank" cells (blanks that may not be entirely clean) that are being picked up and triggering a mass return of data cells attributed with the errant "blanks". I've put together two squeaky clean datasets that include only cells with data, [one larger](https://ln.sync.com/dl/ea874a410/vffc7c4y-hkwqn3e6-94qp8zra-d3cr5fen) (12 columns & ~500k rows) and [one smaller](https://ln.sync.com/dl/8f1d23d70/7f853ren-9av2jgrj-mdaptcfy-gb72aide) (5 columns & ~50k rows). I'm hoping this is the problem. I'll work through a 10k test manually to double check the the logic. – sociologix Jul 30 '17 at 19:03
  • Ok, I think I've figured out a solution to the issue. After exploring your solution through incrementally more complex data sets, as far as I can tell, your formula is doing what it should. The problem is the complexity of my data, where adjacent chains can be 1000s long (though most will likely be <20). I've added an addendum above in the original post that will hopefully solve things. 1) count the uniques, 2) only output <= top 12 values, 3) output all uniques as space delimited. I have detailed this above with data link and image example. Thanks again @ambie this will change my life. – sociologix Jul 30 '17 at 22:16
  • Ok, I've amended the code. I haven't bothered space delimiting the output but I'm sure you can adjust to suit. – Ambie Aug 01 '17 at 12:11
  • Many many thanks. I've tried the revised code on the three data sets linked above and get different errors with each. With the StackOver_mini_TEST file, it's kicking back a "subscript out of range error" pointing to `Set oItem = items(data(r, 1))`. With the StackOver_TEST_cleaned dataset, it's kicking back an "out of stack space" error, pointing to `BuildChain items, oSubItem, oChain`. With the ..._SMALL_TEST_cleaned, I adjust the ranges and it returns a compile error directed at the final public function. Sorry, I'm feeling pretty out of my depth with this. I'll keep trying to find a solution – sociologix Aug 01 '17 at 18:54
  • It works on your original data. Use that one and let me know how it goes. – Ambie Aug 02 '17 at 02:57
  • Shoot, with the original StackOver_TEST I had posted I'm getting the same "out of stack space" error, pointing to `BuildChain items, oSubItem, oChain` line of code. And it's crashing excel upon exiting the debugger. – sociologix Aug 02 '17 at 06:07
  • That error means there's a recursion problem. I've done one final adjustment to the code (see classes and module). If you're still having problems then it's probably time to to seek some professional help with your data. – Ambie Aug 08 '17 at 17:16
  • This is amazing, thank you @Ambie. It works perfect for the StackOver_TEST_cleaned file where I cleaned up the blank cells. It still reports back an error with the original data once the 'Find chain indexes process' reaches 100%, with a `Object variable or With block variable not set` run-time error 91; pointing to `oChain.AddItem oItem` in the module. I'm ecstatic that it's working with the clean data, it's just amazing. I'll be tinkering with this over the next little while and will be sure to post any updates or revised code. – sociologix Aug 14 '17 at 18:30