1

I recently wrote a question asking for help on how to count the number of occurrences of each and every unique pair of allergies within a population. The solutions I got were great, however I now need to look at combinations of 3+ allergies, and doing it all using Excel tables will take forever.

I decided to write a VBA script to do this, which works great for pairs. It's also much faster since I went back and changed the format of the source data so that each ExceptionID's associated AllergenID's are stored in a single comma-delimited string.

I'm now looking at moving up to a 3D or higher array, and because we don't know how many dimensions we might need to go up to (potentially 10 or 15) I would rather avoid using a series of Case or nested If/Then statements.

My research turned up this article in which I gather that what I'm asking is practically impossible, but I wanted to ask about that OP's statement that

I was thinking it would be possible to do if I could construct the Redim statement at runtime as a string and execute the string, but this doesn't seem possible.

I basically had the same idea. The code below generates a Type Mismatch error, but is there no variation of this that might work? Can we not pass other functions (like join) inside ReDim?

Sub testroutine()

Dim x As Integer, y As Integer 'just a counter
Dim PairCount() As String
Dim AllergenRef As Object 'Object to store a reference to each AllergenID using AllergenKey as key
    Set AllergenRef = CreateObject("Scripting.Dictionary")

For x = 1 To 20
    AllergenRef.Add x, (x * 10) + (2 ^ x) 'dummy data for my dictionary
Next x

Dim N_tuple As Integer
N_tuple = 5 'this value would be provided by a user form at runtime
Dim ArrayDim() As String
ReDim ArrayDim(1 To N_tuple)

For x = 1 To N_tuple

    ArrayDim(x) = "1 to " & AllergenRef.Count

Next x

ReDim PairCount(Join(ArrayDim, ",")) 'This is the line that throws an error

End Sub

This article makes it sound like what I'm doing is possible in Java, but I don't speak any Javanese so I can't really tell how similar this is to what I'm trying to achieve, or if there's a way to apply this method to VBA...

========UPDATE============
Here is a sample of the data I'm working with (in separate columns, I added dashes for clarity)

ExceptionID - ExcAllergens
035 - 100380
076 - 100107,100392,100345,100596,100141,100151,100344
200 - 100123,100200
325 - 100381
354 - 100381,100123
355 - 100381,100123
360 - 100586
390 - 100151,100344,100345,100349
441 - 100380,100368
448 - 100021,100181,100345,100200,100344,100295
491 - 100381
499 - 100333
503 - 100333
507 - 100331,100346,100596,100345,100344,100269,100283

And here is an extract from the Allergen definitions table (Allergen Key is something I just added so as to have smaller numbers to work with, the 6 digit numbers are what is used in our DB.)

AllergenKey - AllergenID - AllergenTag
01 - 100011 - Açai Berry
02 - 100012 - Acetic Acid
03 - 100013 - Agar Agar
04 - 100014 - Agave
05 - 100015 - Alcohol
06 - 100016 - Allspice
07 - 100017 - Ammonium Bicarbonate
08 - 100018 - Amylase
09 - 100019 - Annatto
10 - 100020 - Apple
11 - 100021 - Apple, Raw
12 - 100022 - Apricot
13 - 100023 - Arrowroot
14 - 100025 - Ascorbic Acid
15 - 100027 - Asparagus
16 - 100028 - Avocado
17 - 100029 - Bacterial Culture
18 - 100030 - Baking Powder

Note that there are 6810 exception profiles ranging from 1 to 51 separate allergies (around 4 or 5 on average), and 451 different allergens. Here is the result from my analysis of allergen pairs (btw when I say "Allergen" it also includes dietary preferences like vegetarian):

Top 10 pairs - Pair Count - Allergen 1 - Allergen 2
1 - 245 - Dairy - Gluten
2 - 232 - Eggs - Nuts
3 - 190 - Dairy - Eggs
4 - 173 - Gluten - Oats
5 - 146 - Soy (May Contain) - Soy
6 - 141 - Dairy - Nuts
7 - 136 - Beef - Pork
8 - 120 - Dairy - Soy
9 - 114 - Sesame (May Contain) - Nuts
10 - 111 - Vegetarian 1 - Pork

Community
  • 1
  • 1
MikeG
  • 85
  • 8
  • you cannot add dimensions to an array that is already declared. So array(10,10) cannot be redim(ed) to array(10,10,10) – Sorceri May 04 '16 at 20:50
  • Well my arrays are both declared with empty parentheses, and once the number of dimensions is set by the user it doesn't change, so that's not an issue here – MikeG May 04 '16 at 21:30
  • What you're specifically asking for doesn't work in VBA. Without knowing the structure of the data you're trying to analyze, it's difficult to recommend alternate methods that could work for you. A 15-dimensional array would be very cumbersome to work with, and in most cases is not necessary. Generally, a 3-dimensional array is the most you would need. For example, a 1-dimensional array is simply a list of values, A 2-dimensional array is a table of values, and a 3-dimensional array is several tables of values. Is there some specific reason you need more than a 3-dimensional array? – tigeravatar May 04 '16 at 21:38
  • I'm not specifically required to use multi-dimensional arrays, that's just how I have been able to visualize what I'm trying. If you have an alternative approach I'm all ears! I'm trying to count how many individuals have combinations of allergens (e.g. dairy, soy, & gluten) look for clusters. Because there are 451 allergens currently, a 1D array containing an element for each combination wouldn't really work past 5 or so dimensions. – MikeG May 04 '16 at 21:46
  • @MikeG Can you provide some sample data and expected results? – tigeravatar May 04 '16 at 21:47
  • Sure, I'll pull a sample of the source data and provide the results of my 2D script a little later on. Appreciate your help! – MikeG May 04 '16 at 22:01
  • You could also create a class to hold the information...... – Sorceri May 04 '16 at 22:08
  • Sorceri, I'm not sure what you mean by creating a class... you mean a class module? I don't think that will help much, beyond just making the code look a little neater. Or have I misunderstood what you're referring to? – MikeG May 05 '16 at 00:33
  • I would use a 2D array, the first column is a list of possible nTuples, and the second column the count. Now that you're using VBA, building the list of nTuples is fast and easy. – OldUgly May 05 '16 at 05:05
  • Or, the first n Columns define the tuple, and the n+1 column provides your count. – OldUgly May 05 '16 at 05:11

2 Answers2

1

I wouldnt' worry about the max possible combinations with your medium-sized dataset. You wont be able to make all the possible combinations. You will have many combinations that will not occur in the sample population. Do not try and calculate them all, and then count the occurrences.

Instead, work through your sample population, and create the tuples as data entries on the worksheet 'array'. I suggest using the 3-digit allergen key as identifier numbers, and combine the numbers in tuples a Long(perhaps Decimal may be needed for larger numbers).

The approach I suggest is to combine the tuples as longs which can be easily be decomposed later. Then use the frequency function to count the occurrences of each tuple 'number'. so if there are allergens with keys: 1, 17, 451 - they form a composed long of 1,017,451 (identical to 451, 17, & 1)- we ensure that any tuple has forced order of smallest key to largest key. So the max triple is 449,450,451, and the smallest is 1,002,003. Note that you can NEVER have 3,002,001 as that would duplicate 1,002,003.

The module I had a play with is below: EDIT - for better code

Option Explicit
Option Base 1

Public Function concID(paramArr() As Variant) As Variant
' this function takes an array of numbers and arranges the array into
' one long code number - with order of smallest to largest
' the code number generated has each individual array entry as a 3-digit component

  Dim wsf As WorksheetFunction
  Dim decExp As Integer
  Dim i As Long, j As Long
  Dim bigNum As Variant   ' may need to cast to Decimal??

  Set wsf = WorksheetFunction

  'may use cDec if necessary here??
  For i = 1 To UBound(paramArr)
        'determine the position of the component by multiplying by a multiple of 10^3
        decExp = 3 * (UBound(paramArr) - i)
        bigNum = bigNum + wsf.Small(paramArr, i) * 10 ^ decExp
  Next i
  concID = bigNum

End Function

Public Sub runAllergen()

  Dim ws As Worksheet
  Dim dataRange As Range, tupleRange As Range, uniqueList As Range, freqRange As Range, r As Range
  Dim i As Long, j As Long, counter As Long
  Dim dataArray As Variant, arr As Variant, tempholder As Long
  Dim bigArray(1 To 10 ^ 6, 1 To 1) As Variant ' the array which will hold all the generated combinations from the data
  Dim tuple As Long

  tuple = 3
  'this will come in as a user input.
  Set ws = Sheet1
  Set dataRange = ws.Range("A2:A10001")     'I have 10k people in my dataset, and this is just the allergen data vector

  Application.ScreenUpdating = False  'IMPORTANT for efficiency

  tempholder = 1 'this is the array index which the next combi entry is to be put into bigArray
  dataArray = dataRange.Value 'write entire worksheet column to internal array for efficiency
  For i = 1 To UBound(dataArray)
        'obtain array of allergen values in each data row to obtain tuples from
        arr = Split(dataArray(i, 1), ",")
        If UBound(arr) + 1 >= tuple Then
              'give over the array of row data to make tuples from and write to bigArray
              'return the next available index of bigArray to store data
              tempholder = printCombinations(arr, tuple, bigArray(), tempholder)
        End If
  Next i

  Set r = ws.Range("B2")
  'write entire list of tuples from data population to worksheet for efficiency - MASSIVE performance boost
  r.Resize(tempholder - 1, 1).Value = bigArray
  'copy tuple output over to another column to remove duplicates and get unique list
  Set tupleRange = ws.Range(r, r.End(xlDown))
  tupleRange.Copy
  Set r = ws.Range("D2")
  r.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
    :=False, Transpose:=False

  'remove duplicates from copied tuple output to get a unique list of codes to serve as bins in FREQUENCY function
  ws.Range(r, r.End(xlDown)).RemoveDuplicates Columns:=1, Header:=xlNo
  Set uniqueList = ws.Range(r, r.End(xlDown))
  Application.CutCopyMode = False
  'set the frquency output range which is always 1 more row than the bins array
  Set freqRange = uniqueList.Offset(0, 1).Resize(uniqueList.Rows.Count + 1, 1)
  'get the frequency of each tuple
  freqRange.FormulaArray = "=FREQUENCY(R2C" & tupleRange.Column & ":R" & tupleRange.Rows.Count + 1 & _
                    "C" & tupleRange.Column & _
                    ",R2C" & uniqueList.Column & ":R" & uniqueList.Rows.Count + 1 & "C" & uniqueList.Column & ")"

  Application.ScreenUpdating = True
End Sub

Public Function printCombinations(pool As Variant, r As Long, printVector As Variant, tempPosition As Long) As Long

  'this function writes the data row arrays as tuples/combis to the bigArray,
  'and returns the next available index in bigArray
  Dim i As Long, j As Long, n As Long
  Dim tempholder() As Variant
  Dim idx() As Long

  ReDim tempholder(1 To r)
  ReDim idx(1 To r)

  n = UBound(pool) - LBound(pool) + 1
  For i = 1 To r
        idx(i) = i
  Next i

  Do
        For j = 1 To r
              tempholder(j) = CLng(pool(idx(j) - 1))
        Next j

        'we now have an array of size tuple from the row data, so construct our code number,
        'and write to the next available index in bigArray

        printVector(tempPosition, 1) = concID(tempholder)
        tempPosition = tempPosition + 1

        ' Locate last non-max index
        i = r
        While (idx(i) = n - r + i)
              i = i - 1
              If i = 0 Then
                    'the algorithm has ended with the last index exhausted
                    'return the next available index of bigArray
                    printCombinations = tempPosition
                    Exit Function
              End If
        Wend

        idx(i) = idx(i) + 1
        For j = i + 1 To r
              idx(j) = idx(i) + j - i
        Next j
  Loop

End Function

Initial set-up:

enter image description here

You could also copy-paste over your frequency Range into values etc....

MacroMarc
  • 3,214
  • 2
  • 11
  • 20
  • Thanks Marc, that approach makes so much more sense than the way I had originally started down! Even doing it this way takes a very long time at 5 tuples, I can't imagine what it would have been otherwise. I didn't fully understand everything you were doing in your code, so I ended up writing my own version. Probably less efficient than yours, but at least I can troubleshoot it! Thanks for all your help! – MikeG May 09 '16 at 14:51
  • Hey Mike, I liked the exercise. I'll comment it and clean it up later. i think it's natural for everybody to think of full combination calculation, and then allocate. I've experienced this alternative in the past for similar problems of huge combination/permutations, and its always worth thinking of formulating data first, then decompose - as opposed to 'constructing' data structures when most of the construction won't be used. – MacroMarc May 09 '16 at 15:09
  • Hi Mike, I've tidied the code up a bit, and added comments so you can follow. It is running a lot more efficiently now - much faster. – MacroMarc May 09 '16 at 19:48
0

To expand on my comment, here is some modified code to use an array of arrays based on the provided N_tuple variable. I am having a difficult time imagining a scenario where this wouldn't work for you:

Sub testroutine()

Dim x As Integer, y As Integer 'just a counter
Dim ArrayTemp() As Variant
Dim PairCount() As Variant
Dim AllergenRef As Object 'Object to store a reference to each AllergenID using AllergenKey as key
    Set AllergenRef = CreateObject("Scripting.Dictionary")

For x = 1 To 20
    AllergenRef.Add x, (x * 10) + (2 ^ x) 'dummy data for my dictionary
Next x

Dim N_tuple As Integer
N_tuple = 5 'this value would be provided by a user form at runtime

'Now that you have your N_tuple, redim your paircount array
ReDim PairCount(1 To N_tuple)

'For each N_tuple, create an array and add it to the PairCount array
'Note that you could easily have a 2-dimensional array for a table of values as ArrayTemp
For x = 1 To N_tuple
    ReDim ArrayTemp(1 To AllergenRef.Count)
    PairCount(x) = ArrayTemp
Next x

'Now you have an array of arrays, which can be easily accessed.
'For example: PairCount(2)(3)
'Or if the subarrays are 2-dimensional: PairCount(4)(6, 12)

'This simply loops through the PairCount array and shows the ubound of its subarrays
For x = 1 To UBound(PairCount)
    MsgBox UBound(PairCount(x))
Next x

End Sub
tigeravatar
  • 26,199
  • 5
  • 30
  • 38
  • Interesting idea, I'll have to give some thought to how I can make this fully scalable, but I think you might have the answer there. – MikeG May 04 '16 at 21:59
  • hmm... I just ran a quick nCr computation to see what I'm getting myself into, for n=451 and r = 6 there are 1.36E15 possible combinations. Something tells me my laptop won't be able to handle an array that large regardless of how it's structured. I may have to take the top 20 (or whatever) results at each level and iteratively run the count for each additional dimension, again paring down to the top 20 before proceeding. – MikeG May 05 '16 at 02:20