2

I am working with alphanumeric data from a mainframe. Due to the nature of the access point, the GetString method is used within a webbrowser interface to pull data from the mainframe. I am refactoring my code as well as older code to make use of data structures instead of merely range objects, as range object code takes far longer with large data sets.

As a part of general optimization practice, I run all large data set macros with Application.ScreenUpdating = False and Application.Calculation = xlCalculationManual active. To time it, I use QueryPerformanceCounter with a DoEvents after using the Counter in conjunction with the statusbar, so that it provides me the time it takes to complete a particular macro. The QueryPerformanceCounter is located in a Class Module and has played no direct role in executing the domain logic / business logic of my code.

For instance, I recently refactored code that pulled 10,000 or so strings from the mainframe screen and placed them into a worksheet via a loop. When refactored into a datastructure loop, the code takes around 70 seconds when shucking the strings into an array. The code is also more portable, in that those strings could as easily be shifted/placed to a dictionary for sorting or a collection for parsing. I am therefore switching all my VBA code from range-based to datastructures, and this is the lead-in/background for my question.

I came across some older code during an analysis project that has some interesting logic for pulling content from the mainframe. In essence, the code pulls content from the server in this layout form:

Raw Data Pulled From Server Into Excel Sheet

And then parses the the content into this form in an excel sheet using Worksheet/Cell logic as a framework:

Data Parsed from Server into Excel Sheet

The code, sans the login/access logic as well as sans subroutine declarations, is as follows:

Sub AcquireData()

    CurrentServerRow = 13

    WhileLoopHolder = 1

    If Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)) <> "" Then

        NewWorksheetLine_Sub

    End If

    Do While WhileLoopHolder = 1

        If CurrentSession.Screen.Getstring(CurrentServerRow, 9, 1) = "-" Then

            If Trim(CurrentSession.Screen.Getstring(CurrentServerRow + 1, 15, 1)) <> "" Then

                NewWorksheetLine_Sub

            End If

        ElseIf Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)) = "" Then

            If Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14)) <> "" Then
                Cells(WorksheetRow, ValueSets) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14))
                ValueSets = ValueSets + 1
            End If

        Else

            If CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1) = "" Then

                Cells(WorksheetRow, WorksheetColumn) = "X"

            Else

                Cells(WorksheetRow, WorksheetColumn) = CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1)

            End If

            Cells(WorksheetRow, WorksheetColumn + 1) = CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)
            Cells(WorksheetRow, WorksheetColumn + 2) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 17, 39))
            Cells(WorksheetRow, ValueSets) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14))
            WorksheetColumn = WorksheetColumn + 3
            ValueSets = ValueSets + 1

        End If

        CurrentServerRow = CurrentServerRow + 1

        If CurrentServerRow > 41 Then

            WhileLoopHolder = 0

        End If

    Loop

End Sub

Sub NewWorksheetLine_Sub()

        WorksheetRow = WorksheetRow + 1
        WorksheetColumn = 1
        ValueSets = 10

End Sub

This code is nested in a loop within another program, and thereby pulls thousands of lines and organizes them neatly. It also takes hours and wastes valuable time that could be used analyzing the data acquired from the server. I managed to refactor the basic code into a data structure, and used my learning to refactor other code as well. Unfortunately, I refactored this particularly code incorrectly, as I am unable to mimic the business logic correctly. My snippet is as follows:

Sub AcquireData()
'This code refactors the data into a datastructure from a range object, but does not really capture the logic.
'Also, There is an error in attempting to insert a variant array into a collection/dictionary data structure.


CurrentServerRow = 13

ReDim SourceDataArray(10)

WhileLoopHolder = 1

If Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)) <> "" Then

    NewWorksheetLine_Sub

End If

Do While WhileLoopHolder = 1

    If CurrentSession.Screen.Getstring(CurrentServerRow, 9, 1) = "-" Then

        If Trim(CurrentSession.Screen.Getstring(CurrentServerRow + 1, 15, 1)) <> "" Then

            NewWorksheetLine_Sub

        End If

    ElseIf Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)) = "" Then

        If Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14)) <> "" Then

            ReDim Preserve SourceDataArray(ValueSets)
            SourceDataArray(ValueSets) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14))

            ValueSets = ValueSets + 1
            ReDim Preserve SourceDataArray(ValueSets)
        End If

    Else

        If CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1) = "" Then

            ReDim Preserve SourceDataArray(WorkSheetColumn)
            SourceDataArray(WorkSheetColumn) = "X"

        Else

            SourceDataArray(WorkSheetColumn) = CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1)

        End If

        SourceDataArray(WorkSheetColumn + 1) = CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)

        SourceDataArray(WorkSheetColumn + 2) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 17, 39))

        SourceDataArray(ValueSets) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14))

        WorkSheetColumn = WorkSheetColumn + 3
        ValueSets = ValueSets + 1
        ReDim Preserve SourceDataArray(ValueSets)

    End If

    CurrentServerRow = CurrentServerRow + 1

    If CurrentServerRow > 41 Then

        WhileLoopHolder = 0

    End If

Loop

End Sub

Sub NewWorksheetLine_Sub()

SourceIndexAsString = SourceCollectionIndex

   SourceDataCollection.Add SourceDataArray(), SourceIndexAsString

    SourceCollectionIndex = SourceCollectionIndex + 1
    WorkSheetColumn = 1
    ValueSets = 10

End Sub

I have considered that in order to use the same type of "cell" logic, I may want to use arrays nested within an array, and then transpose that to a worksheet. However, I have been thus far unsuccessful in implementing any such solution these past few weeks. Also, there may be a superior method of refactoring the logic to a datastructure form. However, I have been unable to determine how to do so successfully.

To summarize, my questions are as follows: In what way(s) can I shift "cell"-based logic to data structure logic? What is the best data structure for doing so? In this particular case, how can I implement the use of data structure logic with the this business logic?

Community
  • 1
  • 1
Scott Conover
  • 1,421
  • 1
  • 14
  • 27
  • At first glance it looks like you could split it into 3 simple functions 1)Fill down blanks 2) Transpose range (may need to break into smaller ranges) 3) Delete unwanted rows. –  Sep 06 '12 at 18:32
  • I do not think that would be effective, since I am attempting to parse thousands of rows into an array form from a range-based form. The range-based form works fine - I am merely attempt to refactor to array logic to speed processing before transposing/moving the array to the spreadsheet. How would filling down blanks, transposing the ranges when I am avoiding range objects for the loop and deleting unwanted rows impact this process? – Scott Conover Sep 06 '12 at 18:37
  • If it takes hours to parse out data from the mainframe, wouldn't it be better to get the mainframe to output the data in the format required first (eg using SQL)? –  Sep 06 '12 at 19:18
  • No mainframe access - thus the nature of the access point being a screen scrape. Using a PDO SQL Query with Read Access to the Mainframe Database would be great and fast - but it is not available, especially given the distribution of this particular program. – Scott Conover Sep 06 '12 at 19:21
  • 2
    If you take the original code run it with screenupdating off and calculation set to manual, how much faster is it? It it literally takes hours, then it seems like the delay is in the "scraping" and not in dealing with the scraped data. What's happening in "GetString()" ? – Tim Williams Sep 06 '12 at 21:08
  • Tim, I have `Application.ScreenUpdating = False` and `Application.Calculation = xlCalculationManual` on all my revised applications; I will add that to my post - I consider it a part of [rudimentary optimization](http://www.cpearson.com/excel/optimize.htm) for large data sets. The scraping frequently processes approximately 20-50 entries per second when optimized with the use of data structures, if not faster in some cases, such as my pull of 10,000 single string entries. The `GetString()` function appears to be a method alike that of the System Extra! library set. – Scott Conover Sep 06 '12 at 21:18

3 Answers3

1

Some of the use of ReDim Preserve seems problematic.

If CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1) = "" Then
  ReDim Preserve SourceDataArray(WorkSheetColumn)
  SourceDataArray(WorkSheetColumn) = "X"

So if WorksheetColumn had the value 1 we would have reduced SourceDataArray to being one entry in size and discarded all of the data in the higher locations in the array.

Else
  SourceDataArray(WorkSheetColumn) = CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1)
End If

SourceDataArray(WorkSheetColumn + 1) = CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)

SourceDataArray(WorkSheetColumn + 2) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 17, 39))

Now we are potentially looking at entries in SourceDataArray which don't exist (i.e. when the If branch above was followed rather than the Else branch) and we should get a "Subscript out of range" error

ReDim Preserve only retains the data for array elements which make sense with the new array size. So if we have ReDim a(10) and then later have ReDim Preserve a(5) (and assume that arrays start at element 0 - i.e. no Option Base 1) then a(5) through a(9) now are inaccessible and the data they contained is lost

barrowc
  • 10,444
  • 1
  • 40
  • 53
  • Barrowc, yes, your point is well made. An offset function of some sort may be appropriate to implement with the ReDim Preserve, in order to ensure that the resizing takes place effectively. Notably, this code resides within a loop set, and at the conclusion of that loop I attempt to pass the entire set of parsed data for that array to a collection. The intent is that additional "columns" are added when more data is present during the loop, and that when a new "row" is added, I attempt to shuck the data into a collection and begin again. There are likely better approaches however - or a fix. – Scott Conover Sep 07 '12 at 14:24
1

To refactor the code that uses cell references into an array you need to use a 2 dimensional array.
Cell references are 1 based, so you should stick to that in your array too.

You can copy Ranges to and from arrays using the Range.Value property

' Range to array
Dim a as Variant
a = Range("A1:J100").Value

will result in a being a variant array of size 1 To 100, 1 To 10

' Array to Range
Dim a(1 To 100, 1 To 10) as Variant
' populate a
' ...
' Put a into a range
Range("A1:J100").Value = a

These two code snippets result in the same output, but the second runs much faster

Dim r as Long, c as Long
For r = 1 To 1000
For c = 1 To 100
    Cells(r, c) = r * c
Next c, r


Dim r as Long, c as Long
Dim a() as Variant 
Redim a(1 To 1000, 1 To 100)   
For r = 1 To 1000
For c = 1 To 100
    a(r, c) = r * c
Next c, r
Range("A1:CV1000") = a

ReDim Preserve is a relatively expensive operation, so it's faster to ReDim in chunks

Rather than this

Redim a(1 To 10, 1 To 1)
For 1 = 1 to 100000
    Redim Preserve a(1 To 10, 1 To i)
    a(i) = SomeValue
Next

Do this instead

Redim a(1 To 10, 1 To 1000)
For 1 = 1 to 100000
    If i > UBound(a) Then
        Redim Preserve a(1 To 10, 1 To UBound(a) + 1000)
    End If
    a(i) = SomeValue
Next
Redim Preserve a (1 To 10, 1 To i - 1)

Redim Preserve can only change the last dimension of a multi dimensional array.

Eg This works

Redim a(1 to 100, 1 To 10)
Redim Preserve a(1 to 100, 1 To 20)

This does not work

Redim a(1 to 100, 1 To 10)
Redim Preserve a(1 to 200, 1 To 20)

Usually when working with arrays representing ranges, its the number of rows that varies most. This presents a problem, since the Range.Value array is (1 To Rows, 1 To Columns)

A work around is to actually dimension your array (1 To Columns, 1 To Rows). Redim number of rows as required, then Transpose into the destination range

Dim r As Long, c As Long
Dim a() As Variant
ReDim a(1 To 100, 1 To 200)
For r = 1 To 1000
For c = 1 To 100
    If r > UBound(a, 2) Then
        ReDim Preserve a(1 To UBound(a, 1), 1 To UBound(a, 2) + 200)
    End If
    a(c, r) = r * c
Next c, r
Range("A1:CV1000") = Application.Transpose(a)

If you need to vary both dimensions, to change the first dimension will require creating a new array of the required size and copying the data from the old array to the new one. Again, redim like this in chunks to avoid too many redim's

One last thing: you don't seem to Dim your variable (unless you've just left this part out of you post). I would reccomend you use Option Explicit and Dim all your variables. This helps to avoid data type mistakes, and also avoids using Variant for everything. Variants are fine when you need then, but when you don't, other data types are usually faster.

chris neilsen
  • 52,446
  • 10
  • 84
  • 123
  • Chris, a few clarifications: I dim all my variables and declare `Option Explicit`; I require Option Explicit in order to ensure that I have defined all my variables at compile time. The variable declaration code resides in the declaration section above the subroutine, which is not localized so I can focus on testing the data structure switch-over, after which I can refactor to scope my variables properly. I use the `Range.Value` in a subroutine after this code. I did not include code I considered extraneous to the issue, which may have been an error in this case - despite the question length. – Scott Conover Sep 07 '12 at 14:09
  • Also, I think that I will need to look at my question over the weekend and consider shortening it for readability. The reason being that my core issue is not basic optimization of range values, nor am I redimming in chunks, thus permitting me to lose any values stored within the array and shucking the values before redimming to the spreadsheet. I am looping through a series of records on a mainframe vis a vis a web browser interface, attempting to parse then place that information in an array, and then shucking that entire range to the spreadsheet. I do not think chunking would be optimal. – Scott Conover Sep 07 '12 at 14:12
  • I think the core issue for me is factoring in both the use of array logic, as well as the business logic/domain logic of the original coding approach of range-based logic. In a way, I believe you have answered the question as to the issue of the array. I think that there are essentially two ways to approach sizing arrays in VBA - resizing as you go or declaring at the beginning and trimming at the end. In this case, I should probably get an idea of how potentially large such a data set can be, size it substantially above that, and then trim at the end. The business logic is the tricky part. – Scott Conover Sep 07 '12 at 14:15
  • @JackOrangeLantern assuming you are in fact using `Option Explicit` you are causing similar difficulties by using public variables everywhere for everyone reading the code – enderland Sep 12 '12 at 02:57
  • Enderland, can you explain what you mean - in what way do public variables conflict with Option Explicit? Also, These are not public between modules, just public in the sense that the variables are declared in the declarations sections. As for Option Explicit, I use it with nearly every module of code I write - and I use it with this one. It is essential to checking conflicting variable errors at compile time. Currently, I am focused on the business logic and will post a solution once I implement my new changes over the next week. I am refactoring other macros as well. – Scott Conover Sep 12 '12 at 14:11
0

Once I spent a few weeks refactoring other macros from range-based logic to abstracted data structure logic, the answer hit me once I returned to this macro. If I am merely mimicking the range logic so as to more quickly complete the macro, then I need only fill the array such that it matches the range once it is transposed. This means that I do not need to trim the array or in any way manipulate its form - I only need to fill the data structure in array form, and then transpose it to the spreadsheet. I can also make alternative use of the data once the array is filled up.

Here is the solution code:

Sub AcquireData()

'The array 'MyArray' was dimensioned as a dynamic array in the declarations section at the top of the module.
'Redim the array to a big 2 dimensional array that fits the needs of the data/macro.
ReDim MyArray(1 To 20, 1 To 20000)

'From here on, simply mimic the logic of the range macro... [i]
CurrentServerRow = 13

WhileLoopHolder = 1

If Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)) <> "" Then

    NewWorksheetLine_Sub

End If

Do While WhileLoopHolder = 1

    If CurrentSession.Screen.Getstring(CurrentServerRow, 9, 1) = "-" Then

        If Trim(CurrentSession.Screen.Getstring(CurrentServerRow + 1, 15, 1)) <> "" Then

            NewWorksheetLine_Sub

        End If

    ElseIf Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)) = "" Then

        If Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14)) <> "" Then

            '[i] ... except, move the values into the array in Column, Row logic form.
            MyArray(ValueSets, WorksheetRow) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14))
            ValueSets = ValueSets + 1
        End If

    Else

        If CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1) = "" Then

            MyArray(WorksheetColumn, WorksheetRow) = "X"

        Else

            MyArray(WorksheetColumn, WorksheetRow) = CurrentSession.Screen.Getstring(CurrentServerRow, 5, 1)

        End If

        MyArray(WorksheetColumn + 1, WorksheetRow) = CurrentSession.Screen.Getstring(CurrentServerRow, 9, 7)
        MyArray(WorksheetColumn + 2, WorksheetRow) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 17, 39))
        MyArray(ValueSets, WorksheetRow) = Trim(CurrentSession.Screen.Getstring(CurrentServerRow, 58, 14))
        WorksheetColumn = WorksheetColumn + 3
        ValueSets = ValueSets + 1

    End If

    CurrentServerRow = CurrentServerRow + 1

    If CurrentServerRow > 41 Then

        WhileLoopHolder = 0

    End If

Loop

ArrayToWorkSheet_Sub

End Sub

Sub NewWorksheetLine_Sub()

    WorksheetRow = WorksheetRow + 1
    WorksheetColumn = 1
    ValueSets = 10

End Sub

'When finished with the loop, push the array to the worksheet, and transpose it to provide the correct column to row relationship in the spreadsheet.
Sub ArrayToWorkSheet_Sub()

Dim ArrayLimit As Long

Dim LastCell As Long

Dim MyRange As Range

'This level of precision in setting the range appears unnecessary, but in theory I think it could speed up tranposing the array - [ii]
'[ii]but that is just speculation. Performance improvements for the tranposition appear to be minor, perhaps due to the fact that [iii]
'[iii]most - if not nearly all - of the intense computations occur earlier.
With Sheets("Sheet2")

ArrayLimit = UBound(MyArray, 2)

LastCell = ArrayLimit + 1

Set MyRange = .Range("A2:S" & LastCell)

MyRange = WorksheetFunction.Transpose(MyArray)

End With

End Sub

While both Application.ScreenUpdating = False and Application.Calculation = xlCalculationManual are invaluable in reducing macro runtime, I have had very positive experiences with combining those two lines with the use of abstracted data structures. It appears that data structures, in certain cases, appear to help in optimizing performance, especially where extensive line by line data extraction is involved in the macro process.

Scott Conover
  • 1,421
  • 1
  • 14
  • 27