3

I have a function that is used to find the information in a Excel worksheet knowing that: - The Key can be in a variable column - Variable fields can be searched Sheets usually have less than a hundred column, but can have anything from a few hundred to 100 000 rows to search. In our biggest files, the function I'm trying to optimize can be used about a million times.

After reading https://fastexcel.wordpress.com/2011/10/26/match-vs-find-vs-variant-array-vba-performance-shootout/

... and finding our function used Find (3 times), I tried using arrays.

This is the code I wrote

Function getInfo(Key As String, NameField As String, NameKey As String, WksName As String) As Variant

On Error GoTo Error

Dim iColumnKEY As Integer
Dim iColumnFIELD As Integer
Dim i As Integer
Dim ListFields, ListKeys As Variant

ListFields = Worksheets(WksName).Range("A1:ZZ1")


i = LBound(ListFields, 2)

'To identify which column contains the Key and which one contains the  
'information we are searching for
Do While iColumnKEY=0 Or iColumnFIELD=0
    If i > UBound(ListFields, 2) Then
        getInfo = "//error\\"

    ElseIf ListFields(1, i) = NameKey Then
        iColumnKEY = i
    ElseIf ListFields(1, i) = NameField Then
        iColumnFIELD = i
    End If
i = i + 1
Loop

Dim iROW As Integer

ListKeys = Worksheets(WksName).Columns(iColumnFIELD)

i = LBound(ListKeys, 1)
Do While iROW=0
    If i > UBound(ListKeys,1) Then
        getInfo = "//error\\"

    ElseIf ListKeys(i,1) = Key Then
        iROW = i
    End If
    i = i + 1
 Loop

 getInfo = Worksheets(WksName).Cells(iROW, iColumnFIELD)

Exit Function

Error:
    getInfo = "//error\\"

End Function

The code works, but is very slow. What am I doing that is slowing things down?

It is not in the code right now, but I did try turning the screen update down, as well as automatic calculation down. I didn't see any difference in speed, which indicates me that the basic algorithm is the main issue.

Also, the article was in 2011. Are arrays still a lot faster than Match/Find?

As a side note: eventually, I'll suggest having a macro that search for a range of Keys in a batch, instead of calling the function for every single key. This would means the first Do... While loop would be done only once for a macro, and only the Do_While for Rows would be run for every key. However, this is not an option in the very short term.

Thanks. Any help or advice would be greatly appreciated.

Jade
  • 39
  • 2
  • For clarity make sure you write `ListFields = Worksheets(WksName).Range("A1:ZZ1").Value`. This guarantees the values of the cells are being stored and not just the `Range` object. – John Alexiou Sep 15 '21 at 12:00
  • Also never use `Integer` as it is a 16-bit value. Use `Long` instead _always_ to avoid any overflow errors. – John Alexiou Sep 15 '21 at 12:02

5 Answers5

1

To make sure I understood you correctly, you have a sheet that has a random column that contains unique keys.

you want to search for one of these keys and return related info (like row no, etc) many times

Approach:

  1. Find the column in which the keys are listed.
  2. Load that column in a dictionary(Indexed).
  3. Use GetInfo function to return info about a specific key if it exists.

Dependencies:

Microsoft scripting runtime (Tools > refrences > Microsoft scripting runtime)

code:

Option Explicit

Private KeyDictionary As Scripting.Dictionary

Sub PopulateDictionary(ByRef WS As Worksheet, ByVal FieldName As Variant)

    Dim i As Long, LastRow As Long, iColumnFIELD As Long
    Dim ListKeys As Variant
    
    iColumnFIELD = WS.Range("A1:ZZ1").Find(FieldName).Column
    
    With WS                                            'Finds the last row in the sheet
        If Application.WorksheetFunction.CountA(.Cells) <> 0 Then
            LastRow = .Cells.Find(What:="*", _
                          After:=.Range("A1"), _
                          LookAt:=xlPart, _
                          LookIn:=xlFormulas, _
                          SearchOrder:=xlByRows, _
                          SearchDirection:=xlPrevious, _
                          MatchCase:=False).Row

        End If
    
        Set KeyDictionary = New Scripting.Dictionary
        
        For i = 1 To LastRow                                                        'populates Dictionary with Key, Row number pair
            If Not (KeyDictionary.Exists(.Cells(i, iColumnFIELD))) Then                   'Make sure the key doesn't already exist(Key values should be unique)
                KeyDictionary.Add .Cells(i, iColumnFIELD).Value, .Cells(i, iColumnFIELD).Row
            End If
        Next i
    End With
End Sub

Function getInfo(ByVal key) As Variant
    
    If KeyDictionary.Exists(key) Then
        getInfo = KeyDictionary(key)                     'if the key exist return row number (or whatever you want to)
    Else
        getInfo = "Null"                            'Else return whatever you want like a msgbox "not Found" , etc
    End If
    
End Function

usage

'populates and returns the row number of key 9500
Sub TestSearch()

    PopulateDictionary ThisWorkbook.Worksheets("Sheet1"), "Key"
    Debug.Print getInfo(9500)

End Sub

Notes: -almost always Use long instead of integer , not much performance difference, but can save you from overflow pitfalls.

-you can add a reference to the range containing the key instead of the row number that would more flexible

-Passing a sheet by reference (Full Ref) is better than passing just its name and avoids a lot of possible problems like the case of multiple workbooks with the same sheet name and makes your code more reusable.

References: Dictionary object

Edit:

I misunderstood your request , thought you wanted to know the best method available.

here's a performance comparison of the four methods:

  • it takes 1325 ms (Or 1.3 seconds) to populate the Dictionary with the unique key list the first time (100,000 Row)
  • it takes 1.79646327708265E-02 ms aka 0.02 ms To search for an item at the end of list (row 99863) using the dictionary object
  • it takes around 10.5 ms To search for the same item with WorksheetFunction.Match
  • it takes around 50 ms To search for the same item with the array method
  • it takes around 80 ms To search for the same item with the Range.find

Result:

Dictionary Method is faster than match -the second-best method of the Four- by over 500 Times!

The reason is that the keys are indexed inside the dictionary, unlike the other methods.

notes:

Office 2016 was used on a 2 cores (3.20 GHz) machine with 8 gigs or ram (Dictionary took about extra 8 Megabytes ram)

All of these searches were done on the same data set (the search was done on only 1 column with 100,000 unique keys ,with the searched for value at the bottom of the list)

The break-even point on whether you should use Match or Dictionary is around 120 searches. if the code will search for more than 120 values then it's better to use the dictionary approach.

Windows API QueryPerformanceCounter was used for High resolution timer.

Code line used to search for Values(not gonna put the full sub)

'Match
WorksheetFunction.Match(Key, ThisWorkbook.Worksheets(1).Range("CW:CW"), 0)

'Find
ThisWorkbook.Worksheets(1).Range("CW:CW").Find(Key).Row

'Array
'Loops through the column till it finds a match
Amr
  • 128
  • 7
  • As mentioned when I posted the bounty, I am mostly interested in what method between match, find, and using a variant array gives the best performance when searching for a value in a range. I see you have chosen a fourth approach of putting the range into a dictionary. I kinda doubt this is more efficient than any of the other 3 methods, but if you can support this solution I would be very interested in seeing that documentation. – eirikdaude Sep 14 '21 at 12:06
  • @eirikdaude do you know these two articles: https://fastexcel.wordpress.com/2011/10/26/match-vs-find-vs-variant-array-vba-performance-shootout/ and https://fastexcel.wordpress.com/2014/08/25/vba-searching-shootout-between-worksheetfunction-match-performance-pros-and-cons/. They are quite old - but I suppose they might be still valid – Ike Sep 15 '21 at 09:54
  • 1
    They are (likely) still valid, however, the method of using a dictionary (or a set of dictionaries) is by far faster for large data sets - assuming of course that you create the dictionaries only once. See https://stackoverflow.com/a/67743112/7599798 – FunThomas Sep 15 '21 at 16:49
0

In your code you never use iColumnKEY

I think this is what you are actually after:

Function getInfo(key As String, NameField As String, NameKey As String, WksName As String) As Variant

    Dim keyCol As Variant, fieldCol As Variant, keyRow As Variant
    Dim errMsg As String

    getInfo = "//error\\"

    With Worksheets(WksName)
        With Intersect(.UsedRange, .Columns("A:ZZ")) ' <--| reference a range in passed worksheet cells belonging to columns "A" to "ZZ" from worksheet first used row to last used one and from worksheet first used column to last used one
            MsgBox .Address
            fieldCol = Application.Match(NameField, .Rows(1), 0) '<--| look for passed 'NameField' in referenced range
            If IsError(fieldCol) Then
                errMsg = " :field column '" & NameField & "' not found"
            Else
                keyCol = Application.Match(NameKey, .Rows(1), 0) '<--| look for passed 'NameKey' in referenced range
                If IsError(keyCol) Then
                    errMsg = " :key column '" & NameKey & "' not found"
                Else
                    MsgBox .Columns(keyCol).Address
                    keyRow = Application.Match(key, .Columns(keyCol)) '<--| look for passed 'key' in referenced range 'NameKey' column
                    If IsError(keyRow) Then
                        errMsg = " :key '" & key & "' not found in column '" & NameKey & "'"
                    Else
                        getInfo = .Cells(keyRow, fieldCol) '<--| get referenced range "item"
                    End If
                End If
            End If
            If errMsg <> "" Then getInfo = getInfo & errMsg
        End With
    End With
End Function
user3598756
  • 28,893
  • 4
  • 18
  • 28
  • Thanks! I made some improvements last night based on the previous comments. – Jade Sep 17 '16 at 12:52
  • You are welcome. You may want to give this a try and let me know. Good coding! – user3598756 Sep 17 '16 at 12:54
  • @Jade, did you get through it? – user3598756 Sep 24 '16 at 06:43
  • After a while, I revisited the issue. Changing the program as suggested above did provide some improvement, but it was marginal. The main issue is that the actual function is called many times (can be 100,000 or more). So I looked at ways to reduce the amount of search. The first thing I did was to create global variable for the worksheet, Field and Name. – Jade Dec 06 '16 at 19:25
0

I see that in your loop you are doing a UBound() evaluation every time. This is not needed.

The following should be faster than a Do While loop. Notice that the array returned by Range().Value has always a lower bound of one. No need to call LBound()

Also, find where the last data exists in the column and restrict the loop to that range. I do this with .End(xlUp)

Dim ListKeys() as Variant
Dim iROW As Long, nRows as Long

nRows = Worksheets(WksName).Cells(Worksheets(WksName).Rows.Count, iColumnFIELD).End(xlUp).Row

ListKeys = Worksheets(WksName).Cell(1, iColumnFIELD).Resize(nRows,1).Value

For i=1 To nRows
    If ListKeys(i,1) = Key Then
        iROW = i
        Exit For
    End If
Next i
John Alexiou
  • 28,472
  • 11
  • 77
  • 133
0

not an answer but a radically different approach, since im from data-science background i use these methods for fast searching any data in a database which are few GB in size, in your case excel for example. this approach can be parallelized based on number of CPUs in your system. and uses python framework Pandas, which has very big community incase you need support, VB has limited community.

also read this before judging this answer https://corporatefinanceinstitute.com/resources/knowledge/other/transitioning-from-excel-to-python/

i expect criticism for this answer , OP asked this but you are giving this blah. but if you want faster development times for ever changing business needs you need something fast, and easy to maintain. python makes it easy, pandas makes it fast.

to get started read this.https://towardsdatascience.com/read-excel-files-with-python-1000x-faster-407d07ad0ed8

i will mention the pipeline here however. see very few lines of code!!! finish work faster, go home early.

  1. import the excel file as csv

     import pandas as pd
     dataframe=pd.read_excel("file.xlsx")
     item=dataframe[dataframe["Order ID"]==886714971] #condition based searching in excel 
    

note "Order ID" is just any arbitary column and you can use SQL like logic here which resembles match/find in VBA.

for speed reference iterating 1,000,000 rows took 0.03 seconds, which means a transaction speed of 30 TPS. use https://modin.readthedocs.io/en/latest/ to scale up that speed linearly with number of cores in cpu.

nikhil swami
  • 2,360
  • 5
  • 15
-1

To find out what parts of the code are the slowest, you can use Timer:

Dim t as Single
t = Timer
' part of the code
Debug.Print CDbl(Timer - t) ' CDbl to avoid scientific notation

Using .Value2 instead of .Value should help a bit:

ListFields = Worksheets(WksName).Range("A1:ZZ1").Value2

Searching for the key and field in two separate loops should be a bit faster because there will be less comparisons. Also, I am not sure if it will be a bit slower or faster, but you can iterate even multi-dimensional arrays:

Dim i As Long, v ' As Variant
i = 1

For Each v in ListFields 
    If v = NameKey Then
        iColumnKEY = i
        Exit For
    End If
    i = i + 1
Next
Slai
  • 22,144
  • 5
  • 45
  • 53
  • 1
    Excel uses 20-bits for rows (MAXROW = 2^20 = 1048576) and `Integer` is a 16-bit number. The code above is going to error out if more than 65536 rows are used. Please use _always_ `Long` for integer numbers in VBA which is the native 32-bit integer used by the CPU. – John Alexiou Sep 15 '21 at 14:05