2

I have two datasets that I need to compare and extract a match from. I have a composite key from 5 columns in each dataset, end a 6th column i need to extract. The columns are composed of text, date and integers. Both sets are slightly under 500k rows.

Currently I use a for loop in table a and loop through table b. Compare the rows with an if statement with the and argument to get the composite key.

Sub ArraySearch()

    Dim Main As Long
    Dim Search As Long
    Dim arrData() As Variant
    Dim arrSource As Variant

    arrData = Sheets("Sheet1").Range("H3:M500000").Value
    arrSource = Sheets("Ark1").Range("A3:H500000").Value

    Main = 1
    Search = 1

    For Main = 1 To UBound(arrSource, 1)

        For Search = 1 To UBound(arrData, 1)

            If arrSource(Main, 3) = arrData(Search, 1) And _
                arrSource(Main, 4) = arrData(Search, 2) And _
                arrSource(Main, 1) = arrData(Search, 3) And _
                arrSource(Main, 2) = arrData(Search, 4) And _
                arrSource(Main, 5) = arrData(Search, 5) _
            Then
                arrSource(Main, 8) = arrData(Search, 6)
                Exit For
            End If

        Next
    Next

    Sheets("Sheet2").Range("A3:H500000") = arrSource

End Sub

The fastest way so far is to load both tables into an array and do an in memory loop.

This is taking for ever. We are talking about hours not minutes.

Are there any methods that will increase the speed? Or do I need to use some other programs? (load it into a database and use SQL, use visual studio with normal VB.net, SSIS)

I was hoping this could be done in VBA, so any pointers would be much appreciated.

EDIT

Would hashing the 5 column key improve speed, or is it the share volume of rows that has to be iterated that creates the lag?

Community
  • 1
  • 1
Sweetspot
  • 91
  • 1
  • 9

4 Answers4

5

The fastest way to compare two lists is to add values to Dictionary based on a common key. The Dictionary is optimized to search for keys and will return a value based on the key much faster then you can iterate through an array.

Sub DictionarySearch()
    Dim dict
    Dim key As String
    Dim x As Long
    Dim arrData() As Variant
    Dim arrSource As Variant

    Set dict = CreateObject("Scripting.Dictionary")

    arrData = Worksheets("Sheet1").Range("H3:M500000").Value
    arrSource = Worksheets("Ark1").Range("A3:H500000").Value

    For x = 1 To UBound(arrData, 1)
        key = arrData(x, 1) & ":" & arrData(x, 2) & ":" & arrData(x, 3) & ":" & arrData(x, 4) & ":" & arrData(x, 5)
        If Not dict.Exists(key) Then dict.Add key, arrData(x, 6)

    Next

    For x = 1 To UBound(arrSource, 1)
        key = arrSource(x, 3) & ":" & arrSource(x, 4) & ":" & arrSource(x, 1) & ":" & arrSource(x, 2) & ":" & arrSource(x, 5)
        If dict.Exists(key) Then arrSource(x, 8) = dict(key)
    Next

    Sheets("Sheet2").Range("A3:H500000") = arrSource
End Sub
  • Hmm, I've never known this method. Could you try to use Dictionary to answer [my question](http://stackoverflow.com/q/38530726/3397819)? I'm really interested in knowing it so that I can learn it. I appreciate your help. – Anastasiya-Romanova 秀 Jul 26 '16 at 09:28
  • Nice one. I'll give it a go and see how it performs. Thanx. – Sweetspot Jul 26 '16 at 09:29
  • I'm referring to [this question of mine](http://stackoverflow.com/q/38530726/3397819). I did answer it by myself but I'm not satisfied yet. – Anastasiya-Romanova 秀 Jul 26 '16 at 09:31
  • I will look at it. You might be interested in this article [EXCEL VLOOKUP VS INDEX MATCH VS SQL VS VBA](http://analystcave.com/excel-vlookup-vs-index-match-vs-sql-performance/). It compares several lookup methods including dictionaries, SQL,match & index, vlookup and DOUBLE TRUE VLOOKUP. I find the article is pretty interesting. I never seen the DOUBLE TRUE VLOOKUP anywhere else. –  Jul 26 '16 at 09:37
  • Holy crap! Seem like you'r kung fu is stronger then my kung fu. Took less than 15 sec to execute! – Sweetspot Jul 26 '16 at 09:37
2

Not a full answer, but an idea worth a try. In this answer of mine to my own question I use some speedup tricks like using .Value2 rather than the default property (.Value) and assigning vbNullString instead of Zero Length String ("") to the elements of the array that has found its match to make Excel do less processing. Maybe you can use Heap's algorithm like this answer though I'm not so sure.

Community
  • 1
  • 1
1

Welcome to the fantastic world of performance improvement :-)

Let me explain you what you are doing: You are taking two datasets, each of them containing 500,000 entries. Then you are looping through both of them, like this:

for every member in dataset1 do
  for every member in dataset2 do
    if condition1 is met, and
    if condition2 is met, and
    if condition3 is met, and
    if condition4 is met, and
    if condition5 is met
    then do something
    end if-loop
  end for-loop (dataset2)
end for-loop (dataset1)

When you count the number of actions you are doing, we see the following:

500,000 runs through dataset1
500,000 runs through dataset2
5 (number of conditions to check)
=> 1250,000,000,000 actions, this is enormous!

In top of that, you are using VBA: VBA is a scripting language, which means that every line of code is translated into machine language the moment you arrive at this line of code (if you would use another language, one which you can compile, the translation into machine language would be done only one time and this machine language would be executed afterwards)

In case you would like to continue working with VBA, there are two advises I can give to you:

  1. in case this is possible, try to work with sorted datasets
  2. minimise the number of conditions to check

This would lead to this new algorithm:

for every member in dataset1 do
  go in dataset2 from the start to the maximum, defined by the first for-loop, and do
    if condition1 is met, then:
      if condition2 is met, then:
        if condition3 is met, then:
          if condition4 is met, then:
            if condition5 is met
            then do something
            end if-loop
          end if-loop
        end if-loop
      end if-loop
    end if-loop
  end for-loop (dataset2)
end for-loop (dataset1)

This way of working would reduce the amount of actions your computer needs to execute:

500,000 runs through dataset1
log(500,000) runs through dataset2 (it's only browsed until a certain limit)
3 conditions (on average)
=> 500,000 * log(500,000) * 3 = 8,500,000 actions (on average), which is manageable

I hope this makes sense to you. In my humble opinion the main issue here is the sorting of the datasets, which will improve most your performance!

Dominique
  • 16,450
  • 15
  • 56
  • 112
  • 1
    Did something similar, where I used 3 nested IF statements, with the most unique condition in the outer most. It improved the speed quite a bit, but not as much as running it in memory in arrays. (Didn't combine the technique tho). – Sweetspot Jul 26 '16 at 09:43
0

That's a lot of data that Excel needs to evaluate. The question is, is Excel the best solution for this, or is it better if you write en Application in C++ or C# for the comparing? Because they would be much faster.

But if you need to do this in VBA maybe this Code can Help you. I use this always to Compare 2 Ranges if they have the same Data's. And i never had a Speed Problem with that Code, so maybe you can take a look at it.

    Sub Start()
    Dim rng1 As Range
    Dim rng2 As Range
    Dim bolNotEqual As Boolean
    Set rng1 = Sheets("Sheet1").Range("H3:M500000").Value
    Set rng2 = arrSource = Sheets("Ark1").Range("A3:H500000").Value

    'Compare the Sheets if both are Equal
    Call CompareWorksheetRanges(rng1, rng2, bolNotEqual)

End Sub



Sub CompareWorksheetRanges(rng1 As Range, rng2 As Range, ByRef bol As Boolean)

Dim r As Long, c As Integer

Dim lr1 As Long, lr2 As Long, lc1 As Integer, lc2 As Integer
Dim maxR As Long, maxC As Integer, cf1 As String, cf2 As String

Dim rptWB As Workbook, DiffCount As Long

'If one rng is Empty Exit sup
If rng1 Is Nothing Or rng2 Is Nothing Then Exit Sub


    If rng1.Areas.Count > 1 Or rng2.Areas.Count > 1 Then
        MsgBox "Can't compare multiple selections!", _
            vbExclamation, "Compare Worksheet Ranges"
        Exit Sub
    End If

    Application.StatusBar = "Creating the report..."
    'Testing if the Ranges have the Same sice
    Set rptWB = Workbooks.Add

    With rng1
        lr1 = .Rows.Count
        lc1 = .Columns.Count
    End With
    With rng2
        lr2 = .Rows.Count
        lc2 = .Columns.Count
    End With
    maxR = lr1
    maxC = lc1
    If maxR < lr2 Then maxR = lr2
    If maxC < lc2 Then maxC = lc2
    If lr1 <> lr2 Or lc1 <> lc2 Then
        If MsgBox("The two ranges you want to compare are of different size!" & _
            Chr(13) & "Do you want to continue anyway?", _
            vbQuestion + vbYesNo, "Compare Worksheet Ranges") = vbNo Then Exit Sub
    End If

    'End Testing sice




     DiffCount = 0
'Compare the Ranges if same Value
        For c = 1 To maxC
            Application.StatusBar = "Comparing cells " & _
                Format(c / maxC, "0 %") & "..."
            For r = 1 To maxR
                cf1 = ""
                cf2 = ""
                On Error Resume Next
                cf1 = rng1.Cells(r, c).FormulaLocal
                cf2 = rng2.Cells(r, c).FormulaLocal
                On Error GoTo 0
                If cf1 <> cf2 Then
                    DiffCount = DiffCount + 1
                    Cells(r, c).Formula = "'" & cf1 & " <> " & cf2
                End If
            Next r
        Next c

        Application.StatusBar = "Formatting the report..."



        rptWB.Close False

        Set rptWB = Nothing

        If DiffCount = 0 Then
            bol = False
        Else
            bol = True

        End If
        Application.StatusBar = False
        Application.ScreenUpdating = True

    End Sub
Moosli
  • 3,140
  • 2
  • 19
  • 45