
I am trying to sum up values based on duplicates found across columns A-O. I'm using the macro below. There are around 500k+ records and it hangs badly.

Sub Formulae(TargetCol1, TargetCol2, ConcatCol, Col1, Col2, StartRow, EndRow, Sheet)

    Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Formula = "=SUMIF($" & ConcatCol & "$" & CStr(StartRow) & ":$" & ConcatCol & "$" & CStr(EndRow) & "," & ConcatCol & CStr(StartRow) & ",$" & Col1 & "$" & CStr(StartRow) & ":$" & Col1 & "$" & CStr(EndRow) & ")"

    Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Select
    Selection.Copy
    Sheets(Sheet).Range(TargetCol1 & CStr(EndRow)).Select
    Range(Selection, Selection.End(xlUp)).Select
    Application.CutCopyMode = False
    Selection.FillDown

    Call PasteSpecial(TargetCol1, "T", StartRow, EndRow)

    Sheets(Sheet).Range(TargetCol2 & CStr(StartRow)).Formula = "=SUMIF($" & ConcatCol & "$" & CStr(StartRow) & ":$" & ConcatCol & "$" & CStr(EndRow) & "," & ConcatCol & CStr(StartRow) & ",$" & Col2 & "$" & CStr(StartRow) & ":$" & Col2 & "$" & CStr(EndRow) & ")"

    Sheets(Sheet).Range(TargetCol2 & CStr(StartRow)).Select
    Selection.Copy
    Sheets(Sheet).Range(TargetCol2 & CStr(EndRow)).Select
    Range(Selection, Selection.End(xlUp)).Select
    Application.CutCopyMode = False
    Selection.FillDown

    Call PasteSpecial(TargetCol2, "U", StartRow, EndRow)

End Sub


Sub PasteSpecial(Col1, Col2, StartRow, EndRow)

    Range(Col1 & CStr(StartRow)).Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.Copy
    Range(Col2 & CStr(StartRow)).Select
    Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
        :=False, Transpose:=False

End Sub

Let me explain the macro briefly. I have columns A-O and I have to group them; based on the grouping I have to sum columns P and Q. I have a function that makes a concatenated string out of the 15 columns and stores it in column AA. Based on this column I use SUMIF to sum all duplicate values:

 =SUMIF($AA$2:$AA$500000,$AA2,$P$2:$P$500000)
 =SUMIF($AA$2:$AA$500000,$AA2,$Q$2:$Q$500000)

Then I copy and paste-special as values to remove the formulas, into 2 new columns (the PasteSpecial sub in the macro code above).
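(Incidentally, the formula-to-values conversion can be done without the clipboard at all; the range below is a placeholder, not taken from the question:)

```vba
' Convert formulas to their values in place, with no Copy/PasteSpecial.
' "T2:T500000" is a placeholder; substitute the real target range.
Sub FormulasToValuesSketch()
    With Sheets("Sheet1").Range("T2:T500000")
        .Value = .Value
    End With
End Sub
```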

Finally I call RemoveDuplicates to remove the duplicate rows.

I have used the .RemoveDuplicates method, which seems to work pretty fast even on such a huge dataset. Is there any predefined function in Excel that would also sum the values of the duplicates and then remove the duplicate entries?

Sub Remove_Duplicates_In_A_Range(StartRow, EndRow, Sheet, StartCol, EndCol, level)

    Sheets(Sheet).Range(StartCol & CStr(StartRow) & ":" & EndCol & CStr(EndRow)).RemoveDuplicates Columns:=20, Header:=xlNo

End Sub

The above logic hangs badly, eating all CPU resources and crashing Excel...

Could someone please optimize the above macro to make it work with 500k+ records? A run time of 1-2 minutes max is acceptable.

Please help!!!

EDIT: By 500k+ records I mean A1:O500000. I am supposed to check for duplicates in this manner: the combination A1,B1,...,O1 against A2,B2,...,O2, then against A3,B3,...,O3, and so on, down to A500000,B500000,...,O500000.

In short, I am supposed to check whether the entire A1-O1 set matches the entire A2-O2 set, or A3-O3, and so on, all the way to A500000-O500000.

For every match between entire A-O record sets I need to sum their respective P and Q columns. Say, for example, the A1-O1 set matches the A2-O2 set; then add P1+P2 and Q1+Q2 and store them in P1 and Q1, or something similar.

In either case, I need to retain each original record set, say A1-O1, with the summed-up values of its duplicates (and its own) in P1 and Q1.

I don't suppose we can attach a demo of the Excel sheet here, can we? :(
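(For reference, the row-grouping described above maps naturally onto a Scripting.Dictionary keyed by the concatenated A-O string, accumulating P and Q per key in a single pass over an in-memory array. This is only a sketch; the ranges and sheet name are assumptions, not taken from the workbook:)

```vba
' Sketch: one-pass group-and-sum keyed on the concatenated A-O string.
' Range sizes and the sheet name are assumptions for illustration.
Sub GroupAndSumSketch()
    Dim data As Variant, key As String
    Dim dict As Object, i As Long, j As Long
    Set dict = CreateObject("Scripting.Dictionary")

    data = Sheets("Sheet1").Range("A1:Q500000").Value   ' A-O plus P, Q

    For i = 1 To UBound(data, 1)
        key = ""
        For j = 1 To 15                      ' columns A-O
            key = key & data(i, j) & Chr(0)  ' separator avoids false joins
        Next
        If dict.Exists(key) Then
            dict(key) = Array(dict(key)(0) + data(i, 16), _
                              dict(key)(1) + data(i, 17))
        Else
            dict.Add key, Array(CDbl(data(i, 16)), CDbl(data(i, 17)))
        End If
    Next

    ' dict now holds one entry per unique A-O combination,
    ' with the summed P and Q values for that combination.
End Sub
```

Writing the results back as a block afterwards, rather than cell by cell, keeps the whole job to a handful of sheet operations.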

EDIT2:

Routine for replicating the SUMIF formula across all cells:

Sub PreNettingBenefits(StartRow1, EndRow1, StartRow2, EndRow2, Col_Asset, Col_Liab, Src_Col_Asset, Src_Col_Liab, ConcatCol, Src_ConcatCol, level, Sheet2, Sheet1)

    '=SUMIF(Sheet1!$AA$2:$AA$81336,Sheet2!AA2,Sheet1!$P$2:$P$81336)
    Application.Calculation = xlCalculationAutomatic

    Sheets(Sheet2).Range(Col_Asset & CStr(StartRow2)).Formula = "=SUMIF(" & Sheet1 & "!$" & Src_ConcatCol & "$" & CStr(StartRow1) & ":$" & Src_ConcatCol & "$" & CStr(EndRow1) & "," & Sheet2 & "!" & ConcatCol & CStr(StartRow2) & "," & Sheet1 & "!$" & Src_Col_Asset & "$" & CStr(StartRow1) & ":$" & Src_Col_Asset & "$" & CStr(EndRow1) & ")"
    Sheets(Sheet2).Range(Col_Asset & CStr(StartRow2)).Select
    Selection.Copy
    MsgBox Sheets(Sheet2).Range(Col_Asset & CStr(EndRow2)).Address
    Sheets(Sheet2).Range(Col_Asset & CStr(EndRow2)).Select
    Range(Col_Asset & CStr(StartRow2) & ":" & Col_Asset & CStr(EndRow2)).Select
    Application.CutCopyMode = False
    Selection.FillDown

    Sheets(Sheet2).Range(Col_Liab & CStr(StartRow2)).Formula = "=SUMIF(" & Sheet1 & "!$" & Src_ConcatCol & "$" & CStr(StartRow1) & ":$" & Src_ConcatCol & "$" & CStr(EndRow1) & "," & Sheet2 & "!" & ConcatCol & CStr(StartRow2) & "," & Sheet1 & "!$" & Src_Col_Liab & "$" & CStr(StartRow1) & ":$" & Src_Col_Liab & "$" & CStr(EndRow1) & ")"
    Sheets(Sheet2).Range(Col_Liab & CStr(StartRow2)).Select
    Selection.Copy
    MsgBox Sheets(Sheet2).Range(Col_Liab & CStr(EndRow2)).Address
    Sheets(Sheet2).Range(Col_Liab & CStr(EndRow2)).Select
    Range(Col_Liab & CStr(StartRow2) & ":" & Col_Liab & CStr(EndRow2)).Select
    Application.CutCopyMode = False
    Selection.FillDown

    Application.Calculation = xlCalculationManual

End Sub

It hangs pretty badly. What's the problem with replicating the formula across 30k-40k rows? Could someone please optimise the code?

Sunny D'Souza
  • I'm a little confused by the grouping. You just want to sum all the dupes (each occurrence after first) for all the cells in rows A-O? – Gaijinhunter Aug 23 '11 at 13:37
  • yeah, I agree it wasn't very clear...I edited my main question. I guess it should be clearer now. – Sunny D'Souza Aug 24 '11 at 04:02
  • Remember to turn off screenupdating by placing "application.screenupdating = false" at the start of the code and then turn it back to true at the end. That should help speed up things quite a bit. – Gaijinhunter Aug 24 '11 at 07:04
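(The settings mentioned in the last comment, as a general pattern any of the macros above could be wrapped in; nothing here is specific to this workbook:)

```vba
' Typical wrapper for long-running macros: suspend screen redraws,
' automatic recalculation and events, then restore them afterwards.
Sub RunFast()
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual
    Application.EnableEvents = False

    ' ... heavy work goes here ...

    Application.EnableEvents = True
    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True
End Sub
```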

4 Answers


Something must be terribly wrong with how you are doing the adding of the duplicates. Since you were scant on details of the data you are working with, I don't know if this is the same, but I populated A1:O33334 (over 500k cells) with a random number between 1 and 10,000.

Using a dictionary object (I am known for my love and over-use of it), I went through all of them and summed only the duplicate values and then slapped the unique list of elements into column A in sheet2.

Reasons why a dictionary might be the thing to use:

  • You can weed out duplicates
  • You can check if a value exists in the dictionary or not
  • You can transpose the unique list easily onto Excel

The dupe checking, addition, and copying of the unique cells takes only 2 seconds. Here is the code for your reference:

Sub test()

Application.ScreenUpdating = False
Dim vArray As Variant
Dim result As Long
Dim i As Long, j As Long
Dim dict As Object
Set dict = CreateObject("scripting.dictionary")

vArray = Range("A1:O33334").Value

On Error Resume Next
For i = 1 To UBound(vArray, 1)
    For j = 1 To UBound(vArray, 2)
        If dict.exists(vArray(i, j)) = False Then
            dict.Add vArray(i, j), 1
        Else
            result = result + vArray(i, j)
        End If
    Next
Next

Sheet2.Range("a1").Resize(dict.Count).Value = _
Application.Transpose(dict.keys)

Application.ScreenUpdating = True
MsgBox "Total for duplicate cells: " & result & vbLf & _
    "Unique cells copied: " & dict.Count

End Sub
Gaijinhunter
  • Thanks Issun, but 500k+ records meant A1:O500000. I'm supposed to check for duplicates as combinations: A1,B1,...,O1 against A2,B2,...,O2, against A3,B3,...,O3, and so on, down to A500000-O500000. For every match between entire A-O recordsets I need to sum their respective P, Q columns. Say for example the A1-O1 set matched the A2-O2 set; then add P1,Q1 and P2,Q2 and store in P1,Q1 or something. Does the above code handle it? – Sunny D'Souza Aug 24 '11 at 03:54
  • Thanks for the additional info. Sorry, the above code won't handle exactly what you are looking for, since it doesn't do any grouping. When you say the A1-Q1 set, do you mean you are only comparing the rows to each other? What type of data is in each cell? Is it string data in A-O and numeric data in P and Q? – Gaijinhunter Aug 24 '11 at 05:21
  • yeah, numeric data in P and Q. But columns A to O, some have numeric and some have string data; I use Concatenate to make a combined string of these 15 columns, which makes it easy. Yes, I am comparing the rows to each other. To put it in more granularity, I'm actually checking A1 with A2, B1 with B2, C1 with C2 and so on...till O1 with O2, and if all are the same...sum the values in the P, Q columns accordingly....Can you please help, Issun :( – Sunny D'Souza Aug 24 '11 at 05:29
  • I'll try. Is the expected case that they will be a match? My thought is that it's best to check each value one by one; that way you can terminate the check once there is one cell that is off. – Gaijinhunter Aug 24 '11 at 05:41
  • I tried that as a first approach...but there are 500k+ records...nope...it hangs pretty hard. :((( Yes, that's the expected case. Moreover, I need to put the grouping in loops...like first group/sum based on columns A-O, then group/sum based on A-N, then A-M...so the number of columns in the grouping keeps decreasing, so definitely in every run the number of records would decrease...when it comes to only A you might end up with only 1-2 records – Sunny D'Souza Aug 24 '11 at 05:45
  • Thanks for your efforts Issun – Sunny D'Souza Aug 24 '11 at 05:45
  • I'm sorry Sunny, I've tried several things, but I'm afraid this is beyond my scope of expertise and time... Sorry I couldn't be of more help. If you want to keep the question active, un-accept the answer. :) best of luck! – Gaijinhunter Aug 24 '11 at 06:20
  • Your snippets of code are still very interesting to read, and I should use dictionaries more often :) >> +1 – JMax Aug 24 '11 at 06:32
  • Issun: It's alright ;) on an unrelated note, can you tell me how I can get the start row no. and end row no. of a set when I apply filters on a particular column? – Sunny D'Souza Aug 24 '11 at 06:50
  • For column A you can find the first cell with a value using Range("A1").End(xlDown).Row, and the last cell with a value using Range("A" & Rows.Count).End(xlUp).Row – Gaijinhunter Aug 24 '11 at 08:15

You shouldn't select every cell when executing code.

Btw, if you have a look at your code, some statements are useless:

Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Select
Selection.Copy

is never pasted

For performance issues, see some tips in this thread: Benchmarking VBA Code
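As an illustration (a sketch, not a drop-in replacement; the parameter names mirror the question's Formulae sub), the first fill could be written without Select or the clipboard:

```vba
' Select-free formula fill: assign the formula to the whole target
' range in one statement; relative references adjust per row, so the
' separate Copy / FillDown steps become unnecessary.
Sub FillSumIf(Sheet As String, TargetCol As String, ConcatCol As String, _
              SumCol As String, StartRow As Long, EndRow As Long)
    Sheets(Sheet).Range(TargetCol & StartRow & ":" & TargetCol & EndRow).Formula = _
        "=SUMIF($" & ConcatCol & "$" & StartRow & ":$" & ConcatCol & "$" & EndRow & _
        "," & ConcatCol & StartRow & ",$" & SumCol & "$" & StartRow & ":$" & SumCol & "$" & EndRow & ")"
End Sub
```

Writing the formula to the whole range at once both fills and "copies down" in a single sheet operation.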

JMax
  • Thanks JMax for the response, but could you please edit my macro to show where I could remove the Select statements? How should I go about it if I don't use Select? I'm new to VBA but have a deliverable...please help :((( – Sunny D'Souza Aug 23 '11 at 13:36
  • JMax should get a cut of your payment for that deliverable if he actually does edit it. :) – Gaijinhunter Aug 23 '11 at 13:46

The essence of the question, as I understand it, is to find the duplicates, add them up, and then delete them. You also mentioned grouping them, but it is not clear how. In any case, I would ditch the macros: operations on individual rows aren't going to work on a dataset this size.

Here are some steps I would take. Modify them to fit your needs:

Use the CONCATENATE function to create a new column to the right of your dataset. For example:

=concatenate(a2,b2,c2,d2,e2)

Create a column called Dups and use the following to populate it:

=if(countif(dataSetNamedRange,aa2)>1,1,0)

In the formula above, AA2 refers to the concatenated column for that row. The result is that you now have all dups flagged. Now use the filter tools in the Data menu to create a sort or a filter to fit your grouping needs. To add up the values, use DSUM. To delete the dups, use an advanced filter. Good luck.
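(The advanced-filter step can also be driven from VBA; the range below is a placeholder, and AA1 is assumed to hold a header cell:)

```vba
' Sketch: hide duplicate rows in place based on the concatenated column.
' "AA1:AA500000" is a placeholder; AA1 is assumed to be a header cell.
Sub FilterUniqueSketch()
    Sheets("Sheet1").Range("AA1:AA500000").AdvancedFilter _
        Action:=xlFilterInPlace, Unique:=True
End Sub
```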

Andrew Cowenhoven

I am adding this as a second answer since it's going to get long...

Because I am a stubborn mule, I tried many different things; I think you've reached the limit of what Excel can do. The best function I could come up with was this, and note I am using 50,000 rows, not your 500,000:

  • 50,000 rows with 100 unique rows, randomly spread: 1m:47s
  • 50,000 rows with 50 unique rows, randomly spread: 57s
  • 50,000 rows with 25 unique rows, randomly spread: 28s
  • 50,000 rows with 10 unique rows, randomly spread: 12s
  • 50,000 rows with 5 unique rows, randomly spread: 6s

As you can see, the function will deteriorate as the number of unique rows increases. I have a lot of wacky ideas here, so I thought I'd share my code for the sake of research:

  • I take the entire range of 750k cells and dump it into a variant array (0.2 seconds)
  • I dump the P & Q columns into a similar variant array for use later
  • I make an array of 50,000 strings (rows) from the variant array (only 1 second or so!)
  • I say goodbye to the massive variant array to free up memory
  • I start my loop through each row, comparing it against all 50k rows...
  • If a dupe row is found, it's added to the dupe dictionary so we don't repeat the process on that row later
  • When the dupe is found, its P & Q totals are added to the P & Q of the row in question
  • After checking all 50k rows, we slap the total into the R column of the row and move on
  • If the row has been noted as a dupe in the dupe dictionary, we skip it (evil GoTo, beware!)
Sub test()

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual

Dim rowArray As Variant
Dim totalArray As Variant
Dim i As Long, j As Long
Dim dupeDict As Object
Set dupeDict = CreateObject("scripting.dictionary")
Dim count As Long
Dim rowData() As String

'dump the cells into an single array
rowArray = Range("A1:O50000").Value

'grab totals from P and Q to keep them seperate
totalArray = Range("P1:Q50000").Value

'create strings for each row
ReDim rowData(1 To 50000)

'create a string for each row
For i = 1 To 50000
    For j = 1 To 15
        rowData(i) = rowData(i) & rowArray(i, j)
    Next
Next

'free up that memory
Set rowArray = Nothing

'check all rows, total P & Q if match
On Error Resume Next
For i = 1 To 50000
    'skip row and move to next if we've seen it
    If dupeDict.exists(i) = True Then
        GoTo Dupe
    End If
    count = 0
    For j = 1 To 50000
        If rowData(i) = rowData(j) Then
            dupeDict.Add j, 1 'add that sucker to the dupe dict
            count = count + totalArray(j, 1) + totalArray(j, 2)
        End If
    Next
    'enter final total in column R (moved outside the inner loop so it
    'runs once per row, not 50,000 times)
    Cells(i, 18).Value = count
Dupe:
Next

Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic

End Sub
Gaijinhunter
  • Thanks Issun, even I have hit on a logic; grouping is done in seconds, but summing is taking long. I am using =SUMIF in one cell and replicating it across 30k-40k cells at a time. I don't know why it's hanging. Check the main post to see the function for SUMIF...do you have an email address or chat where we could take it private? Once we find a final solution maybe we can post it here... – Sunny D'Souza Aug 24 '11 at 12:57
  • I appreciate the offer Sunny, but I'm pretty maxed out on this problem - spent hours on it last night working out and timing different solutions trying to find the fastest. In the end, I'm not educated enough about big scale data processing and/or Excel just isn't strong enough to do this that fast. :) – Gaijinhunter Aug 25 '11 at 08:12