-3

I would like to highlight duplicate rows in Excel VBA. Assume I have the following exemplary table with columns A, B, C and D for testing:

 A       B       C       D (Strings)

 1       1       1       dsf
 2       3       5       dgdgdgdg
 1       1       1       dsf
 2       2       2       xxx
 6       3       4       adsdadad
 2       2       2       xxx

The duplicate rows should be highlighted in any colour, e.g. grey. I am looking ideally for fast performing code, as it will be used for rather big tables. Note there are solutions available for highlighting duplicate cells (but not duplicate rows). I don't know how to identify if rows are duplicates and at the same time how to do that fast, i.e. without nested looping. The solution should be in VBA (not Excel).

What is the best/fastest way to achieve that?

pnuts
  • 58,317
  • 11
  • 87
  • 139
EDC
  • 613
  • 2
  • 7
  • 16
  • what about sorting first and then you only need one loop that checks if one row is like the row before? –  May 31 '15 at 15:09
  • @pony2deer I don't know if that makes a difference because you would still have to loop through the entire table for every single row, even if it's sorted – EDC May 31 '15 at 15:14
  • even a built-in function needs to loop at least once over the whole table. you can't compare values without 'touching' them at least once –  May 31 '15 at 15:16
  • but you need to check the "partial table" then for every row, so its still entire table... –  May 31 '15 at 15:24
  • I added sumproduct formula which you can add to conditional formatting - but performance might not be the best, a VBA solution is rather trivial to write – Vincent De Smet May 31 '15 at 15:27
  • 2
    that is what is said in my very first comment! check every row against the one before. please read it first –  May 31 '15 at 15:30
  • to sort the rows, wouldn't you need to make a key column combining all columns to sort on? – Vincent De Smet May 31 '15 at 15:34
  • I edited the example. Unfortunately not every row contains values, but also strings. Those muse be compared as well. Sorry I should have mentioned that. – EDC May 31 '15 at 16:11
  • 1
    Also I need a VBA solution, not an Excel one. I could record a macro of course so that's kind of a minor issue. But using VBA you can do often more than just with macros. – EDC May 31 '15 at 16:21
  • @Alex Richter: No before the edit only 1,1,1 and 2,2,2 were duplicates, but certainly not all rows. I have changed the example to fit that case. Also `[...]` was supposed to mean that the final table would be larger and would have columns with various values, not only integers and strings of a certain format. But I have removed it now. Also a duplicate row for me is if every value of one row is equal to every value of another. – EDC May 31 '15 at 17:12
  • I would humbly suggest you write (or record) some VBA code that accomplishes exactly what you want to do regardless of performance then try posting in [Code Review (Excel)](http://codereview.stackexchange.com/questions/tagged/excel) to see if performance improvements can be offered. –  May 31 '15 at 17:29
  • @Jeeped I have posted something there before. Someone told me that the place is only to review existing code, but not to rewrite any code that already works. Also I did not have any code that was working before. The problem I posted was summarized too easily at first, because I forgot that my rows also contained text (not only numbers). But that has been fixed now. – EDC May 31 '15 at 17:35
  • Put in some self effort. As it is now there are exactly 4 columns, is it? So you had been suggested using `COUNTIFS`. This is also possible with VBA via `WorksheetFunction.CountIfs` or `Evaluate("COUNTIFS...")`. So perform a single loop over all used rows and check for each row whether `COUNTIFS(A1:A[lastRow], A[rowNum], B1:B[lastRow], B[rowNum], C1:C[lastRow], C[rowNum], D1:D[lastRow], D[rowNum]) > 1`. If so, then this row is a duplicate. If you are stuck with the code, then post the code and where you are stuck and ask for help here. – Axel Richter Jun 01 '15 at 05:17
  • Ok `countifs` sounds like a plan, but as I posted below it would be probably much faster to work with filters here. When I have time I will try to create a solution based on `countif`. Advanced filters or Dictonaries are beyond of what I am capable atm and I don't have sufficient time to learn VBA properly right now, otherwise I would not ask for help in the first place. – EDC Jun 03 '15 at 19:58
  • @EDC can you provide some test file (similar in size and structure to real files you're working on)? I would create solution based on advanced filters and compare it with formula solution. I'll repeat one of my previous comments: filter solution will highlight only one of two duplicated rows (or n-1 of n) – BrakNicku Jun 04 '15 at 07:12
  • @user3964075 sure, here u go (file is a csv. I used the first free upload site that I could find): http://www.file-upload.net/download-10665126/XYZ.csv.html It contains 10,000 rows. Row 1 was used to replace multiple other rows in the table in order to generate duplicates. Please note that the csv still contains the 'observation column' (column 1) that includes a sequence from 1:10,000. This column of course can be deleted in the analysis, otherwise there would be no duplicates at all. Just to be exact in the definition: Duplicate means all values of a row are equal to those of another. – EDC Jun 04 '15 at 09:07

5 Answers5

5

add a conditional formatting with the following sumproduct formula (or a countifs)

=SUMPRODUCT(($A$1:$A$6&$B$1:$B$6&$C$1:$C$6=$A1&$B1&$C1)*1)>1

conditional formatting

Explanation:

SUMPRODUCT is handy to work with ranges which you need to manipulate prior to checking a condition. In this case I concatenate A, B & C columns across the range and compare it with the concatenation of the current row. I then convert the TRUE/FALSE array to a 1/0 array by multiplying by 1 and the SUM part of SUMPRODUCT sums the rows where the condition is true, giving me the duplicate rows (all occurences). If you have a small range, using the formula evaluation you can clearly see how this works.

It's a quick fix, but performance is not ideal, I use it a lot for detecting duplicates or generating sequential numbers.

Solution from comments suggested by ponydeer - higher performance

based on sorting suggesting, requires to add key column, put in auto filters and sort on key, then do conditional on key column:

Sorted

Vincent De Smet
  • 4,859
  • 2
  • 34
  • 41
  • not sure why down voted, the solution works - although it might not be best performance for large worksheet (turn automatic calculation off while working on the sheet and use `SHIFT`+`F9` to refresh) – Vincent De Smet May 31 '15 at 15:29
  • 2
    ++ for the solution. I may have used the more efficient [COUNTIFS function](https://support.office.com/en-us/article/COUNTIFS-function-53C4DC8E-0E5B-4E32-93DF-9CA5E7DA89ED) myself; perhaps on a named range wich defined the .CurrentRegion. –  May 31 '15 at 15:57
  • I did not downvote it. But unfortunately it doesn't work because my columns do also contain strings. Sorry I should have mentioned that. – EDC May 31 '15 at 16:12
  • 1
    @EDC - Try it with `=countifs(($A:$A, $A1, $B:$B, $B1, $C:$C, $C1, $D:$D, $D1)>1` as the CF rule formula. –  May 31 '15 at 16:42
  • @Jeeped this probably works but I need to test it in VBA first. As I edited above, I need a VBA solution, not an Excel one. But ty. – EDC May 31 '15 at 16:44
  • If you want to keep original sorting, with VBA solution, I'd build a dictionary (hashset) of key->counter, the key has to be a unique identifier for the whole row (I used a concat of the values, this does not work if you have a row `11,1,1` and `1,11,1` both concat to the same key). You'd iterate the range once, highlight rows for which the key exists in the hashset as you add them, so performance would be good, but memory usage could be high. – Vincent De Smet May 31 '15 at 19:15
  • @VincentDeSmet that sounds good but tbh I don't know how to work with dictionaries. I have just started with VBA a couple of days ago and the learning process is quite tedious. But thanks for the hint anyways. – EDC Jun 03 '15 at 19:54
3

Sort your range first regarding all columns

 Workbooks(1).Sheets(1).Range("A:C").Sort Key1:=Workbooks(1).Sheets(1).Range("A:A"), Order1:=xlAscending, Key2:=Workbooks(1).Sheets(1).Range("B:B"), Order2:=xlAscending, Key3:=Workbooks(1).Sheets(1).Range("C:C"), Order3:=xlAscending, Orientation:=xlSortRows

Then loop through all rows and compare them with the one above them

 Dim a As Application
 Set a = Application

 For i=1 to 1000 ' here you need to set the number of rows you have
   if Join(a.Transpose(a.Transpose(ActiveSheet.Rows(i).Value)), Chr(0)) = _
   Join(a.Transpose(a.Transpose(Sheets(1).Rows(i+1).Value)), Chr(0)) then

      Sheets(1).Range(i+1 & ":" & i+1).EntireRow.Interior.Color = 49407

   end if

 Next i

The comparison of two rows is based on this thread: How to compare two entire rows in a sheet

Please insert the names of your Workbook, Sheet and set your range and the limits in the code yourself.

Community
  • 1
  • 1
  • Thanks I will try that out! – EDC May 31 '15 at 16:14
  • One thing I just noticed: Can I revert the sort afterwards? I would like to keep the original order because it contains information about time (upper rows were created earlier etc). – EDC May 31 '15 at 16:24
  • maybe the last post of this thread, is what you are looking for http://windowssecrets.com/forums/showthread.php/8265-UNDO-AN-ACTION-IN-VBA –  May 31 '15 at 16:29
3

I have tested 3 different approaches on the sample file link from OP's comment. Probably the VBA implementations were not optimal, but below are the results with average time of 100 passes:

1) Conditional formatting using:

a)SUMPRODUCT concatenating columns - 3s

b) COUNTIFS with full column reference - 1.9s

c) COUNTIFS referencing used ranges - 0.2s

2) Sorting the range on all columns, comparing row by row, sorting back - 0.3s

3) Using advanced filter 3.5s

Here is the code for the fastest method:

Sub CF1()

    Application.ScreenUpdating = False

    Dim sFormula As String
    Dim rRng As Range
    Dim nCol As Integer, i As Integer

    Set rRng = Range("A1").CurrentRegion
    nCol = rRng.Columns.Count

    'build the formula
    sFormula = "=COUNTIFS("

    For i = 1 To nCol
      sFormula = sFormula & rRng.Columns(i).Address & "," & _
         rRng.Cells(1, i).Address(False, True)
      If i < nCol Then sFormula = sFormula & ","
    Next
    sFormula = sFormula & ")>1"

    'write the formula in helper cell to get it's local version
    rRng.Cells(1, nCol + 1).Formula = sFormula

    rRng.FormatConditions.Delete
    With rRng.FormatConditions.Add(Type:=xlExpression, _
            Formula1:=rRng.Cells(1, nCol + 1).FormulaLocal)
       .Interior.ThemeColor = xlThemeColorAccent3
    End With

    rRng.Cells(1, nCol + 1).Clear

    Application.ScreenUpdating = True
End Sub
BrakNicku
  • 5,935
  • 3
  • 24
  • 38
  • thanks for your time and effort! Great sol. Could you provide the code for the advanced filters too? – EDC Jun 04 '15 at 16:21
  • 1
    sweet stuff! could you also release your benchmark code, I'm interested in VBA approach of using a dictionary and testing if existing keys (although the dictionary object will be called through COM, so that might actually be slow as well) – Vincent De Smet Jun 05 '15 at 08:52
  • @VincentDeSmet the _benchmark_ code was pretty simple - `GetTickCount` API call before and after running a loop of 100 calls of each Sub. I'm not sure if it's a best approach, but I think it was enough to get general speed comparison. – BrakNicku Jun 05 '15 at 08:58
2

I think fastest/best will depend upon the proportion of duplicates – only one row should be quicker than 50% as in the example – and on the actual size of the array (how many columns from which to create a key, etc).

Given that it is rarely possible to beat inbuilt functions with ‘pure’ VBA I suspect using the UI, within VBA if desired, will be faster in some circumstances. Eg:

Add an index column (series fill would serve), copy entire sheet (say to Sheet2), apply Remove Duplicates to all but index column, then apply as CF formula rule of this kind the relevant range of the original sheet:

=$A1=MATCH($A1,Sheet2!$A$1:$A$3000,0)>0  

Assuming the start point is like so:

SO30558893 first example

and a ColumnA inserted with numeric series fill starting 1, Sheet2 should look so after Remove Duplicates:

SO30558893 second example

I have assumed ColumnE is to be ignored as far as duplication is concerned.

In source sheet, select array (from A1: - see!), eg A1:I6 and HOME >Styles - Conditional Formatting, New Rule..., Use a formula to determine which cells to format, Format values where this formula is true::

=$A1=MATCH($A1,Sheet2!$A:$A,0)>0  

Format..., Fill, grey, OK, OK.

For me results in:

SO30558893 third example

Community
  • 1
  • 1
pnuts
  • 58,317
  • 11
  • 87
  • 139
  • sounds interesting, but I don't know what you mean by this: 'apply as CF formula rule of this kind the relevant range of the original sheet: `=$A1=MATCH($A1,Sheet2!$A$1:$A$3000,0)>0`'. After I removed the duplicates I would be done, or not? – EDC May 31 '15 at 16:17
  • 1
    I think in this approach CF will highlight only one of two identical rows (or n-1 of n) – BrakNicku May 31 '15 at 16:33
  • If it's OK to highlight only `duplicates` (not `original`), one more solution is possible - use `advanced filter`->`unique values only` and apply `original` formatting to all visible cells. – BrakNicku May 31 '15 at 17:06
  • Sorry but I don't understand _I tend to prefer Sort & Filter - Advanced_, because that's exactly what I meant. I do not use English version of excel and it's not my native language, so maybe I wasn't clear enough – BrakNicku May 31 '15 at 17:21
  • 1
    @pnuts Just wanted to add that's exactly what I was looking for. If (advanced) filters were used, the solution would assumably be much faster compared to loops. – EDC May 31 '15 at 17:57
  • @user3964075 Do you know how to solve this using advanced filters? I would be very interested. – EDC Jun 03 '15 at 19:59
0
Sub HighlightDuplicateRows_2()

  'This one is more modifiable and can handle multiple columns of data
  'Just add another *WorksheetFunction.CountIf(Range("A2:A" & LastRow),Cells(r,1).Value) > 1* and change the column values

  Dim LastRow As Long

  LastRow = Cells(Rows.Count, "A").End(xlUp).Row

  For r = 2 To LastRow

    If WorksheetFunction.CountIf(Range("A2:A" & LastRow), Cells(r, 1).Value) > 1 And WorksheetFunction.CountIf(Range("B2:B" & LastRow), Cells(r, 2).Value) > 1 Then
    Rows(r).Interior.ColorIndex = 6
    End If

  Next r

End Sub
gawi
  • 2,843
  • 4
  • 29
  • 44