0

I have a spreadsheet with over 100k rows - each row has a document name (file name) and a revision code. Some documents are listed twice within the same column however that is because the second time I received the document it was a revision of the document which is noted on column B. For example,

      File Name                Revision  
25697-100-3DR-J04F-00003         002
25697-100-3DR-J04F-00003         001   

What I am looking to have excel do is to only keep the row with the latest copy of the file. In this example I would only want to keep the row that contains revision 002 and delete the entire row that contains revision 001. Also, there are some files that have been listed 9 times due to 9 revisions having been made to that file. ANY help on this is greatly appreciated!!

Robin Mackenzie
  • 18,801
  • 7
  • 38
  • 56
Pvelez_3
  • 115
  • 9
  • 12
    Sort descending on revision then use Remove Duplicates on the Data tab. – Scott Craner Mar 06 '17 at 20:17
  • If you are looking for speed then you might want to consider using arrays: http://stackoverflow.com/questions/36873359/fastest-way-to-delete-rows-which-cannot-be-grabbed-with-specialcells – Ralph Mar 06 '17 at 20:23
  • @ You have to sort descending first on File Name and then Revision, then use remove duplicates on Data tab – Ibo Mar 06 '17 at 20:36
  • 1
    @Ibo - as long as Revision is in descending order, it won't matter what order File Name is in. Even though the file names won't be in any specific order, if the revisions are in descending sequence, the latest revision for any specific file name will occur before any other occurrence of that same file name. – YowE3K Mar 06 '17 at 20:41
  • @YowE3K I think you are assuming that similar documents are next to each other. If it is not the case, wrong versions will be deleted – Ibo Mar 06 '17 at 20:51
  • 1
    @Ibo - No, the first occurrence (i.e. latest revision) of each filename will be kept, irrespective of how many rows until the next occurrence (i.e. earlier revision). – YowE3K Mar 06 '17 at 20:54
  • @Ibo the duplicates do not need to be in congruent cells to be considered duplicates. It could be in the first row and the 100,000th row with different data between and the 100,000th will be removed. – Scott Craner Mar 06 '17 at 20:56
  • Ok, I got it. You are right, the only thing is that I was selecting only documents column to delete without expanding, which was a wrong approach. I noticed that both of the columns should be selected and the dupl;icates must be only based on column A. Learned sth! :) Also, it indicates that if we don't know how exactly the function works, it can give us wrong results! – Ibo Mar 06 '17 at 21:10
  • I just noticed that the best practice is to select both columns when sorting and when deleting the duplicates, we just need to tell Excel to remove duplicates only base on column A and not both A and B. Also, the header row must be there to function properly! – Ibo Mar 06 '17 at 21:13
  • @Ibo nope the header row is not needed, both the sort and the remove duplicates has the option to include or exclude a header row. – Scott Craner Mar 06 '17 at 21:15
  • @Scott Craner I am using Excel 2010 and if I use A->Z under Data tab it does not ask about the existence of the header. There is such an option if the SORT function is chosen which launches the sort user form and it has a checkbox for the header. – Ibo Mar 06 '17 at 21:27
  • 1
    The buttons are tied to the Sort button options. If you have that option check under the sort button then it will assume headers, if it is unchecked, the A-Z and Z-A buttons will assume no header. So you can still set the header option on the the shortcut buttons. @Ibo – Scott Craner Mar 06 '17 at 21:31
  • You could also create a recordset of the data and run a query against the data. Then use CopyFromRecordset since the data is large. That way the recordset should contain only the records you want and you can clear the original data and overwrite. – PaulG Mar 06 '17 at 22:04

1 Answers1

0

Not VBA, but here is an array formula to mark cells for deletion. Filename order doesn't matter. Assuming Filename is column a and revision is column b, put this in column c. Because it is an array formula it needs curly brackets (use ctrl shift enter).

{=IF(B2=MAX(IF($A$2:$A$11=A2,$B$2:$B$11)),"Keep","Delete")}

Output:

File Name Revision Keep or Delete a 1 Delete a 2 Delete a 3 Keep b 1 Delete b 2 Delete b 3 Keep c 1 Delete c 2 Delete c 3 Keep d 4 Keep

The formula is getting the maximum revision for each filename, and checking if the current row is the maximum.

Max xaM
  • 348
  • 2
  • 12