Identify and remove remove duplicates from three columns and keep oldest - Excel

Question

I want to identify duplicates that have the same "CustomerName" "ZipCode," and "DOB," and only keep the record with the oldest "Date."

For instance, below, I want to remove the second John Smith record as it is the newer record.

TransID	Date	CustomerName	ZipCode	DOB	Email
5X42	04/13/18	John Smith	90210	11/23/87	johnsmith@gmail.com
7YC5	01/27/21	John Smith	90210	11/23/87	johnsmith@gmail.com
28N2	11/17/20	Mary Jones	56451	05/16/91	maryjones@gmail.com

I have attempted to use MAXIFS() with limited luck.

Sort your data on the date ascending, then use Remove Duplicate, making sure to include all your data and use the three column as the criteria. — Scott Craner, Mar 01 '23 at 14:52

David Leal · Answer 1 · 2023-03-02T03:48:45.637

You can achieve it by ranking by Date column within the same group, where the group is represented by the rows with the same primary key. In your case a combination of the following columns: CustomerName, ZipCode, and DOB. You can use the following formula in cell A6:

=LET(in, A2:F4,pks, INDEX(in,,3)&"_"&INDEX(in,,4)&"_"&INDEX(in,,5),
 dates, INDEX(in,,2), rnk, MAP(pks, dates,LAMBDA(pk,d, 
  SUM((pks=pk) * (dates < d))+1)), FILTER(in, rnk=1))

Here is the output:

The name rnk, ranks each element of the same group, assigning the value 1 to the oldest date. This approach doesn't require sorting the data, SUM calculation ensures it. So the output is presented in the same order as the input, just removing duplicates.

Note: In this case, there is no risk of false positives by concatenation (in the way it was built the primary key), because the delimiter (_) cannot be present in the dates or zipcodes. Check the comment section from @JvdV's answer to this question: Finding pairs of cells in two columns. Just for fun, we can avoid using concatenation and keep the same approach, via MMULT to identify the groups (grs) for each row, where grId is the group id, in our case we have only 1,2.

=LET(in, A2:F4, pks, CHOOSECOLS(in,3,4,5), n, COLUMNS(pks), dates, INDEX(in,,2),
 ux, UNIQUE(pks), grId, SEQUENCE(ROWS(ux)), ones, SEQUENCE(n,,1,0),
 grs, BYROW(pks, LAMBDA(pk, FILTER(grId, MMULT(N(pk=ux), ones)=n))),
 rnk, MAP(grs, dates,LAMBDA(g,d, SUM((grs=g)*(dates < d))+1)),FILTER(in,rnk=1))

Notice, this approach doesn't need to sort the input data.

thanks, @VBasic2008!, the question recall me something similar I did for calculating ranks within groups — David Leal, Mar 02 '23 at 03:53

VBasic2008 · Answer 2 · 2023-03-01T16:53:04.117

Remove Duplicates Based on Multiple Columns

=LET(tData,A1:F4,DateColumn,2,UniqueColumns,{3;4;5},Delimiter,"|!|",
    Head,TAKE(tData,1),Data,DROP(tData,1),rSeq,SEQUENCE(ROWS(Data)),
    Sorted,SORT(HSTACK(rSeq,Data),DateColumn+1),uData,CHOOSECOLS(Sorted,UniqueColumns+1),
    jData,BYROW(uData,LAMBDA(uRow,TEXTJOIN(Delimiter,,uRow))),rIndexes,XMATCH(UNIQUE(jData),jData),
    rData,INDEX(Sorted,rIndexes,SEQUENCE(,COLUMNS(Sorted))),
VSTACK(Head,DROP(SORT(rData,1),,1)))

Highlights

Write the constants to variables.
Use TAKE and DROP to return the headers (Head) and data (Data) in arrays.
Use SEQUENCE to return an ascending integer sequence in an array (rSeq).
Use HSTACK to stack the data to it, to keep track of the order, and sort it by the date column ascending (Sorted).
Use CHOOSECOLS to return the unique columns in an array (uData).
Use BYROW with TEXTJOIN to return the concatenated (joined) unique columns in a single-column array (jData).
Use XMATCH and UNIQUE to return an array of the unique row indexes (rIndexes).
Use INDEX on the sorted data with the row indexes to remove duplicates (rData).
Restore the initial order by sorting the data ascending by the stacked integer sequence column (1).
Remove the column and stack the data to the headers (Head).

Identify and remove remove duplicates from three columns and keep oldest - Excel

2 Answers2

Remove Duplicates Based on Multiple Columns