2

I want to identify duplicates that have the same "CustomerName" "ZipCode," and "DOB," and only keep the record with the oldest "Date."

For instance, below, I want to remove the second John Smith record as it is the newer record.

TransID Date CustomerName ZipCode DOB Email
5X42 04/13/18 John Smith 90210 11/23/87 johnsmith@gmail.com
7YC5 01/27/21 John Smith 90210 11/23/87 johnsmith@gmail.com
28N2 11/17/20 Mary Jones 56451 05/16/91 maryjones@gmail.com

I have attempted to use MAXIFS() with limited luck.

Scott Craner
  • 148,073
  • 10
  • 49
  • 81
Matthew L
  • 21
  • 2
  • 3
    Sort your data on the date ascending, then use Remove Duplicate, making sure to include all your data and use the three column as the criteria. – Scott Craner Mar 01 '23 at 14:52

2 Answers2

1

You can achieve it by ranking by Date column within the same group, where the group is represented by the rows with the same primary key. In your case a combination of the following columns: CustomerName, ZipCode, and DOB. You can use the following formula in cell A6:

=LET(in, A2:F4,pks, INDEX(in,,3)&"_"&INDEX(in,,4)&"_"&INDEX(in,,5),
 dates, INDEX(in,,2), rnk, MAP(pks, dates,LAMBDA(pk,d, 
  SUM((pks=pk) * (dates < d))+1)), FILTER(in, rnk=1))

Here is the output: excel output

The name rnk, ranks each element of the same group, assigning the value 1 to the oldest date. This approach doesn't require sorting the data, SUM calculation ensures it. So the output is presented in the same order as the input, just removing duplicates.

Note: In this case, there is no risk of false positives by concatenation (in the way it was built the primary key), because the delimiter (_) cannot be present in the dates or zipcodes. Check the comment section from @JvdV's answer to this question: Finding pairs of cells in two columns. Just for fun, we can avoid using concatenation and keep the same approach, via MMULT to identify the groups (grs) for each row, where grId is the group id, in our case we have only 1,2.

=LET(in, A2:F4, pks, CHOOSECOLS(in,3,4,5), n, COLUMNS(pks), dates, INDEX(in,,2),
 ux, UNIQUE(pks), grId, SEQUENCE(ROWS(ux)), ones, SEQUENCE(n,,1,0),
 grs, BYROW(pks, LAMBDA(pk, FILTER(grId, MMULT(N(pk=ux), ones)=n))),
 rnk, MAP(grs, dates,LAMBDA(g,d, SUM((grs=g)*(dates < d))+1)),FILTER(in,rnk=1))

Notice, this approach doesn't need to sort the input data.

David Leal
  • 6,373
  • 4
  • 29
  • 56
0

Remove Duplicates Based on Multiple Columns

=LET(tData,A1:F4,DateColumn,2,UniqueColumns,{3;4;5},Delimiter,"|!|",
    Head,TAKE(tData,1),Data,DROP(tData,1),rSeq,SEQUENCE(ROWS(Data)),
    Sorted,SORT(HSTACK(rSeq,Data),DateColumn+1),uData,CHOOSECOLS(Sorted,UniqueColumns+1),
    jData,BYROW(uData,LAMBDA(uRow,TEXTJOIN(Delimiter,,uRow))),rIndexes,XMATCH(UNIQUE(jData),jData),
    rData,INDEX(Sorted,rIndexes,SEQUENCE(,COLUMNS(Sorted))),
VSTACK(Head,DROP(SORT(rData,1),,1)))

Highlights

  • Write the constants to variables.
  • Use TAKE and DROP to return the headers (Head) and data (Data) in arrays.
  • Use SEQUENCE to return an ascending integer sequence in an array (rSeq).
  • Use HSTACK to stack the data to it, to keep track of the order, and sort it by the date column ascending (Sorted).
  • Use CHOOSECOLS to return the unique columns in an array (uData).
  • Use BYROW with TEXTJOIN to return the concatenated (joined) unique columns in a single-column array (jData).
  • Use XMATCH and UNIQUE to return an array of the unique row indexes (rIndexes).
  • Use INDEX on the sorted data with the row indexes to remove duplicates (rData).
  • Restore the initial order by sorting the data ascending by the stacked integer sequence column (1).
  • Remove the column and stack the data to the headers (Head).
VBasic2008
  • 44,888
  • 5
  • 17
  • 28