
I am working on a file with thousands of lines and need to find the minimum corresponding value in one column based on duplicate values in another column.

| Brand  | Model      | Price |
|--------|------------|-------|
| Toyota | Tundra     | 20360 |
| Toyota | Corolla    | 28502 |
| Ford   | F-350      | 36555 |
| Ford   | F-150      | 28963 |
| Ford   | Expedition | 27585 |
| Tesla  | Model 3    | 25812 |
| Mazda  | Mazda3     | 22036 |
| Mazda  | Mazda6     | 25875 |
| Mazda  | CX9        | 58200 |

What I need is for the formula to look at duplicate values in column A, regardless of whether it's a single value or three duplicates, and then return the lowest cost.

Final result of what I'm looking for

| Brand  | Model      | Price |
|--------|------------|-------|
| Toyota | Tundra     | 20360 |
| Ford   | Expedition | 27585 |
| Tesla  | Model 3    | 25812 |
| Mazda  | Mazda3     | 22036 |
  • Does this answer your question? [Excel - extracting top 5 values](https://stackoverflow.com/questions/58639470/excel-extracting-top-5-values) – Solar Mike Apr 03 '23 at 17:24
  • Hmm. Perhaps sort by Brand and then Price descending. Then write a formula that indicates when the Brand is not equal to the row above it. Those are the ones you want to keep. – Steven Rumbalski Apr 03 '23 at 17:27

4 Answers


I guess it's worth adding an older-style COUNTIFS approach:

=FILTER(A2:C10,COUNTIFS(A2:A10,A2:A10,C2:C10,"<"&C2:C10)=0)
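
To see what the criterion does, it can help to evaluate the COUNTIFS part on its own. This is an illustrative helper of mine (assuming the sample data sits in A2:C10 as above), not part of the original answer; enter it in D2 and fill down:

=COUNTIFS($A$2:$A$10,A2,$C$2:$C$10,"<"&C2)

It counts how many same-brand rows have a strictly lower price, so a result of 0 marks the cheapest row of each brand, which is exactly what the FILTER keeps.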


Will it be slow if OP has 'thousands of lines'? Yes, probably.

Tom Sharpe
  • Definitely worth it. Other answers will be just as slow I suppose (if not slower). – JvdV Apr 04 '23 at 07:26
  • Tried it on 10K rows, mine takes a second or so, yours is instantaneous. – Tom Sharpe Apr 04 '23 at 07:57
  • Who would have thought, `LAMBDA()` keeps surprising me. Mostly slow, quick at times =) – JvdV Apr 04 '23 at 07:59
  • I am surprised too. I tested it for 10K rows as well: Tom (0.77s), JvdV (0.06s), and mine (21s). I am using a `LAMBDA()` too (`MAP`), but it gives the worst performance. – David Leal Apr 06 '23 at 13:47
  • Just out of curiosity, I tested this one: `=LET(in,A2:C10,A,TAKE(in,,1),C,TAKE(in,,-1), ux, UNIQUE(A),m, MINIFS(C,A,ux),FILTER(in,1-ISNA(XMATCH(A&"_"&C,ux&"_"&m))))`; it is the fastest one. I tested it for `100K` rows: JvdV (0.54s) and this solution (0.24s); the `LAMBDA` solution is incredibly fast anyway. Be aware of possible false positives due to concatenation, though not really an issue given the input data's string pattern. I didn't test the other solutions, because their performance was already lower for fewer rows. – David Leal Apr 06 '23 at 15:04

One option:


Formula in E1:

=LET(x,TOROW(UNIQUE(A2:A10)),FILTER(A2:C10,BYROW((A2:A10=x)*(C2:C10=MINIFS(C2:C10,A2:A10,x)),LAMBDA(z,SUM(z)))))
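
One way to read this (my expanded rewrite, not part of the original answer) is to name the intermediate steps with LET; the result is the same:

=LET(b,A2:A10,p,C2:C10,x,TOROW(UNIQUE(b)),
 hit,(b=x)*(p=MINIFS(p,b,x)),
 FILTER(A2:C10,BYROW(hit,LAMBDA(z,SUM(z)))))

(b=x) is a rows-by-brands matrix flagging which brand each row belongs to, and (p=MINIFS(p,b,x)) flags where a price equals a brand's minimum, so their product is 1 only where a row is the cheapest of its own brand; the BYROW sum collapses that matrix into the 0/1 include vector FILTER needs.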
JvdV

Cheapest Car By Make


Here's a REDUCE version. Hopefully, there aren't too many makes.
=LET(d,A2:C10,bc,1,pc,3,
    b,INDEX(d,,bc),u,UNIQUE(b),
DROP(REDUCE("",SEQUENCE(ROWS(u)),LAMBDA(rr,r,
    VSTACK(rr,TAKE(SORT(FILTER(d,b=INDEX(u,r),""),pc),1)))),1))

The Variables

  • d - the data (range)
  • bc - the make column index
  • pc - the price column index
  • b - the make column
  • u - the unique makes
  • rr - the current Reduce result
  • r - the current Reduce row

The Flow

  • For each make, the data is FILTERed by the UNIQUE make at the row INDEX supplied from the SEQUENCE function, SORTed, and only the first row is TAKEn.
  • Each row is VSTACKed to the bottom of the previous ones (the REDUCEd result), and finally the initial erroring row is DROPped.
VBasic2008
  • This was fast when tested on 10K rows (I tried a similar approach with the sort outside the reduce and the result was dismal, as we found previously). – Tom Sharpe Apr 04 '23 at 09:02

Here is another option. It is based on ranking by price within each group (brand), formula 1:

=LET(in,A2:C10,A,TAKE(in,,1),C,TAKE(in,,-1),
 r, MAP(A,C,LAMBDA(x,y,SUM((A=x)*(C<y))+1)),FILTER(in,r=1))

Note: For testing purposes, I intentionally changed the price of the Ford Expedition to match Mazda's lowest price and also interchanged Mazda3 with Tundra, to make sure the order of the brands is not relevant and that the formula ranks within each group properly.

For illustrative purposes, here is the output of r (rank), so we can see how it works:
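
For reference, evaluating r against the unmodified sample data from the question (my own illustration; the original screenshot used the altered data described above) gives:

| Brand  | Model      | Price | r |
|--------|------------|-------|---|
| Toyota | Tundra     | 20360 | 1 |
| Toyota | Corolla    | 28502 | 2 |
| Ford   | F-350      | 36555 | 3 |
| Ford   | F-150      | 28963 | 2 |
| Ford   | Expedition | 27585 | 1 |
| Tesla  | Model 3    | 25812 | 1 |
| Mazda  | Mazda3     | 22036 | 1 |
| Mazda  | Mazda6     | 25875 | 2 |
| Mazda  | CX9        | 58200 | 3 |

FILTER(in, r=1) then keeps exactly the four rows where r is 1.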

The good thing about this approach is that with minimal changes you can change the filter criteria. For example, to get the two lowest prices by brand, just filter on r<=2 instead of r=1, as sketched below.
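
A minimal sketch of that variant (my adaptation of formula 1, assuming the same A2:C10 layout):

=LET(in,A2:C10,A,TAKE(in,,1),C,TAKE(in,,-1),
 r, MAP(A,C,LAMBDA(x,y,SUM((A=x)*(C<y))+1)),FILTER(in,r<=2))

Only the final condition changes from r=1 to r<=2; brands with a single row still pass, since their only rank is 1.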

If you want to rank by maximum price instead, change (C<y) to (C>y), which is additional flexibility this approach has; see the sketch below.
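
A sketch of the maximum-price variant (again my adaptation, same assumed layout):

=LET(in,A2:C10,A,TAKE(in,,1),C,TAKE(in,,-1),
 r, MAP(A,C,LAMBDA(x,y,SUM((A=x)*(C>y))+1)),FILTER(in,r=1))

Here r=1 marks each brand's most expensive row instead.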

Performance

See the comments on VBasic2008's and Tom Sharpe's answers. For a large dataset of around 10K rows, the above solution (formula 1) shows lower performance (54 secs) compared to JvdV's and Tom's solutions (a fraction of a second), JvdV's being the fastest. I didn't consider VBasic2008's solution because we know from other questions/answers that REDUCE/VSTACK has a significant time cost. Then I found the following approach (formula 2):

=LET(in,A2:C10,A,TAKE(in,,1),C,TAKE(in,,-1), ux, UNIQUE(A),m, 
 MINIFS(C,A,ux),FILTER(in,1-ISNA(XMATCH(A&"_"&C,ux&"_"&m))))

which is easy to understand and very fast. Concatenation may produce false positives, which is why it is good practice to use a delimiter (_) that is not part of the dataset's character set, as illustrated below. Check JvdV's answer (comment section) to this question: Finding pairs of cells in two columns.
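
A quick illustration of that false-positive risk, using hypothetical values that are not from the question's data: without a delimiter, "Ford"&112 and "Ford1"&12 both produce the key "Ford112", so XMATCH could match the wrong pair; with the delimiter, "Ford"&"_"&112 gives "Ford_112" while "Ford1"&"_"&12 gives "Ford1_12", which stay distinct.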

Later I found it is possible to avoid concatenation as follows, with a performance result similar to JvdV's solution (but not better than the XMATCH approach), formula 3:

=LET(in,A2:C10,A,TAKE(in,,1),C,TAKE(in,,-1),ux,UNIQUE(A),m,MINIFS(C,A,ux),
 FILTER(in,MMULT(N(A=TOROW(ux))*N(C=TOROW(m)),SEQUENCE(ROWS(m),,,0))=1))

Note: If you look carefully at this approach, it ends up being JvdV's approach, but using MMULT instead of BYROW, which may explain why both have similar performance. I just added the comparison =1, but it is not really necessary.

Here is the testing result I got for 100K rows:

| formula 2 (XMATCH) | JvdV (BYROW) | formula 3 (MMULT) |
|--------------------|--------------|-------------------|
| 0.24 secs          | 0.54 secs    | 0.53 secs         |

It is worth mentioning that even using BYROW, a LAMBDA function that is usually more time-consuming, the time is only about double that of the XMATCH approach, so it is still really fast.

David Leal
  • This is also a good solution, but I couldn't make it work and had to use `Thunks`, since I have two different datasets in different sheets. – Mayukh Bhattacharya Jun 26 '23 at 11:48
  • @MayukhBhattacharya share your Excel file so I can take a look at it. Thanks – David Leal Jun 26 '23 at 12:26
  • @MayukhBhattacharya I added my solution adapted to your sample data in `F12`. I don't see any difference from the solution in `F2`, other than different sorting (I don't change the order of the input data) and the empty price, which is treated as zero and thus lower, for Paracetamol Tablet. – David Leal Jun 26 '23 at 13:10
  • Alright, checking now. – Mayukh Bhattacharya Jun 26 '23 at 13:11