SSIS - Find duplicate rows and keep the one with the highest value

Question

I have a flat file, it looks like this

┌───────┬──────────┬──────────┐
│ ITEM  │ QTY_ON_HAND LOCATION│
├───────┼──────────┼──────────┤
│ BOX   │        1 │ A        │
│ BOX   │        0 │ B        │
│ THING │        1 │ C        │
│ OTHER │        2 │ A        │
└───────┴──────────┴──────────┘

ITEM can have duplicates, I need to keep the one where QTY_ONHAND is highest (first BOX in example above), if both have the same value, then I would keep any occurrence.

Approaches:

Using Sort + Remove rows with duplicate sort values. Sort will delete random occurrences, I can't guarantee that the one with highest value will remain
Script Component. I can detect duplicates in a Script Component similar to this answer but I'd have to store all the rows in memory, compare each row against all the stored set of rows, keep the highest, remove the duplicates and somehow return only the valid rows, sounds very inefficient, however, there are not too many rows so I'll consider it (I'm still not sure how to achieve this).
Aggregate Transformation. I understand I can group the rows by ITEM and use the MAX operation to keep the one with highest value, this is the one I'm trying to do right now, but I'm stuck.

Sources:

SSIS: Flat File Source to SQL without Duplicate Rows

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/581df31a-5cd0-4802-a472-7ee85c6e8464/how-does-remove-duplicates-option-in-sort-transformation-behaves?forum=sqlintegrationservices

Remove duplicate in SSIS package with preference over a column data

Aggregate transformation should do the trick. What is your problem? — Ferdipux, Oct 23 '19 at 05:31
@Ferdipux Yes, it works with Aggregate Transformation, I was having trouble with one particular column, but it all worked at the end, thanks — The One, Oct 23 '19 at 20:44
If you MAX on LOCATION, you get location `B` for BOX, not `A` as required. — Nick.Mc, Oct 24 '19 at 10:29
Thanks, @Nick.McDermaid, but I'm using MAX on Quantity not on Location — The One, Oct 24 '19 at 11:38
Then you must be grouping by location which means you get two rows? Or does SSIS aggregate not work that way? — Nick.Mc, Oct 24 '19 at 12:28
@Nick.McDermaid, I'm not grouping by Location, I'm ignoring that one — The One, Oct 24 '19 at 12:39

The One · Answer 1 · 2019-10-24T13:27:07.773

1

The solution is straightforward, you just add an Aggregate Transformation

Group by all the other columns you know are similar for all the rows in the group, and apply the Maximum operation for the column you want to use to get the row with the highest value.

I'm not grouping by Location, nor any other column with values I know will be different for each row, I don't need Location further in the flow, but if I did, then ~~I suppose I could use more Aggregate Transformations~~ I don't know.

edited Oct 24 '19 at 13:27

answered Oct 23 '19 at 20:50

The One

4,560
5
36
52

this does work, but it requires grouping by everything being returned except the column we need the max for, which is ok as long as that's practical for the query being run. The join approach I posted doesn't require the grouping & I feel is slightly more robust as queries are being developed. – simon coleman Oct 24 '19 at 10:22
But this will return `BOX | 1 | B` for the sample dataset, not the required `BOX | 1 | A` – Nick.Mc Oct 24 '19 at 10:28
@Nick.McDermaid I'm using Max on QTY – The One Oct 24 '19 at 11:51
A single query doesn't produce the right answer, as aggregates are applied on a PER COLUMN basis. If Location is added to the Group By clause then the results contain separate entries for both locations: Box/1/A and Box/0/B. If the Location is omitted from the group by then the corect quantity is produced, but no information about which location is available. To get that data these results must be joined back to the original table based on Item and Qty. – simon coleman Oct 24 '19 at 12:59

simon coleman · Answer 2 · 2019-10-24T13:40:37.530

1

In SSIS to do this directly from a flat file I think is tricky, as an SSIS aggregate step will not produce the correct result on its own (either producing multiple rows or failing to identify the correct location)

However, it can can be done in a SQL db without much effort, using a partial left join and then filtering for the topmost row. SSIS can be used to import the data to a staging table first and run the query from there.

The SQL is:

select 
   i1.item,
   i1.quantity,
   i1.location 
from 
   item as i1 
   left outer join item as i2 on i1.item = i2.item and i1.quantity<i2.quantity
where 
   i2.quantity is null

This stage/query approach may be better if you subsequently reuse the data for further processing.

If there is a large amount of data it would also be worth considering whether indexing the data after load improves performance, particularly if the flat file itself is already sorted and that aspect of index build can be skipped.

There are a couple of other ways of deriving this in the db, but this one is fairly efficient and integrates well into more complex queries.

edited Oct 24 '19 at 13:40

answered Oct 24 '19 at 10:15

simon coleman

349
2
5

How do I apply this SQL Query to the flat file source in SSIS? – The One Oct 24 '19 at 11:40
To run queries you'll need the data inside a db, rather than just in the flat file - import to a staging table and query from there. I don't know how the aggregate grouping approach also suggested handles flat file data. If it is a re-implementation of the SQL Grouping feature capable of operating on the input stream then it *may* be more efficient than loading the data & then querying afterwards in the db, although the stage/query approach may be better if you subsequently reuse the data for further processing. – simon coleman Oct 24 '19 at 12:50
That's a valid approach, but the answer seems incomplete, if you move that comment to the answer I'll upvote it – The One Oct 24 '19 at 12:52

SSIS - Find duplicate rows and keep the one with the highest value

2 Answers2