1

I have situation where in I want to find duplicate value where not all column data is same in SSIS.

I am aware of doing it using sort and aggregate method but i guess that can be done in case where all column data is same

ID             | Start Date | End Time | Queue Time | Talk Time|
===============|============|==========|============|==========|
33000017670    |9/4/2017    |9/4/2017  |0:00:10     |0:03:30   |
33000017672    |9/4/2017    |9/4/2017  |0:00:10     |0:03:30   |
33000017672    |9/4/2017    |9/4/2017  |0:00:12     |0:00:00   |
33000017673    |9/4/2017    |9/4/2017  |0:00:12     |0:05:00   |
33000017674    |9/4/2017    |9/4/2017  |0:00:12     |0:12:00   |
33000017675    |9/5/2017    |9/5/2017  |0:01:12     |0:00:00   |
33000017675    |9/5/2017    |9/5/2017  |0:01:12     |0:00:00   |

Here are couple case that I want to handle in SSIS

CASE 1

So as you can see here id 33000017672 is coming twice and it is a primary key in the table that I am loading this data. The source of it is Excel. I am aware of removing this record before loading but I want to remove that process.

Now here not all the record column data is same. I want to find such record remove id which have Talk Time as 0

Case 2

Also in case of record 33000017675, all the fields are same. So in this case I want to keep on record. Note: In this case there could be more than record with same data. So I want to keep just one out of that.

Can someone help me how I can do it in SSIS

  • How can it be a `PRIMARY KEY` if the data in the column is duplicated? Fix the source – Mazhar Feb 01 '18 at 15:39
  • I am loading the file from excel. Which has duplicate records. I can remove that manually but the number of records is huge so I want to do it in SSIS and load into SQL server table – Abdulquadir Shaikh Feb 01 '18 at 15:44
  • If your loading from Excel than one option would be to stage the data to a table. Then use an OLE DB source query connection and write a query that selects distinct to remove the duplicate rows and then use a ranking function to rank the rows with identical ID's. You can order the rank by queue time and choose where the rank = 1. – tonyd Feb 01 '18 at 16:30
  • You need to decide which record you want to throw out. ID 33000017672 has two different talk times and queue times. Which one is incorrect? – Nick.Mc Feb 02 '18 at 05:37
  • Nick.McDermaid - I want to keep the record which has some talk time value. – Abdulquadir Shaikh Feb 02 '18 at 23:06

3 Answers3

2

You can use the SORT transformation to eliminate duplicates. Just add every column as input and check the "Remove rows with duplicate sort values" option.

Will_C
  • 108
  • 5
  • For the duplicate IDs with different talk times, I would load the results after the SORT transformation into a stage table, then use that table as a source in a new data flow task and perform a LOOKUP transformation with the connection being a SQL query to identify IDs with a COUNT > 1. Send the Lookup match output to a table to check manually, and the Lookup no match output to your final table. – Will_C Feb 01 '18 at 17:10
0

If all you want is to remove duplicate records then a SORT transformation as recommended by Will_C will work.

If additionally you want to remove records with a talk time of 0, you can perform a conditional split on your talk_time column, directing records with a time greater than 0 to your table, and deleting the rest.

With that said, unless you are sure that the only time your primary key will be duplicated is with records that have a talk time of 0 which will be removed, you may very likely run into primary key constraints.

Andrew O'Brien
  • 1,793
  • 1
  • 12
  • 24
0

We don't any good reason to use SSIS for removing duplicate duplicate records unless it is interview question?

Reason being to perform any such compression ssis package need to use some blocking component like sort.

It would be better if first all excel data loaded in Database and remove duplicate rows there are number of ways to do .

sandeep rawat
  • 4,797
  • 1
  • 18
  • 36