
Which is the best option to implement a distinct operation in SSIS? I have a table with more than 200 columns and more than 10 million rows, and I need to get the distinct rows from this table. Is it wise to use an Execute SQL Task (with a SELECT query to deduplicate the rows), or is there another way to achieve this in SSIS?

I do understand that the SSIS Sort component can deduplicate rows, but it is a blocking component, so it is not at all a good idea to use. Please let me know your views on this.

user1254579
    Why exactly is the Sort component not useful? As far as I know, this is the only transformation which allows you to remove the duplicates. Or you could use a SQL-like command. – DNac Aug 16 '13 at 12:43
  • Are you trying to retrieve distinct rows or to remove duplicates from the original table? – Gordon Linoff Aug 16 '13 at 12:54
  • The Sort component sorts before it does the deduplication, so my question is whether to use Sort or to use SQL; which one is better? – user1254579 Aug 16 '13 at 12:55
  • trying to remove the duplicates – user1254579 Aug 16 '13 at 12:56
  • My math is a bit fuzzy, how does 10 million rows turn into a billion rows – billinkc Aug 16 '13 at 13:13
  • If the sorting is the problem, then you should use (assuming your source is a DB) "SQL Command" in the Data Access Mode specification. Select distinct your data and that's it... you may also save a bit of time as the ETL won't have to go through the Sort component. – DNac Aug 16 '13 at 13:15

3 Answers


I have done it in 3 steps this way:

  1. Dump the MillionRow table into HashDump table, which has only 2 columns: Id int identity PK, and Hash varbinary(20). This table shall be indexed on its Hash column.
  2. Dump the HashDump table into HashUni, ordered by the Hash column. In between is a Script Component that checks whether the current row's Hash column value is the same as the previous row's. If it is the same, direct the row to the Duplicate output, else to Unique. This way you can log the duplicates even if what you need is just the unique rows.
  3. Dump the MillionRow table into MillionUni table. In between would be a Lookup Component that uses HashUni to tell which row is Unique.

This method allows me to log each duplicate with a message such as: "Row 1000 is a duplicate of row 100".

I have not found a better way than this. Earlier, I put a unique index on MillionUni and dumped MillionRow directly into it, but then I was not able to use "fast load", which made it way too slow.
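The hash-and-compare idea behind steps 1 and 2 can be sketched outside SSIS as well. The following is a minimal Python illustration (not the author's actual package; `row_hash` and `split_unique_duplicate` are hypothetical names): hash every row, order by hash so equal rows become adjacent, then compare each hash with the previous one to route rows.

```python
import hashlib

def row_hash(row):
    # Join all column values with a separator unlikely to appear in the
    # data, turning NULLs into empty strings, then SHA-1 the result.
    joined = "|".join("" if v is None else str(v) for v in row)
    return hashlib.sha1(joined.encode("utf-8")).digest()

def split_unique_duplicate(rows):
    # Order (row number, row) pairs by hash so duplicates are adjacent,
    # then route each row by comparing its hash with the previous one.
    hashed = sorted(enumerate(rows, start=1), key=lambda t: row_hash(t[1]))
    unique, duplicate = [], []
    prev_hash, prev_no = None, None
    for row_no, row in hashed:
        h = row_hash(row)
        if h == prev_hash:
            # Record "row X is a duplicate of row Y" style information.
            duplicate.append((row_no, prev_no))
        else:
            unique.append(row_no)
            prev_no = row_no
        prev_hash = h
    return unique, duplicate

unique, duplicate = split_unique_duplicate([("a", 1), ("b", 2), ("a", 1)])
```

Here row 3 would be reported as a duplicate of row 1, while rows 1 and 2 flow to the Unique output.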

Here is one way to populate the Hash column:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Requires "using System.Text;" and "using System.Security.Cryptography;"
    // at the top of the script.
    StringBuilder sb = new StringBuilder();
    sb.Append(Row.Col1String_IsNull ? "" : Row.Col1String); sb.Append("|");
    sb.Append(Row.Col2Num_IsNull ? "" : Row.Col2Num.ToString()); sb.Append("|");
    sb.Append(Row.Col3Date_IsNull ? "" : Row.Col3Date.ToString("yyyy-MM-dd"));
    var sha1Provider = HashAlgorithm.Create("SHA1");
    Row.Hash = sha1Provider.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
}

If 200 columns prove to be a chore for you, part of this article shall inspire you: it loops over the values of all the column objects to build a single string.
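The idea of that loop, expressed in Python rather than hand-writing 200 `Append` calls (a hypothetical sketch, with the row modeled as a dict of column name to value):

```python
def concat_columns(row: dict) -> str:
    # Build one delimited string from every column generically; sorting
    # the keys keeps the column order stable, and NULLs become "".
    return "|".join("" if row[k] is None else str(row[k])
                    for k in sorted(row))

print(concat_columns({"b": 2, "a": None}))  # |2
```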

And to compare the Hash, use this method:

// Requires "using System.Collections;" for StructuralComparisons.
byte[] previousHash;
int previousRowNo;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Rows arrive ordered by Hash, so a duplicate is always adjacent
    // to the unique row it duplicates.
    if (StructuralComparisons.StructuralEqualityComparer.Equals(Row.Hash, previousHash))
    {
        Row.DupRowNo = previousRowNo;
        Row.DirectRowToDuplicate();
    }
    else
    {
        Row.DirectRowToUnique();
        // Only remember the row number of unique rows, so that a run of
        // duplicates all point back at the first (unique) occurrence.
        previousRowNo = Row.RowNo;
    }
    previousHash = Row.Hash;
}
Irawan Soetomo

I wouldn't bother with SSIS for it; a couple of queries will do. Also, you have a lot of data, so I suggest you check the execution plan before running the queries and optimize your indexes.

http://www.brijrajsingh.com/2011/03/delete-duplicate-record-but-keep.html

Check out a small article I wrote on the same topic.
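One common query-only approach to "delete duplicates but keep one" (I have not reproduced the linked article's exact T-SQL; this is a sketch of the idea) is to keep the row with the smallest surrogate key per duplicate group and delete the rest. Here is a small sqlite3 demo, where `rowid` stands in for the identity column or `ROW_NUMBER()` you would use in SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 1), ("b", 2), ("a", 1), ("a", 1)])

# Keep the first occurrence of each (col1, col2) group, delete the rest.
conn.execute("""
    DELETE FROM t
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM t GROUP BY col1, col2
    )
""")
rows = conn.execute("SELECT col1, col2 FROM t ORDER BY col1").fetchall()
print(rows)  # [('a', 1), ('b', 2)]
```

With 200 columns the `GROUP BY` list gets long, which is exactly why checking the execution plan and the supporting indexes matters before running this on 10 million rows.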

Brij Raj Singh - MSFT
  • 4,903
  • 7
  • 36
  • 55

As far as I know, the Sort component is the only transformation which allows you to remove the duplicates. Or you could use a SQL-like command.
If the sorting operation is a problem, then you should use (assuming your source is a DB) "SQL Command" in the Data Access Mode specification. Select distinct your data and that's it... you may also save a bit of time as the ETL won't have to go through the Sort component.
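The "SQL Command" access mode simply pushes the deduplication down to the source database. As a quick illustration of what that query does (sqlite3 standing in here for the actual source DB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (col1 TEXT, col2 INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [("a", 1), ("a", 1), ("b", 2)])

# SELECT DISTINCT returns one copy of each row, so the data flow
# never needs a blocking Sort component at all.
distinct = conn.execute(
    "SELECT DISTINCT col1, col2 FROM src ORDER BY col1").fetchall()
print(distinct)  # [('a', 1), ('b', 2)]
```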


DNac