
I have a SQL Server table with three columns:

Table1

col1 int
col2 int
col3 string

I have a unique constraint defined for all three columns (col1, col2, col3)

Now, I have a .csv file from which I want to add records to this table, and the .csv file can contain duplicate records.

I have looked at various options for avoiding duplicates in the above scenario. Below are the three options that are working well for me. Please have a look and share your thoughts on the pros and cons of each method.

Option #1:

Avoiding duplicates in the first place, i.e. while adding objects to the collection from the CSV file. I have used a HashSet<T> for this and overridden the following methods on type T:

public override int GetHashCode()
{
    return col1.GetHashCode() + col2.GetHashCode() + col3.GetHashCode();
}

public override bool Equals(object obj)
{
    var other = obj as T;
    if (other == null)
    {
        return false;
    }
    return col1 == other.col1
        && col2 == other.col2
        && col3 == other.col3;
}
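For context, here is a minimal sketch of how this option can be wired up while reading the file. The CsvRecord name, the file path handling, and the naive comma split are illustrative assumptions, not my actual code:

using System.Collections.Generic;
using System.IO;

// Hypothetical record type for the three columns; Equals and GetHashCode are
// overridden (as above) so HashSet<T> can detect duplicates by value.
public class CsvRecord
{
    public int col1 { get; set; }
    public int col2 { get; set; }
    public string col3 { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as CsvRecord;
        return other != null
            && col1 == other.col1
            && col2 == other.col2
            && col3 == other.col3;
    }

    public override int GetHashCode()
    {
        return col1.GetHashCode() + col2.GetHashCode() + (col3 ?? "").GetHashCode();
    }
}

public static class CsvImportSketch
{
    // HashSet<T>.Add returns false for an already-seen value, so duplicate
    // CSV rows are silently dropped while the file is being read.
    public static HashSet<CsvRecord> Load(string path)
    {
        var records = new HashSet<CsvRecord>();
        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(',');
            records.Add(new CsvRecord
            {
                col1 = int.Parse(parts[0]),
                col2 = int.Parse(parts[1]),
                col3 = parts[2]
            });
        }
        return records;
    }
}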

Option #2:

Having a List<T> instead of a HashSet<T>.

Removing duplicates after all the objects are added to the List<T>:

    List<T> distinctObjects = allObjects
        .GroupBy(x => new {x.col1, x.col2, x.col3})
        .Select(x => x.First()).ToList();

Option #3:

Removing duplicates after all the objects are added to a DataTable:

public static DataTable RemoveDuplicatesRows(DataTable dataTable)
{
    IEnumerable<DataRow> uniqueRows = dataTable.AsEnumerable().Distinct(DataRowComparer.Default);
    DataTable dataTable2 = uniqueRows.CopyToDataTable();
    return dataTable2;
}
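If I go this route, the de-duplicated DataTable would then be pushed to SQL Server, typically with SqlBulkCopy. A rough sketch, assuming a connection string and the Table1 layout from above:

using System.Data;
using System.Data.SqlClient;
using System.Linq;

public static class DataTableImportSketch
{
    // Bulk inserts only the distinct rows into dbo.Table1.
    // connectionString is an assumption for illustration.
    public static void LoadUnique(DataTable allRows, string connectionString)
    {
        // Same de-duplication as RemoveDuplicatesRows above.
        DataTable uniqueRows = allRows.AsEnumerable()
                                      .Distinct(DataRowComparer.Default)
                                      .CopyToDataTable();

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.Table1";
            bulkCopy.WriteToServer(uniqueRows);
        }
    }
}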

Although I have not compared their running times, I prefer Option #1 because it removes duplicates as the very first step, so I move ahead with only what is required.

Please share your views so I can choose the best one.

Thanks a lot!

inutan
  • Wouldn't it be more efficient to bulk insert your data from the CSV into a temporary table with no unique constraints, use SQL to remove the duplicates, and then move the de-duplicated data to your actual final table? – G-Man Oct 28 '13 at 17:59
  • If you go with option 1 I might change your `GetHashCode` to multiply the properties by prime numbers as in this [answer](http://stackoverflow.com/a/371348/2145211) – Harrison Oct 28 '13 at 17:59
  • Scrap Option 1 - using hash codes does not guarantee uniqueness. – D Stanley Oct 28 '13 at 18:07
  • @DStanley the hash code itself doesn't guarantee uniqueness, but `HashSet`, used as described, does exactly what he's trying to do. – Tim S. Oct 28 '13 at 18:09
  • @iniki have you considered using SSIS? On the data flow task you can redirect the error rows to some output or ignore them. That way only the first unique row would be inserted. – Josh Jay Oct 28 '13 at 20:15

2 Answers


I like option 1: the HashSet<T> provides a fast way of avoiding duplicates before ever sending them to the DB. You should implement a better GetHashCode, e.g. using Skeet's implementation from "What is the best algorithm for an overridden System.Object.GetHashCode?"
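For example, a rough sketch of such a hash for the three columns (the 17/31 seed and multiplier follow the usual convention from that answer; adjust as you see fit):

public override int GetHashCode()
{
    unchecked // overflow is fine and expected for hash codes
    {
        int hash = 17;
        hash = hash * 31 + col1.GetHashCode();
        hash = hash * 31 + col2.GetHashCode();
        hash = hash * 31 + (col3 ?? string.Empty).GetHashCode();
        return hash;
    }
}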

But there's a problem: what if the table already contains data that can duplicate rows from your CSV? You'd have to copy the whole table down first for a simple HashSet to really work. You could do just that, but to avoid it, I might pair option 1 with a temporary table and an insert statement like the one from "Skip-over/ignore duplicate rows on insert":

INSERT dbo.Table1 (col1, col2, col3)
SELECT col1, col2, col3
FROM dbo.tmp_holding_Table1 AS t
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Table1 AS d
                  WHERE d.col1 = t.col1
                    AND d.col2 = t.col2
                    AND d.col3 = t.col3);

With this combination, the volume of data transferred to/from your DB is minimized.
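As a rough sketch of the round trip from C# (the connection string, the dbo.tmp_holding_Table1 name, and the final TRUNCATE are assumptions for illustration):

using System.Data;
using System.Data.SqlClient;

public static class StagingLoaderSketch
{
    public static void Load(DataTable rows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // 1) Bulk copy the (already de-duplicated) CSV rows into the holding table.
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.tmp_holding_Table1";
                bulkCopy.WriteToServer(rows);
            }

            // 2) Insert only rows not already in the target, then empty the holding table.
            const string sql = @"
INSERT dbo.Table1 (col1, col2, col3)
SELECT col1, col2, col3
FROM dbo.tmp_holding_Table1 AS t
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Table1 AS d
                  WHERE d.col1 = t.col1
                    AND d.col2 = t.col2
                    AND d.col3 = t.col3);
TRUNCATE TABLE dbo.tmp_holding_Table1;";

            using (var command = new SqlCommand(sql, connection))
            {
                command.ExecuteNonQuery();
            }
        }
    }
}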

Tim S.

Another solution could be the IGNORE_DUP_KEY = { ON | OFF } option when creating or rebuilding an index. This approach prevents errors when inserting duplicate rows; instead, SQL Server generates the warning: Duplicate key was ignored.

CREATE TABLE dbo.MyTable (Col1 INT, Col2 INT, Col3 INT);
GO

CREATE UNIQUE INDEX IUN_MyTable_Col1_Col2_Col3 
ON dbo.MyTable (Col1,Col2,Col3)
WITH (IGNORE_DUP_KEY = ON);
GO

INSERT dbo.MyTable (Col1,Col2,Col3)
VALUES (1,11,111);
INSERT dbo.MyTable (Col1,Col2,Col3)
SELECT 1,11,111 UNION ALL 
SELECT 2,22,222 UNION ALL 
SELECT 3,33,333;
INSERT dbo.MyTable (Col1,Col2,Col3)
SELECT 2,22,222 UNION ALL 
SELECT 3,33,333;
GO
/*
(1 row(s) affected)

(2 row(s) affected)
Duplicate key was ignored.
*/


SELECT * FROM dbo.MyTable;
/*
Col1        Col2        Col3
----------- ----------- -----------
1           11          111
2           22          222
3           33          333
*/

Note: because you have a UNIQUE constraint, if you try to change the index options with ALTER INDEX

ALTER INDEX IUN_MyTable_Col1_Col2_Col3
ON dbo.MyTable 
REBUILD WITH (IGNORE_DUP_KEY = ON)

you will get the following error:

Msg 1979, Level 16, State 1, Line 1
Cannot use index option ignore_dup_key to alter index 'IUN_MyTable_Col1_Col2_Col3' as it enforces a primary or unique constraint.

So, if you choose this solution, the options are:

1) Create the new UNIQUE index (WITH IGNORE_DUP_KEY = ON) first and then drop the UNIQUE constraint (this option requires more storage space while both exist, but a unique index/constraint stays active the whole time), or

2) Drop the UNIQUE constraint first and then create a UNIQUE index WITH (IGNORE_DUP_KEY = ON) (I wouldn't recommend this last option).

Bogdan Sahlean