0

I'm feeding data into a SQL Server database and about 1 out of every 1000 records is a duplicate due to matters outside my control. It's an exact duplicate - the entire record, the unique identifier -- everything.

I know this can be solved with an 'update' rather than an insert step ... or 'on error, update' instead of insert, perhaps.

But is there a quick and easy way to make SQL Server ignore these duplicates? I haven't made an index/unique constraint yet -- but if I did, I don't want a duplicate key value breaking or interrupting the ETL/data flow process. I just want SQL Server to keep executing the insert query. Is there a way to do this?

user45867
    @FirebladeDan Except `mysql` <> `sql-server`. – Aaron Bertrand Aug 19 '15 at 19:55
  • Yeah I'd change the syntax to the verbose equivalent in sql server --- I know the 'concept' is a possibility -- I just think it has a lot of overhead. – user45867 Aug 19 '15 at 20:00
  • https://stackoverflow.com/questions/1139050/how-to-ignore-duplicate-key-error-in-t-sql-sql-server – wero Aug 19 '15 at 20:05
  • Thanks .. looks like IGNORE_DUP_KEY is the quickest possible way (maybe) in terms of query time. Potentially dangerous if there are any grievous data errors down the road, but tempting. – user45867 Aug 19 '15 at 20:10

3 Answers

3

Just add a WHERE NOT EXISTS to the statement you're executing (written as an INSERT ... SELECT so the WHERE clause applies) -

INSERT INTO table (unique_identifier_column, other_column)
SELECT '123', 'blah' WHERE NOT EXISTS (SELECT 1 FROM table WHERE unique_identifier_column = '123')
James
  • This may be a possibility. I'm really looking for something with the least overhead possible, i.e., inserting about 1.5 million records and probably adding 50,000 daily --- but I'll try it out. – user45867 Aug 19 '15 at 19:54
1

Just to be clear for anyone else hitting this issue: for the best performance, with a slight chance of losing an insert, define a primary key (or unique index) on the table and create it with IGNORE_DUP_KEY = ON. Duplicate rows are then skipped with a warning instead of failing the statement.
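
A minimal sketch of that setup (table and column names are placeholders, not from the question; a unique index is used here, but the same option can be put on a primary key constraint):

-- The unique index is created WITH (IGNORE_DUP_KEY = ON), so a duplicate
-- key raises the warning "Duplicate key was ignored." and the offending
-- row is skipped; the rest of the batch keeps running.
CREATE TABLE dbo.IncomingRecords
(
    record_id INT NOT NULL,
    payload   VARCHAR(100) NULL
);

CREATE UNIQUE INDEX UX_IncomingRecords_record_id
    ON dbo.IncomingRecords (record_id)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO dbo.IncomingRecords (record_id, payload) VALUES (123, 'blah');
INSERT INTO dbo.IncomingRecords (record_id, payload) VALUES (123, 'blah');  -- skipped, not an error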

0

If the duplicates match on every field, just use the DISTINCT clause in your select:

Insert into DestinationTable
Select Distinct *
From SourceTable

EDIT: I misinterpreted your question. You're trying to find a low-impact way to avoid adding a record that already exists in your DestinationTable.

If you want your inserts to remain fast, one way to do it is to add an identity column to your table as the primary key. Let the duplicate records get added, then run a maintenance routine during downtime or slow periods that checks all records added since the last run and deletes any duplicates it finds. Otherwise, there is no easy way... you will have to check on every insert.
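
A sketch of what that cleanup pass might look like, assuming the surrogate identity column is called id and duplicates are identified by unique_identifier_column (both names are illustrative):

-- Keep the first copy of each duplicated key and delete the rest.
;WITH numbered AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY unique_identifier_column
               ORDER BY id          -- the identity column suggested above
           ) AS rn
    FROM dbo.DestinationTable
)
DELETE FROM numbered
WHERE rn > 1;

In practice the scan can be limited to rows added since the last run, as the answer suggests, to keep the pass cheap.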

Brian Pressler