31

I have the following tables:

DataValue

DateStamp    ItemId   Value
----------   ------   -----
2012-05-22   1        6541
2012-05-22   2        12321
2012-05-21   3        32

tmp_holding_DataValue

DateStamp    ItemId   Value
----------   ------   -----
2012-05-22   1        6541
2012-05-22   4        87
2012-05-21   5        234

DateStamp and ItemId are the primary key columns.

I'm doing an insert which runs periodically throughout the day (in a stored procedure):

insert into DataValue(DateStamp, ItemId, Value)
select DateStamp, ItemId, Value from tmp_holding_DataValue;

This moves data from the holding table (tmp_holding_DataValue) across into the main data table (DataValue). The holding table is then truncated.
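For reference, the whole procedure is essentially the following; a minimal sketch, with the procedure name usp_MoveHoldingData made up for illustration, and a transaction added so the truncate only happens if the insert succeeds:

CREATE PROCEDURE dbo.usp_MoveHoldingData  -- hypothetical name
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Move everything from the holding table into the main table
    INSERT INTO DataValue(DateStamp, ItemId, Value)
    SELECT DateStamp, ItemId, Value FROM tmp_holding_DataValue;

    -- Clear the holding table for the next load
    TRUNCATE TABLE tmp_holding_DataValue;

    COMMIT TRANSACTION;
END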

The problem is that, as in the example, the holding table can contain items which already exist in the main table. Since the key does not allow duplicate values, the procedure fails.

One option would be to put a WHERE clause on the INSERT in the proc, but the main data table has 10 million+ rows, and this could take a long time.

Is there any other way to get the procedure to just skip-over/ignore the duplicates as it tries to insert?

  • What if the `Value` column in the holding table is different, e.g. for the first row it is `3253` instead of `6541`? Is that still a duplicate? If not, is it something you want to update (e.g. add `6541 + 3253` in the source table) or simply replace? – Aaron Bertrand May 22 '12 at 14:12
  • The value column doesn't matter; if it's different it is ignored. What is in DataValue already for that DateStamp should be left as-is. – finoutlook May 22 '12 at 14:14
  • It's also very useful to tag your question with the minimum *version* of SQL Server you need to support. I didn't offer a `MERGE` solution because initially I had absolutely no idea what version you were using. – Aaron Bertrand May 22 '12 at 14:34
  • @AaronBertrand - Yeah I should have mentioned I'm using 2008 (soon moving to 2012). Would you go with merge over the where in that case? – finoutlook May 22 '12 at 14:49
  • I'm not sure; I still find the `MERGE` syntax daunting, and I am hesitant to recommend it in general. I'm not sure if all of the MERGE bugs have been fixed (see a [list referenced by Alex K in his answer to this 2012 question](http://dba.stackexchange.com/questions/14730/what-are-objective-business-reasons-to-prefer-sql-server-2012-over-2008-r2)). If your key goal is performance, then it's up to you to test them and make sure that (a) they do the right thing and (b) you choose the one that performs best in your environment. We can't predict those answers... – Aaron Bertrand May 22 '12 at 14:53

4 Answers

32
INSERT dbo.DataValue(DateStamp, ItemId, Value)
SELECT DateStamp, ItemId, Value 
FROM dbo.tmp_holding_DataValue AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.DataValue AS d
                  WHERE d.DateStamp = t.DateStamp
                    AND d.ItemId = t.ItemId);
Aaron Bertrand
  • This would work, but I was wondering if there was anything quicker in case the DataValue table eventually ends up with 100 million rows – finoutlook May 22 '12 at 14:18
  • If the primary key is clustered and the holding table has an equivalent index then it shouldn't be a problem (or at least no more of a problem than any other solution that checks for duplicates). Does the holding table ever have "old" data, or are you always appending relatively new data? You could add WHERE clauses that limit the date to something reasonable, like two days ago, and if `DateStamp` is the leading column in the primary key this should help out a little bit. But only if you always have new data in the holding table. – Aaron Bertrand May 22 '12 at 14:22
  • Thanks, I went with this solution – finoutlook May 29 '12 at 09:21
  • @finoutlook If I/O becomes an issue or you want to limit contention on the table, you can run it in batches by using something like `SELECT TOP 10000` (see the sketch below). Since each iteration will insert records that disqualify themselves in later executions, this will allow you to limit how much of an effect you have on your server. You shouldn't need to bother with an `ORDER BY` if your goal is to eventually have everything moved, since you don't care whether the SELECT is deterministic anymore. – Bacon Bits Jun 24 '15 at 18:12
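
A minimal sketch of that batching idea, reusing the NOT EXISTS filter from this answer (the batch size of 10000 comes from the comment above; everything else is illustrative and worth tuning for your environment):

DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Each pass moves up to 10000 not-yet-present rows; rows inserted in
    -- earlier passes disqualify themselves via NOT EXISTS in later ones.
    INSERT dbo.DataValue(DateStamp, ItemId, Value)
    SELECT TOP (10000) t.DateStamp, t.ItemId, t.Value
    FROM dbo.tmp_holding_DataValue AS t
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DataValue AS d
                      WHERE d.DateStamp = t.DateStamp
                        AND d.ItemId = t.ItemId);

    SET @rows = @@ROWCOUNT;  -- 0 once everything has been moved
END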
23

You could set the PK to Ignore Duplicate Key = Yes. Then it will just give a "Duplicate key was ignored" warning and continue. I am not guessing; I tested this.

What I found is that I cannot make this change in SSMS directly; I had to drop and recreate the index via script. You can right-click the index, script it as a drop and re-create, and then just change Ignore Duplicate Key to Yes in the script. For me SSMS did not immediately show the change.

USE [test]
GO

IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[PKallowDup]') AND name = N'PK_PKallowDup')
    ALTER TABLE [dbo].[PKallowDup] DROP CONSTRAINT [PK_PKallowDup]
GO

ALTER TABLE [dbo].[PKallowDup] ADD CONSTRAINT [PK_PKallowDup] PRIMARY KEY CLUSTERED 
(
    [PK] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = ON, IGNORE_DUP_KEY = ON, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Or I think you could use an outer join

INSERT dbo.DataValue(DateStamp, ItemId, Value)
SELECT t.DateStamp, t.ItemId, t.Value 
  FROM dbo.tmp_holding_DataValue AS t 
  LEFT JOIN dbo.DataValue AS d
    ON d.DateStamp = t.DateStamp
   AND d.ItemId = t.ItemId
 WHERE d.DateStamp IS NULL 
   AND d.ItemId IS NULL
paparazzo
  • I saw this suggested elsewhere but wanted to keep the primary key as it was. It's quite critical that there are no duplicates in the final `DataValue` table. – finoutlook May 22 '12 at 14:48
  • Why is this tricky? `WITH (IGNORE_DUP_KEY = ON);` Also @finoutlook did you try this option on a simple table? It's still a primary key, and duplicates still aren't allowed. The `IGNORE_DUP_KEY` setting merely controls how SQL Server handles key violations (with an exception or with a simple status message that says `Duplicate key was ignored.`). – Aaron Bertrand May 22 '12 at 14:51
  • There is still a PK and it is enforced. The difference is that a PK violation is just a warning and it continues inserting rows when Ignore Duplicate Key = Yes. – paparazzo May 22 '12 at 15:12
  • @AaronBertrand `ALTER INDEX [PK_PKallowDup] ON [dbo].[PKallowDup] REBUILD WITH (IGNORE_DUP_KEY = ON);` fails for me. Is that syntax correct? I said I tested. I can create a table with that option, or I can drop and create on an empty table. I did not think the OP wanted to delete his table rows. – paparazzo May 22 '12 at 15:42
  • You can't do this with ALTER INDEX, but you can certainly drop and re-create the constraint without "deleting his table rows." While this may be expensive, it's a one-time cost as opposed to paying for that duplicate check on every single insert. In other words I do like your answer I just don't think it's as tricky as your answer implies. – Aaron Bertrand May 22 '12 at 15:52
  • @AaronBertrand OK, you are correct. A drop and recreate will alter the index. SSMS does not immediately reflect the change even with a refresh, but the change does happen. If I close SSMS and reopen it, then I see the change. – paparazzo May 22 '12 at 16:14
  • For future reference, if the PK is the clustered key you can do it with `ALTER TABLE dbo.PKallowDup REBUILD WITH (IGNORE_DUP_KEY = ON)` (see the sketch below). It's strange that it doesn't let you specify the PK name, but oh well :-) – Cody Konior Jun 14 '16 at 02:41
  • @CodyKonior +1! Just ran into the same situation and your answer works...this should be the accepted answer, IMO – MattE Mar 25 '19 at 15:42
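
A minimal sketch of the rebuild approach from Cody Konior's comment above, using the table from this answer and assuming the PK is the clustered index:

-- Flip IGNORE_DUP_KEY on without dropping and re-creating the constraint
ALTER TABLE dbo.PKallowDup REBUILD WITH (IGNORE_DUP_KEY = ON);

-- A duplicate now produces only the "Duplicate key was ignored." warning,
-- and any non-duplicate rows in the same statement are still inserted.
INSERT dbo.PKallowDup (PK) VALUES (1);
INSERT dbo.PKallowDup (PK) VALUES (1);  -- warning, not an error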
17

In SQL Server 2008+:

MERGE
INTO    dataValue dv
USING   tmp_holding_DataValue t
ON      t.dateStamp = dv.dateStamp
        AND t.itemId = dv.itemId
WHEN NOT MATCHED THEN
INSERT  (dateStamp, itemId, value)
VALUES  (t.dateStamp, t.itemId, t.value)
/*
WHEN MATCHED THEN
UPDATE SET
        value = t.value
*/
-- Uncomment the block above to overwrite duplicates rather than ignore them
;
Quassnoi
  • I did think of using a merge, but with DataValue being 10m+ rows, and tmp_holding_DataValue being around 2m rows, I thought it would take a long time, since it would be checking all data back to the start of time in that table... – finoutlook May 22 '12 at 14:16
  • @finoutlook: in other words, you optimized prematurely? Just try it. – Quassnoi May 22 '12 at 14:16
  • I always plan for the worst and hope for the best..! I'll give it a shot – finoutlook May 22 '12 at 14:24
1

I ran into a similar requirement that ended up throwing the same duplicate key error. The idea is to group on the primary key columns while aggregating the remaining columns, so the SELECT returns only one row per key:

INSERT INTO DataValue(DateStamp, ItemId, Value)
SELECT DISTINCT DateStamp, ItemId, MAX(Value) AS Value
FROM tmp_holding_DataValue
GROUP BY DateStamp, ItemId

In fact, the goal could be accomplished without DISTINCT as well, since the aggregate function MAX will pick a single value for each group.
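
Note that this only de-duplicates rows within the holding table itself; a (DateStamp, ItemId) pair that already exists in DataValue would still violate the key. A sketch combining it with the NOT EXISTS check from the accepted answer:

INSERT INTO DataValue(DateStamp, ItemId, Value)
SELECT t.DateStamp, t.ItemId, MAX(t.Value) AS Value
FROM tmp_holding_DataValue AS t
WHERE NOT EXISTS (SELECT 1 FROM DataValue AS d
                  WHERE d.DateStamp = t.DateStamp
                    AND d.ItemId = t.ItemId)
GROUP BY t.DateStamp, t.ItemId;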

usefulBee