Best way to do batch insert ignore and return id in SQL Server?

Question

I need to deduplicate various data when I am importing them from different sources (Json files, other databases and REST APIs), first I load them into a single table, which defines the type for them and store the data as Json, so later when I run the batch processing, I can look up the type and insert the data to the suitable tables. The number of imported rows are different (each type goes to different table/tables), but always more than 1 million (all together ~10G of data if I place them in Json format in a single table using VARCHAR(MAX)).

As I mentioned, I need to handle duplicates, so I try to define unique indexes for the target tables and enable Ignore Duplicate Keys, which 'only' raises a warning when I insert existing data. The problem is, this only works in few cases. Most of the time, I need to work with 5+ varchar(255) fields, and I cannot add them to a unique index, because of the limit (900 byte, src).

Another thing I am struggling with, is during batch inserting, I need to insert relational data , meaning one table will have foreign keys to another. So first I need to handle the dependencies, and after I got their inserted Ids, using those I can insert the data. Like a product has a manufacturer, so first I insert all the manufacturer names in the current batch, then using those Ids I can insert the products.

The need for returning Ids and doing deduplication results in a query I would like to achieve:

Will run concurrently, by 8-16 threads
Should return the inserted Id
Should only insert data If it is not inserted by another thread before (or not inserted before at all)

First, I tried to handle this, by making stored procedures like this:

Try to select the data, If found, return the Id
If not found, start a transaction
Check again, if it already got inserted by another thread.
If not, insert and return the new Id.

Code example for this.:

CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT @NewId = [Id] FROM PDProductDetails
    WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
          ManufacturerId = @ManufacturerNameId AND
          ManufacturerReference = @ManufacturerReference;

    IF @NewId IS NULL
    BEGIN
        SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
        BEGIN TRANSACTION
            SELECT @NewId = [Id] FROM PDProductDetails
            WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
                  ManufacturerId = @ManufacturerNameId AND
                  ManufacturerReference = @ManufacturerReference;
            IF @NewId IS NULL
            BEGIN 
                INSERT INTO PDProductDetails (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
                VALUES(@GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage);
                SELECT @NewId = SCOPE_IDENTITY();
            END
        COMMIT TRANSACTION
    END

    SELECT @NewId;
END
GO

Multiple threads would call this and insert the Product details. However, using this I got deadlocked really fast. I changed to a different approach, using Merge:

CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
        MERGE
        INTO [dbo].[PDProductDetails] T
        USING (SELECT @GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage)
            AS Source (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference, PropertiesJson, OriginalContentPage)
        ON T.GDDataSourceVersionId = Source.GDDataSourceVersionId AND
            T.ManufacturerId = Source.ManufacturerNameId AND
            T.ManufacturerReference = Source.ManufacturerReference
        WHEN NOT MATCHED THEN
            INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
            VALUES(Source.GDDataSourceVersionId, Source.ManufacturerNameId, 
                Source.ManufacturerReference, Source.PropertiesJson, Source.OriginalContentPage);
    COMMIT TRANSACTION;

    SELECT @NewId = [Id] FROM PDProductDetails (NOLOCK)
    WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
          ManufacturerId = @ManufacturerNameId AND
          ManufacturerReference = @ManufacturerReference;

    SELECT @NewId;
END
GO

This always merges the row and selects later. It still deadlocks tough, not as fast as the other one, but still.

How can I achieve an insert ignore and return inserted id functionality, which won't deadlock in concurrent environment?

Please void asking multiple distinct questions. This is currently too broad IMO. IMO you should separate the two questions you have. — TT., Oct 12 '17 at 12:17
@TT. I think this is a single question. Asking how can I do insert ignores in a concurrent way, while also returning Ids. — szab.kel, Oct 12 '17 at 12:19
Well... after your first question, you go one with _Another thing I am struggling with_. That's two distinct issues. — TT., Oct 12 '17 at 12:20
@TT. I bet you did not even read the whole question and just stopped there right away. So I just won't answer to you anymore. I did provide a context to my question, that is all. — szab.kel, Oct 12 '17 at 12:21
In fact I did. The two issues are distinct, hence my comment about "Too broad". — TT., Oct 12 '17 at 12:22
MERGE with a table valued parameter and OUTPUT the before/after identifiers. — ta.speot.is, Oct 12 '17 at 12:30
@ta.speot.is Hm, I somehow missed you can do OUTPUT with MERGE. It seems to work. You should write this as an answer I guess. Also, is there any way to put the OUTPUT into an OUT variable? — szab.kel, Oct 12 '17 at 12:36
OUTPUT could work too, but I managed using this.: https://stackoverflow.com/a/16838803/840315 — szab.kel, Oct 12 '17 at 12:44
The point of using a table valued parameter is to insert multiple rows at once. Then you OUTPUT into another table valued parameter as the result or just straight up let the OUTPUT be a result set. — ta.speot.is, Oct 12 '17 at 13:25
@ta.speot.is I just cannot input table values into my stored procedures as parameters. Java JPA 2.1 do not support it. Maybe I could do it in JDBC https://stackoverflow.com/a/36618680/840315 but currently we only support JPA — szab.kel, Oct 12 '17 at 13:32

szab.kel · Accepted Answer · 2017-10-14T12:10:29.707

After @ta.speot.is mentioned you could do OUTPUT with merge, I searched for how to assign it to a variable and an answer mentioned it.

I used this stored procedure.:

CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
        MERGE
        INTO [dbo].[PDProductDetails] T
        USING (SELECT @GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage)
            AS Source (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference, PropertiesJson, OriginalContentPage)
        ON T.GDDataSourceVersionId = Source.GDDataSourceVersionId AND
            T.ManufacturerId = Source.ManufacturerNameId AND
            T.ManufacturerReference = Source.ManufacturerReference
        WHEN MATCHED THEN
            UPDATE SET @NewId = T.Id
        WHEN NOT MATCHED THEN
            INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
            VALUES(Source.GDDataSourceVersionId, Source.ManufacturerNameId, 
                Source.ManufacturerReference, Source.PropertiesJson, Source.OriginalContentPage);

    SET @NewId = ISNULL(@NewId, SCOPE_IDENTITY());

    COMMIT TRANSACTION;

    SELECT @NewId;
END
GO

Edit.: As @ta.speot.is mentioned, it would have been better, to make batch request using table-valued parameters, using the same approach (MERGE would use the table input as Source).

Best way to do batch insert ignore and return id in SQL Server?

1 Answers1