0

I need to deduplicate various data when I am importing them from different sources (Json files, other databases and REST APIs), first I load them into a single table, which defines the type for them and store the data as Json, so later when I run the batch processing, I can look up the type and insert the data to the suitable tables. The number of imported rows are different (each type goes to different table/tables), but always more than 1 million (all together ~10G of data if I place them in Json format in a single table using VARCHAR(MAX)).

As I mentioned, I need to handle duplicates, so I try to define unique indexes for the target tables and enable Ignore Duplicate Keys, which 'only' raises a warning when I insert existing data. The problem is, this only works in few cases. Most of the time, I need to work with 5+ varchar(255) fields, and I cannot add them to a unique index, because of the limit (900 byte, src).

Another thing I am struggling with, is during batch inserting, I need to insert relational data , meaning one table will have foreign keys to another. So first I need to handle the dependencies, and after I got their inserted Ids, using those I can insert the data. Like a product has a manufacturer, so first I insert all the manufacturer names in the current batch, then using those Ids I can insert the products.

The need for returning Ids and doing deduplication results in a query I would like to achieve:

  • Will run concurrently, by 8-16 threads
  • Should return the inserted Id
  • Should only insert data If it is not inserted by another thread before (or not inserted before at all)

First, I tried to handle this, by making stored procedures like this:

  1. Try to select the data, If found, return the Id
  2. If not found, start a transaction
  3. Check again, if it already got inserted by another thread.
  4. If not, insert and return the new Id.

Code example for this.:

CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT @NewId = [Id] FROM PDProductDetails
    WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
          ManufacturerId = @ManufacturerNameId AND
          ManufacturerReference = @ManufacturerReference;

    IF @NewId IS NULL
    BEGIN
        SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
        BEGIN TRANSACTION
            SELECT @NewId = [Id] FROM PDProductDetails
            WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
                  ManufacturerId = @ManufacturerNameId AND
                  ManufacturerReference = @ManufacturerReference;
            IF @NewId IS NULL
            BEGIN 
                INSERT INTO PDProductDetails (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
                VALUES(@GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage);
                SELECT @NewId = SCOPE_IDENTITY();
            END
        COMMIT TRANSACTION
    END

    SELECT @NewId;
END
GO

Multiple threads would call this and insert the Product details. However, using this I got deadlocked really fast. I changed to a different approach, using Merge:

CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
        MERGE
        INTO [dbo].[PDProductDetails] T
        USING (SELECT @GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage)
            AS Source (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference, PropertiesJson, OriginalContentPage)
        ON T.GDDataSourceVersionId = Source.GDDataSourceVersionId AND
            T.ManufacturerId = Source.ManufacturerNameId AND
            T.ManufacturerReference = Source.ManufacturerReference
        WHEN NOT MATCHED THEN
            INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
            VALUES(Source.GDDataSourceVersionId, Source.ManufacturerNameId, 
                Source.ManufacturerReference, Source.PropertiesJson, Source.OriginalContentPage);
    COMMIT TRANSACTION;

    SELECT @NewId = [Id] FROM PDProductDetails (NOLOCK)
    WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
          ManufacturerId = @ManufacturerNameId AND
          ManufacturerReference = @ManufacturerReference;

    SELECT @NewId;
END
GO

This always merges the row and selects later. It still deadlocks tough, not as fast as the other one, but still.

How can I achieve an insert ignore and return inserted id functionality, which won't deadlock in concurrent environment?

szab.kel
  • 2,356
  • 5
  • 40
  • 74
  • Please void asking multiple distinct questions. This is currently too broad IMO. IMO you should separate the two questions you have. – TT. Oct 12 '17 at 12:17
  • @TT. I think this is a single question. Asking how can I do insert ignores in a concurrent way, while also returning Ids. – szab.kel Oct 12 '17 at 12:19
  • Well... after your first question, you go one with _Another thing I am struggling with_. That's two distinct issues. – TT. Oct 12 '17 at 12:20
  • @TT. I bet you did not even read the whole question and just stopped there right away. So I just won't answer to you anymore. I did provide a context to my question, that is all. – szab.kel Oct 12 '17 at 12:21
  • In fact I did. The two issues are distinct, hence my comment about "Too broad". – TT. Oct 12 '17 at 12:22
  • 1
    MERGE with a table valued parameter and OUTPUT the before/after identifiers. – ta.speot.is Oct 12 '17 at 12:30
  • @ta.speot.is Hm, I somehow missed you can do OUTPUT with MERGE. It seems to work. You should write this as an answer I guess. Also, is there any way to put the OUTPUT into an OUT variable? – szab.kel Oct 12 '17 at 12:36
  • OUTPUT could work too, but I managed using this.: https://stackoverflow.com/a/16838803/840315 – szab.kel Oct 12 '17 at 12:44
  • The point of using a table valued parameter is to insert multiple rows at once. Then you OUTPUT into another table valued parameter as the result or just straight up let the OUTPUT be a result set. – ta.speot.is Oct 12 '17 at 13:25
  • @ta.speot.is I just cannot input table values into my stored procedures as parameters. Java JPA 2.1 do not support it. Maybe I could do it in JDBC https://stackoverflow.com/a/36618680/840315 but currently we only support JPA – szab.kel Oct 12 '17 at 13:32

1 Answers1

0

After @ta.speot.is mentioned you could do OUTPUT with merge, I searched for how to assign it to a variable and an answer mentioned it.

I used this stored procedure.:

CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
        MERGE
        INTO [dbo].[PDProductDetails] T
        USING (SELECT @GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage)
            AS Source (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference, PropertiesJson, OriginalContentPage)
        ON T.GDDataSourceVersionId = Source.GDDataSourceVersionId AND
            T.ManufacturerId = Source.ManufacturerNameId AND
            T.ManufacturerReference = Source.ManufacturerReference
        WHEN MATCHED THEN
            UPDATE SET @NewId = T.Id
        WHEN NOT MATCHED THEN
            INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
            VALUES(Source.GDDataSourceVersionId, Source.ManufacturerNameId, 
                Source.ManufacturerReference, Source.PropertiesJson, Source.OriginalContentPage);

    SET @NewId = ISNULL(@NewId, SCOPE_IDENTITY());

    COMMIT TRANSACTION;

    SELECT @NewId;
END
GO

Edit.: As @ta.speot.is mentioned, it would have been better, to make batch request using table-valued parameters, using the same approach (MERGE would use the table input as Source).

szab.kel
  • 2,356
  • 5
  • 40
  • 74