How to use MERGE and then to split one source record to more target records?

Question

I am writing a data pump that gets information from the source_table with columns code, some_data, oz1, oz2, oz3, oz4, and I do store it in target_table with the same structure. In parallel, I would like to update also another table with different structure (say oz_table) -- records code, oz -- that is four records instead of one record with four values (at maximum because the empty and NULL values for ozX are not to be stored).

I am using the MERGE command (Microsoft T-SQL) for the existing target_table (four oz's in one record -- the old approach). The INSERTed/UPDATEd records are collected using the OUTPUT mechanism into a table variable @info_table. (The target records are intentionally never deleted if the source records disappear; so, there is no DELETE action.)

So far, I have a code like this:

CREATE PROCEDURE dbo.data_pump
AS
BEGIN
    SET NOCOUNT ON
    DECLARE @result int = -555   -- init (number of affected records)

    DECLARE @info_table TABLE (
        action nvarchar(10),
        code int,
        oz1 nvarchar(40),
        oz2 nvarchar(40),
        oz3 nvarchar(40),
        oz4 nvarchar(40)
    )

    BEGIN TRANSACTION tran_data_pump
    BEGIN TRY
        MERGE target_table AS target
        USING (SELECT code, some_data, oz1, oz2, oz3, oz4
               FROM source_table) AS source
        ON target.code = source.code
        WHEN MATCHED AND (COALESCE(target.some_data, '') != COALESCE(source.some_data, '')
                          OR COALESCE(target.oz1, '') != COALESCE(source.oz1, '')
                          OR COALESCE(target.oz2, '') != COALESCE(source.oz2, '')
                          OR COALESCE(target.oz3, '') != COALESCE(source.oz3, '')
                          OR COALESCE(target.oz4, '') != COALESCE(source.oz4, '')
                          ) THEN
            UPDATE
            SET target.some_data = source.some_data,
                target.oz1 = source.oz1,
                target.oz2 = source.oz2,
                target.oz3 = source.oz3,
                target.oz4 = source.oz4
        WHEN NOT MATCHED THEN
            INSERT (code, some_data,
                    oz1, oz2, oz3, oz4)
            VALUES (source.code, source.some_data,
                    source.oz1, source.oz2, source.oz3, source.oz4)
        OUTPUT
            $action AS action,  -- INSERT or UPDATE
            inserted.code AS code,
            inserted.oz1 AS oz1,
            inserted.oz2 AS oz2,
            inserted.oz3 AS oz3,
            inserted.oz4 AS oz4
        INTO @info_table;

        SET @result = @@ROWCOUNT

        COMMIT TRANSACTION tran_data_pump
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION tran_data_pump
        SET @result = -1    -- transaction-failed indication
    END CATCH
    RETURN @result  -- OK, number of the transfered records
END

It works fine, so far. Now I would like to process the @info_table to insert/update the oz_table. For the action UPDATE, the records with the code should be deleted first, and the new ones should be inserted. The order is not important, and the new number of the inserted records may be different. The NULL or empty strings in oz's should not produce any record. For the INSERT action, the situation is simpler just to insert the new records.

Update: The question was slightly modified to clarify the core of the question. The data tables can be defined like that:

CREATE TABLE dbo.source_table (
    ID int IDENTITY PRIMARY KEY NOT NULL,
    code int,
    some_data nvarchar(50),
    oz1 nvarchar(40),
    oz2 nvarchar(40),
    oz3 nvarchar(40),
    oz4 nvarchar(40)
)

CREATE TABLE dbo.target_table (
    ID int IDENTITY PRIMARY KEY NOT NULL,
    code int,
    some_data nvarchar(50),
    oz2 nvarchar(40),
    oz3 nvarchar(40),
    oz1 nvarchar(40),
    oz4 nvarchar(40)
)

CREATE TABLE dbo.oz_table (
    ID int IDENTITY PRIMARY KEY NOT NULL,
    code int,
    oz nvarchar(40) NOT NULL
)

See the full testing script (creates database, the tables, calls the data_pump at http://pastebin.com/wBz3Tzwn

How to do it nicely? I need the the efficient solution as the amount of date may be big, and the operation should be as fast as possible.

You said that _Now I would like to process the `@info_table` to insert/update the `alternative_table`_, is it the 4th table besides `source_table`, `target_table`, `@info_table`? It seems like another MERGE operation. By the way, if you can provide an input / output example, it would be even better for understanding. — Mincong Huang, Sep 26 '16 at 06:51
@MincongHuang: Yes. See the updated question. You will also find the link to the full code snippet, including the testing data. Only, replace `sourceID` by `code` in the `oz_table` definition in the snippet (my mistake). There is no output example, but it should be the unpivoted input with the `code` (if the source record was inserted or modified). — pepr, Sep 27 '16 at 08:49
If interested, please, have a look a the follow-up question http://stackoverflow.com/q/39771406/1346705 — pepr, Sep 29 '16 at 13:18

score 4 · Answer 1 · answered Sep 26 '16 at 07:01

4

If I had understood your problem statement correctly then below approach could be one way to solve -

    -- declare the temp tables
    DECLARE @info_table TABLE (
        action nvarchar(10),
        ID int,
        oz1 nvarchar(40),
        oz2 nvarchar(40),
        oz3 nvarchar(40),
        oz4 nvarchar(40)
    )
    --create intermediate table to store the results
    CREATE TABLE #temp_alternative_table (ID int,oz nvarchar(40))
    -- insert some dummy values
    INSERT INTO @info_table (action,ID,oz1,oz2,oz3,oz4)
    VALUES 
        ('INSERT',1, '85', '94', '78', '90'),
        ('UPDATE',2, '75', '88', '91', '78')
    --SELECT * FROM @info_table
    -- doing unpivot and transforming one row many columns to many rows one column and inserting into intermediate temp table
    INSERT INTO #temp_alternative_table
    SELECT *
    FROM (
        SELECT 
               Action   
            ,  ID
            , [Oz]
        FROM @info_table
        UNPIVOT 
        (
            [Oz] FOR tt IN (oz1, oz2, oz3, oz4)
        ) unpvt
    ) t
    -- delete from main table all the records for which the action is UPDATE (stored in intermediate temp table for the same ID as of main table)
    DELETE at
    FROM alternative_table at
    INNER JOIN #temp_alternative_table tat
      ON at.ID = tat.ID
    WHERE tat.action = 'UPDATE'
    -- now insert all the records in main table
    INSERT INTO alternative_table (ID,Oz)
    SELECT ID,Oz
    FROM #temp_alternative_table

Let me know if this what you were looking for. Hope this helps.

answered Sep 26 '16 at 07:01

Abhishek

2,482
1
21
32

You were first and you did suggest the UNPIVOT. However, the 200 bounty is quite a lot, and I will give you that only when you rethink the solution based on the updated question -- see the snippet for testing available as a link in the question. I already have a more tidy solution. I will not tell you, because you can find even nicer ;) But I am ready to give you the bounty after more effort. – pepr Sep 27 '16 at 12:00
2

@pepr define "more tidy" please. Any by the way, If you do not award your bounty within 7 days (plus the grace period of 24 hours), the highest voted answer created after the bounty started with a minimum score of 2 will be awarded half the bounty amount. It's not nice to threaten a bounty and withhold information which can help developers find a more ideal solution to your problem. – S3S Sep 27 '16 at 16:44
@scsimon: At least, the "more tidy" for me is when it does not contain unnecessary constructs or syntactic "garbage". For example, the one with UNPIVOT can be written like this: `INSERT INTO oz_table SELECT code, oz FROM @info_table UNPIVOT (oz FOR x IN (oz1, oz2, oz3, oz4)) AS t` – pepr Sep 29 '16 at 08:01
I hope I did not threaten Abhishek. He is more SQL guy than probably I am. If I can see the flaw, he also should. ;) – pepr Sep 29 '16 at 08:09
1

@pepr - No hard feelings, it's always good to know different approaches for a particular problem that's why we are here to learn. The solution provided by me was tidy - at least I feel that way just that I had divided into many small steps for better understanding - we can always improve upon the base :) – Abhishek Sep 29 '16 at 08:44
@Abhishek: No offence. Vladimir emphasized also other important things. From my point of view, the CROSS APPLY and UNPIVOT are comparable. So, the only "flaw" is the extra SELECT in your solution that need not to be there. Frankly, if it was possible to split the bounty, I would give you a half. – pepr Sep 29 '16 at 10:09
If interested, please, have a look a the follow-up question http://stackoverflow.com/q/39771406/1346705 – pepr Sep 29 '16 at 13:20

Vladimir Baranov · Accepted Answer · 2016-09-27T13:27:13.270

Since you are talking about efficiency, at first there should be appropriate indexes.

source_table and target_table should have unique index on code. It should be unique, otherwise the main MERGE would fail when it attempts to update the same row more than once.

oz_table should have non-unique index on code.

@info_table should have code as a primary key. It is not possible to have two different actions for the same code from one MERGE, so code should be unique:

DECLARE @info_table TABLE 
(
    action nvarchar(10),
    code int PRIMARY KEY,
    oz1 nvarchar(40),
    oz2 nvarchar(40),
    oz3 nvarchar(40),
    oz4 nvarchar(40)
);

There is no need for extra temporary tables. We already have one - @info_table.

After the MERGE, when @info_table is populated we need to do two steps: 1) delete some rows from oz_table, 2) add some rows to oz_table.

At first delete from oz_table those rows that were updated by the MERGE. In fact, those rows that were inserted by the MERGE would not exist in oz_table any way, so we can use a simple DELETE statement. There is no need to explitly filter by UPDATE action. Such filter would not remove any rows.

If there is a chance that oz_table can be changed outside this data pump process, then extra filter is needed.

DELETE FROM dbo.oz_table
WHERE dbo.oz_table.code IN
    (
        SELECT code FROM @info_table
    )
;

Indexes on code on both tables would help to join them effiently.

Then, simply insert both updated and inserted rows.

INSERT INTO dbo.oz_table(code, oz)
SELECT
    T.code
    ,CA.oz
FROM
    @info_table AS T
    CROSS APPLY
    (
        VALUES
            (T.oz1),
            (T.oz2),
            (T.oz3),
            (T.oz4)
    ) AS CA (oz)
WHERE
    CA.oz IS NOT NULL
    AND CA.oz <> ''
    -- The NULL or empty strings should not produce any record
;

I prefer to use CROSS APPLY and VALUES instead of UNPIVOT. We want to remove NULL and empty values, so I think it is better to write the filter explicitly, so that everybody reading the code sees it. With UNPIVOT you need to know that it removes NULL implicitly. You'll still have to add a filter to remove empty values. In this case the performance would mosts likely be the same, but you'd better check on the real data.

If there is a chance that this data pump procedure can be called more than once at the same time, then you have to take extra steps to prevent possible concurrency problems (incorrect results or deadlocks). I prefer to use sp_getapplock to guarantee that only one instance of a stored procedure can run at any moment.

+1 always good to know different approaches to a solution and the NULL handling by `UNPIVOT` and `CROSS APPLY` — Abhishek, Sep 28 '16 at 13:15
Thanks, Vladimir, for the comment. I am aware of the indexes (wanted just to keep the question simple). Frankly, I originally wanted to avoid making a cursor and the loop. I am also aware about the concurrency issues. Anyway, would not the transaction prevent a possible confusion? For the delete, I did use INNER JOIN with the `@info_table`. Is the IN more efficient? — pepr, Sep 29 '16 at 06:25
@pepr, yes, all the steps (`MERGE`, `DELETE`, `INSERT`) should be wrapped in a transaction. I didn't analyse deeply if deadlocks in your case are possible or not. It may depend on your data. For example, can two processes try to update the same `code` at the same time? If deadlocks are possible, then having just a transaction (with default isolation level) may be not enough to prevent them. You'll have to test it. `JOIN` vs `IN` - try both and compare the execution plans. Most likely they will be the same. For me, `IN` is a bit easier to read/understand. — Vladimir Baranov, Sep 29 '16 at 06:54
words like efficiency and performance and the suggestion is MERGE, IN, and table variables.... I beg to differ MERGE doesn't get optimized very well and its performance in my environments has been WAY poorer than simple Update, Insert, Delete Statements. Add in that @info_table is a table variable NOT a temp table and its performance would not be as good as say an indexed temp table if you are dealing with any decent amount of data. finally using IN is the least performant method of comparison http://stackoverflow.com/questions/173041/not-in-vs-not-exists — Matt, Sep 29 '16 at 13:17
If interested, please, have a look a the follow-up question http://stackoverflow.com/q/39771406/1346705 — pepr, Sep 29 '16 at 13:20
@Matt: Thanks for the comment. I will focus on the efficiency later. I may write the related question to Code Review. Then I will put the link here for you to know. — pepr, Sep 29 '16 at 13:40

score 1 · Answer 3 · answered Sep 26 '16 at 19:37

A slightly different approach to consider is to define an update/insert trigger on your target_table. With this method your data pump only need to consider your initial target table. The trigger will transform and insert the merged data into the alternative_table table.

If you are willing to incur the extra data storage cost of storing the original oz column names as a search key in your alternative table, then you can use another merge statement which will improve the overall performance of the trigger as follow:

-- Create example table
CREATE TABLE [dbo].[alternative_table](
    [ID] [int] NOT NULL,
    [ColumnKey] [nvarchar](5) NOT NULL,
    [oz] [nvarchar](100) NULL,
 CONSTRAINT [PK_alternative_table] PRIMARY KEY CLUSTERED 
(
    [ID] ASC,
    [ColumnKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

-- Create trigger responsible for populating Alternative table.
CREATE TRIGGER dbo.MergeWithAlternative
   ON  dbo.target_table
   AFTER INSERT, UPDATE
AS 
BEGIN

    SET NOCOUNT ON;


    MERGE [alternative_table] AS [target]
    USING   (
                SELECT 
                      [ID]
                    , [ColumnKey]
                    , [Oz]
                FROM inserted
                UNPIVOT 
                (
                    [Oz] FOR [ColumnKey] IN (oz1, oz2, oz3, oz4)
                ) unpvt
            ) AS [source]
    ON [target].ID = [source].ID AND [target].[ColumnKey] = [source].[ColumnKey]
    WHEN MATCHED THEN
        UPDATE
        SET [target].oz = [source].[Oz]
    WHEN NOT MATCHED THEN
        INSERT 
        (
            ID
            ,[ColumnKey] 
            ,[oz]
        )
        VALUES 
        (
            source.ID
            ,source.[ColumnKey]
            ,source.[Oz]
        );

END

If you are not willing to store the column name as a key lookup then a simple delete / insert action will work as well:

-- Create example table
CREATE TABLE [dbo].[alternative_table](
    [ID] [int] NOT NULL,
    [oz] [nvarchar](100) NULL
) ON [PRIMARY]

GO

-- Create trigger responsible for populating Alternative table.
CREATE TRIGGER dbo.MergeWithAlternative
   ON  dbo.target_table
   AFTER INSERT, UPDATE
AS 
BEGIN

    SET NOCOUNT ON;


    DELETE [dbo].[alternative_table]
    WHERE   [ID] IN (SELECT ID FROM deleted)

    INSERT INTO [dbo].[alternative_table]
    (
        [ID]
        ,[oz]
    )
    SELECT  [ID]
            ,[Oz]
    FROM inserted
    UNPIVOT 
    (
        [Oz] FOR [ColumnKey] IN (oz1, oz2, oz3, oz4)
    ) unpvt

END

The table variable and the output clause that populated it should not be needed anymore with this approach.

Thanks Edmond. I would prefer non-trigger solution as there can be a lot of data, and the solution must be fast. Also, through the OUTPUT, I know exactly what target records should be affected and how. +1 for your help, and for the unpivot. — pepr, Sep 27 '16 at 08:52
I need to merge based on the `code` only (was `ID` before the question was updated). This is the only criterium whether to UPDATE or INSERT. The INSERT decision is simple. However, the UPDATE should be done only when certain column differ -- the oz values belong to the club. The processing of the affected records should also be as efficient as it can be. I will add some comments below the question. — pepr, Sep 27 '16 at 11:54
@pepr just a couple of questions. Is the source table a cache / staging table that gets trimmed to keep its size relatively small or does it keep growing over time? What version of SQL Server are you using? Also are you using Enterprise or Standard edition? Can source table schema be altered slightly (such as adding an extra column)? — Edmond Quinton, Sep 28 '16 at 13:24
The source table is actually the result of `OPENQUERY()` from a SYBASE -- third party, I cannot change it. The target database is MS SQL Express 2014 (I believe). — pepr, Sep 29 '16 at 08:14
If interested, please, have a look a the follow-up question http://stackoverflow.com/q/39771406/1346705 — pepr, Sep 29 '16 at 13:20

Ahmed Saeed · Answer 4 · 2016-09-28T19:55:39.663

Since we need to build the oz_table by replacing the old records and adding new records, I prefer to truncate it first, and rebuild it from scratch by INSERTING all the records.. I'll use two cte's, the first one to read the @info_table while the second one to construct all the rows by UNION all the four oz columns. Then just insert the UNION to the oz_Table.

Truncate table dbo.oz_table

with cte as(
    Select Code, oz1, oz2, oz3, oz4 from @info_table
), cte2 as(
            Select Code, oz1 as oz From cte
  UNION ALL Select Code, oz2 as oz From cte
  UNION ALL Select Code, oz3 as oz From cte
  UNION ALL Select Code, oz4 as oz From cte
)
Insert into dbo.oz_table(Code, oz)
select Code, oz from cte2
Where oz is not null and oz<>''

How to use MERGE and then to split one source record to more target records?

4 Answers4