
I'm new to MERGE and an index novice, so bear with me...

I have a stored procedure that builds a #changes temp table, UPDATEs a prod_tbl table based on #changes, then INSERTs the before and after values into an auto_update_log table, which is quickly growing with duplicate rows. To prevent this, I'd like to use the MERGE statement. Speed is paramount, as is thread safety, because this table will be manipulated throughout the day.

There aren't any existing indexes on auto_update_log, nor is there any type of key. My idea is to create a multi-column index on a key column (in the auto_update_log table) and all of the before and after columns to help speed up the MERGE (a sketch follows the sample table below). There will be six before and after columns plus the one key column, which relates back to the PK of prod_tbl.

Sample Log Table:

CREATE TABLE dbo.sample_auto_update_log (
    id INT NOT NULL, --Primary key from [prod_tbl]
    item_a_before VARCHAR(25) NULL, --[prod_tbl].[item_a]
    item_a_detail VARCHAR(25) NULL, --Value from elsewhere in the DB that applies
    item_a_after VARCHAR(25) NULL, --The new value SET for [prod_tbl].[item_a]
    update_count INT NOT NULL DEFAULT (0),
    update_datetime DATETIME NOT NULL DEFAULT (GETDATE())
);
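
For illustration, the composite index I have in mind would look something like this on the sample table (the index name is a placeholder, and the real table would repeat the before/after columns for each of the six items):

CREATE NONCLUSTERED INDEX ix_log_id_before_after
    ON dbo.sample_auto_update_log (id, item_a_before, item_a_after);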

Sample MERGE:

MERGE sample_auto_update_log WITH (HOLDLOCK) AS t
USING #changes AS s
ON (t.id = s.id AND t.item_a_before = s.item_a_before AND t.item_a_after = s.item_a_after)
WHEN MATCHED THEN
    UPDATE
    SET update_count = update_count + 1, update_datetime = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (id, item_a_before, item_a_detail, item_a_after)
    VALUES (s.id, s.item_a_before, s.item_a_detail, s.item_a_after);

SQL Fiddle

Question: How can I best optimize the MERGE using indexes or other methods?

jreed121
  • Honestly, you would be better off not using MERGE for upserts. There are a number of bugs with MERGE, and upserts like this are one of the problem scenarios. http://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/ – Sean Lange May 07 '15 at 21:27
  • @SeanLange, do you have a good alternative? I've read through several articles, including the one you posted, and it seems like a lot of those issues wouldn't apply in my case. For example, I'm not using primary keys in my target or source table, no variables or table variables, not using it in a trigger, I am already using the `(HOLDLOCK)` hint to prevent race conditions, etc. I also may have overstated the importance of thread safety - performance is a much bigger concern given that it's unlikely that users will be concurrently accessing the same record, but this will be run a lot. – jreed121 May 07 '15 at 21:53
  • Why not just issue two statements joining to your temp table? The first is the update followed by the insert. – Sean Lange May 08 '15 at 13:33
  • Really only because I read that MERGE can be faster. I wanted to use the method suggested here: http://stackoverflow.com/a/21209131/550595 , but that appears to only be for a single row. I'll probably just use two statements, though; if that becomes a problem, then I'll write a job that removes duplicates and increments the counter. – jreed121 May 08 '15 at 17:42
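
A minimal sketch of the two-statement alternative discussed in the comments above, assuming the same sample tables as in the question; the UPDLOCK and SERIALIZABLE hints inside a transaction stand in for MERGE's HOLDLOCK:

BEGIN TRANSACTION;

-- Increment the counter on rows that already exist.
UPDATE t
SET update_count = t.update_count + 1,
    update_datetime = GETDATE()
FROM dbo.sample_auto_update_log AS t WITH (UPDLOCK, SERIALIZABLE)
JOIN #changes AS s
    ON t.id = s.id
    AND t.item_a_before = s.item_a_before -- NULLs won't match, same as in the MERGE
    AND t.item_a_after = s.item_a_after;

-- Insert the rows that were not matched above.
INSERT dbo.sample_auto_update_log (id, item_a_before, item_a_detail, item_a_after)
SELECT s.id, s.item_a_before, s.item_a_detail, s.item_a_after
FROM #changes AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.sample_auto_update_log AS t
    WHERE t.id = s.id
        AND t.item_a_before = s.item_a_before
        AND t.item_a_after = s.item_a_after
);

COMMIT TRANSACTION;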

1 Answer


Consider the following approach.

Do a simple, fast INSERT into auto_update_log in the procedure that updates your data. Don't worry about duplicates in auto_update_log at this stage.

Have another background process that periodically (every few minutes, or whatever is suitable for your system) summarizes what has been accumulated in auto_update_log and updates the final log table with a condensed, duplicate-free summary. Use MERGE with proper supporting indexes to update this summary. Clean up auto_update_log after its contents have been added to the summary.

In other words, auto_update_log becomes a temporary staging table that allows you to update the summary in bulk. Information in the summary will be delayed, so you need to decide whether that is acceptable.
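
A minimal sketch of that background step, assuming a hypothetical condensed table dbo.auto_update_log_summary with the same columns as auto_update_log:

BEGIN TRANSACTION;

-- Condense the staging rows, then MERGE them into the summary in one pass.
-- MAX(item_a_detail) arbitrarily picks one detail value if they differ.
WITH c AS (
    SELECT id, item_a_before, item_a_after,
           MAX(item_a_detail) AS item_a_detail,
           COUNT(*) AS dup_count,
           MAX(update_datetime) AS last_update
    FROM dbo.auto_update_log
    GROUP BY id, item_a_before, item_a_after
)
MERGE dbo.auto_update_log_summary WITH (HOLDLOCK) AS t
USING c AS s
ON (t.id = s.id AND t.item_a_before = s.item_a_before AND t.item_a_after = s.item_a_after)
WHEN MATCHED THEN
    UPDATE
    SET update_count = t.update_count + s.dup_count,
        update_datetime = s.last_update
WHEN NOT MATCHED THEN
    INSERT (id, item_a_before, item_a_detail, item_a_after, update_count, update_datetime)
    VALUES (s.id, s.item_a_before, s.item_a_detail, s.item_a_after, s.dup_count, s.last_update);

-- Remove what was just summarized. A production job should delete only the
-- rows it actually read (e.g., up to a captured cut-off datetime) so that
-- inserts arriving mid-run are not lost.
DELETE FROM dbo.auto_update_log;

COMMIT TRANSACTION;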


With the sample MERGE statement in your question, I would consider adding an index on (id, item_a_before, item_a_after), the columns used in the join comparison, or just on id with item_a_before and item_a_after as included columns. Create it on both tables: auto_update_log and your temp table #changes.
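
Both variants, sketched against the sample table (index names are placeholders):

-- The compared columns as index keys:
CREATE NONCLUSTERED INDEX ix_log_merge_keys
    ON dbo.sample_auto_update_log (id, item_a_before, item_a_after);

-- Or: id as the only key, with the compared values as included columns:
CREATE NONCLUSTERED INDEX ix_log_id_incl
    ON dbo.sample_auto_update_log (id)
    INCLUDE (item_a_before, item_a_after);

-- The same shape on the source side:
CREATE NONCLUSTERED INDEX ix_changes_merge_keys
    ON #changes (id, item_a_before, item_a_after);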

Having an index speeds up finding the rows that have to be updated, but it also takes time to maintain the index itself when rows are added. So, ultimately, you need to try the different methods and measure their performance.

Vladimir Baranov
  • This is basically what my backup plan was. I was going to create a job/SP that uses a CTE to partition on those fields and identify the duplicates, then either update the counter of the original row and delete the duplicates, or rewrite the table from the contents of the CTE. The job will only need to clean up every night. I just preferred to have the data clean going in rather than clean it up afterwards, but it's not that big of a deal - I've already spent way too much time on this. Thanks for the suggestion; I'll mark it as the answer since this is what I'll do. – jreed121 May 08 '15 at 18:00
  • You can first try simply adding indexes as I described in the second part of the answer. In any case, you need to measure performance before and after the change. – Vladimir Baranov May 09 '15 at 02:34
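
A minimal sketch of the nightly cleanup described in the comment above, assuming duplicates are defined by (id, item_a_before, item_a_after) and the oldest row is kept:

-- Credit the oldest row with the number of duplicate rows found.
-- (Summing the duplicates' update_count values instead is a simple variation.)
WITH d AS (
    SELECT update_count,
           ROW_NUMBER() OVER (PARTITION BY id, item_a_before, item_a_after
                              ORDER BY update_datetime) AS rn,
           COUNT(*) OVER (PARTITION BY id, item_a_before, item_a_after) AS cnt
    FROM dbo.sample_auto_update_log
)
UPDATE d
SET update_count = update_count + cnt - 1
WHERE rn = 1 AND cnt > 1;

-- Then remove the duplicates, keeping the oldest row in each group.
WITH d AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY id, item_a_before, item_a_after
                              ORDER BY update_datetime) AS rn
    FROM dbo.sample_auto_update_log
)
DELETE FROM d
WHERE rn > 1;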