
I recently asked a question about how to solve a problem in a T-SQL query, which led me to use a MERGE statement. This is, however, proving problematic as its performance is horrible.

How do I get inserted row Id, along with related data back after insert without inserting the related data

What I need to do is insert rows based on a result set and save away the id of the inserted row along with the data that it resulted from (see related question).

I ended up with a query like this.

DECLARE @temp AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

;WITH person AS
(
    SELECT top 1
        t.[Personnumber]
        ,t.[Firstname]
        ,t.[Lastname]
    FROM [temp].[RawRoles] t
    WHERE t.Personnumber NOT IN 
        (
            SELECT i.Account FROM [security].[Accounts] i
        )
)

MERGE [security].[Identities] AS tar
USING person AS src
ON 0 = 1 -- all rows from src need to be inserted, I've already filtered using the CTE query.
WHEN NOT MATCHED THEN
   INSERT
   (
     [Created], [Updated]
   )
   VALUES
   (
        GETUTCDATE(), GETUTCDATE()
   )
OUTPUT $action, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname]  INTO @temp;


SELECT * FROM @temp

Using this query I'm inserting all the rows and then saving them away to a table variable, along with the source values, for later processing.

This works great on sub-10k rows. But the data set I'm doing this against is close to 2 million rows. I ran this query for about an hour without it completing (on a juiced-up premium tier Azure database).

Question: How can I make this faster? Can I achieve the same result without MERGE?

JensB
  • Why 3 `TOP 1` queries with no `ORDER BY`? What is `@sourceData`, as that isn't defined in your SQL? – Thom A Dec 04 '19 at 12:25
  • What does the execution plan say? I think you'll be doing a full table scan on that merge. You'll get better perf by doing a normal insert and using the OUTPUT clause of INSERT to get the same results into your temp table – gbjbaanb Dec 04 '19 at 12:30
  • @PanagiotisKanavos Well, not exactly, the source data was different, but that was never the problem. It's the MERGE insert that is the problem. The whole query up until the merge takes 20 seconds (that was the same with the previous query as well). – JensB Dec 04 '19 at 12:30
  • Anyway, schema, indexes, execution plan. Preferably statistics too. Without them it's impossible to answer, especially now that the query doesn't show anything strange. Although I'd still use `INSERT SELECT`, not MERGE – Panagiotis Kanavos Dec 04 '19 at 12:30
  • @JensB it is, really. Those TOP 1 subqueries require individual execution in a loop. Simplifying them to a single GROUP BY with MIN/MAX would be a lot faster. There's nothing strange in the question as is now, so it's impossible to say why it's slow – Panagiotis Kanavos Dec 04 '19 at 12:32
  • When posting the query plan, please do use [Paste the Plan](https://www.brentozar.com/pastetheplan/). It makes it far easier for *everyone*. – Thom A Dec 04 '19 at 12:32
  • What's the point of MERGE here? The match clause is always false. Why not just use `INSERT INTO Identities (...) OUTPUT inserted..... Select A,B C from RawRoles LEFT JOIN Accounts ON Account=PersonNumber Where Accounts.ID IS NULL` ? – Panagiotis Kanavos Dec 04 '19 at 12:37
  • Side note: Personally I suggest against using `NOT IN` with a subquery. If *any* of the rows returned have the value `NULL`, then the behaviour you get will not be what you expect. You are far better off using a `NOT EXISTS` or a `LEFT JOIN` and testing for `NULL` (see the sketch after these comments). [DB<>Fiddle](https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=ed41bd8f6610075f2a2134016a879944) – Thom A Dec 04 '19 at 12:38
  • @PanagiotisKanavos I might be wrong, but I thought that OUTPUT could only return data from the table you inserted into, and not from columns that were not inserted. That is what I'm doing with the OUTPUT, where I'm selecting source data and not just inserted data. – JensB Dec 04 '19 at 12:39
  • That's correct, @JensB. Using a `MERGE` is a workaround for that, as it *can* `OUTPUT` data from the "input" objects. – Thom A Dec 04 '19 at 12:40
  • That is exactly what I need to do. I want to get the Id of the inserted row as well as data from the original row, data which will not be inserted (into this table but in another table with a FK to this one). – JensB Dec 04 '19 at 12:41
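
A minimal sketch of the `NOT IN` caveat mentioned above, using hypothetical sample values: if the subquery returns even one `NULL`, the `NOT IN` predicate filters out every row, while `NOT EXISTS` behaves as expected.

-- Hypothetical sample data illustrating the NOT IN / NULL caveat.
DECLARE @Accounts TABLE (Account NVARCHAR(100) NULL);
INSERT INTO @Accounts (Account) VALUES (N'19800101-1234'), (NULL);

DECLARE @RawRoles TABLE (Personnumber NVARCHAR(100) NOT NULL);
INSERT INTO @RawRoles (Personnumber) VALUES (N'19900202-5678');

-- Returns no rows: the NULL in the subquery makes NOT IN evaluate to UNKNOWN.
SELECT r.Personnumber
FROM @RawRoles AS r
WHERE r.Personnumber NOT IN (SELECT a.Account FROM @Accounts AS a);

-- Returns the row as expected: NOT EXISTS is unaffected by NULLs in the subquery.
SELECT r.Personnumber
FROM @RawRoles AS r
WHERE NOT EXISTS (SELECT * FROM @Accounts AS a WHERE a.Account = r.Personnumber);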

2 Answers


It looks to me like your Identities table is just being used as a sequence generator, because you are inserting nothing into it other than timestamps. Have you considered using a SEQUENCE instead of a table to generate the keys? Using a sequence could possibly eliminate this process because you could generate the key whenever it was needed.

Outputting millions of rows to a table variable is unlikely to be workable. Table variables are generally good for a few thousand rows at most.
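
For example, the sequence could be created up front. A minimal sketch follows; the sequence name AccountSeq, the BIGINT type, and the cache size are assumptions made to match the GlobalId usage below.

-- Sketch: the sequence the inserts below draw GlobalId values from.
-- The name (AccountSeq), type (BIGINT) and CACHE size are assumptions.
CREATE SEQUENCE AccountSeq
    AS BIGINT
    START WITH 1
    INCREMENT BY 1
    CACHE 1000; -- a larger cache reduces the overhead of generating millions of values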

-- 1) Add the missing accounts, drawing GlobalId from the sequence.
INSERT INTO security.Accounts (GlobalId, Account, Firstname, Lastname)
SELECT NEXT VALUE FOR AccountSeq, r.Personnumber, r.Firstname, r.Lastname
FROM temp.RawRoles AS r
LEFT JOIN security.Accounts AS a ON r.Personnumber = a.Account
WHERE a.Account IS NULL;

-- 2) Add the matching identity rows for accounts that don't have one yet.
INSERT INTO security.Identities (GlobalId, Created, Updated)
SELECT a.GlobalId, GETUTCDATE() AS Created, GETUTCDATE() AS Updated
FROM security.Accounts AS a
LEFT JOIN security.Identities AS i ON a.GlobalId = i.GlobalId
WHERE i.GlobalId IS NULL;
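
If the OUTPUT-into-a-table approach is kept anyway, a local temp table should scale better than a table variable for millions of rows. A sketch, where the name #inserted is arbitrary:

-- Sketch: a local temp table instead of a table variable for millions of output rows.
CREATE TABLE #inserted
(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);
-- ...then OUTPUT ... INTO #inserted instead of INTO @temp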
nvogel
  • Good thought. I have been wondering if I could save myself some time by using a GUID instead and creating this inline without first needing to insert into that table. – JensB Dec 04 '19 at 12:49

At first glance, the MERGE does not seem to be the culprit for the degraded performance. The merge condition is always false (0 = 1), so insertion (into [security].[Identities]) is the only possible path forward.

How long does it take to insert 2 million rows into @temp, bypassing [security].[Identities] and the MERGE?

DECLARE @temp AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100)
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

--is this fast?!?
INSERT INTO @temp(action, GlobalId, Personnumber, Firstname, LastName)
SELECT 'insert', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN 
(
    SELECT i.Account FROM [security].[Accounts] i
);

To check:

  1. What is the datatype of [temp].[RawRoles].Personnumber? Is Personnumber really nvarchar(100)?
    Do you need to store foreign characters in a person number? Nchar is double the size of char, so varchar/char could be a better choice if you only have alphanumeric values (common Latin characters) or digits with leading zeros. If your requirements can be met with a numeric datatype, then int/bigint/decimal would be preferred.

  2. Is there an index on [temp].[RawRoles].Personnumber? Without an index, the existence check has to sort or hash [temp].[RawRoles].Personnumber, which adds to the resource throughput/DTU cost. A clustered index on [temp].RawRoles could be the most beneficial, considering that the majority of temp.RawRoles rows will eventually be processed/inserted.

  3. What is the datatype of [security].[Accounts].Account? Is there an index on the column? The two columns [security].[Accounts].Account and [temp].[RawRoles].Personnumber should be of the same datatype, and ideally indexed on both sides. If [security].[Accounts] is the final destination of the processed [temp].[RawRoles], then the table could hold millions of rows, and an index on the Account column is required for any future processing. The downside of the index is slower inserts. If the 2 million rows are the very first bulk load, it would be best not to have an index on Account while inserting the bulk into security.Accounts (create it afterwards).

To sum up:

--contemplate & decide whether a change of the Account datatype is needed (a datatype change can have many implications for applications using the db)

--change the data type of Personnumber to the datatype of Account(security.Accounts)
ALTER TABLE temp.RawRoles ALTER COLUMN Personnumber "datatype of security.Accounts.Account" NOT NULL; -- rows having undefined Personnumber?

--clustered index Personnumber
CREATE /*UNIQUE*/ CLUSTERED INDEX uclxPersonnumber ON temp.RawRoles(Personnumber); --unique preferred, if possible

--index on account (not needed[?] when security.Accounts is empty)
CREATE INDEX idxAccount ON [security].Accounts(Account);


--baseline: how fast can we do a straightforward insertion of 2 million rows?
DECLARE @tempbaseline AS TABLE(
      [action] NVARCHAR(20)
     ,[GlobalId] BIGINT
     ,[Personnumber] NVARCHAR(100) --ignore this for now
     ,[Firstname] NVARCHAR(100)
     ,[Lastname] NVARCHAR(100)
);

INSERT INTO @tempbaseline([action], GlobalId, Personnumber, Firstname, LastName)
SELECT 'INSERT', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
    WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)    
--if the execution time (baseline) is acceptable, proceed with the merge code
--"merge with output into" should be be "slightly"/s slower than the baseline.
--if the baseline is not acceptable (simple insertion takes too much time) then merge is futile

/*
DECLARE @temp....


MERGE [security].[Identities] AS tar
USING 
(
    SELECT --top 1
        t.[Personnumber]
        ,t.[Firstname]
        ,t.[Lastname]
    FROM [temp].[RawRoles] t
    WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
) AS src
ON 0 = 1 -- all rows from src need to be inserted, I've already filtered in the USING query.
WHEN NOT MATCHED THEN
   INSERT
   (
     [Created], [Updated]
   )
   VALUES
   (
        GETUTCDATE(), GETUTCDATE()
   )
OUTPUT 'INSERT' /** only insert is possible $action */, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname]  INTO @temp;   


--delete the index on Account (the process will insert 2mil)
DROP INDEX idxAccount ON [security].Accounts --review and create this index after the bulk of accounts is inserted.

...your process

*/
lptr