Query optimization: Slow query is faster when split up

Question

I've experienced a T-SQL query taking a significant amount of time. If I split it up into 2 queries, it's faster, but I don't know why.

The query is meant to fetch data from tables C, G, Y and insert it into X - but only if it doesn't already exist in X.

Here's the slow query. It slows down significant when the X-table grows:

INSERT INTO X (COLUMN_A, COLUMN_B, COLUMN_C) 
    (SELECT a1.COLUMN_A, a1.COLUMN_B, 'QWE' as COLUMN_C 
     FROM 
         (SELECT 
              CAST(G.ID AS VARCHAR(900)) AS COLUMN_A, 
              MAX(Y.B) AS COLUMN_B
          FROM 
              C
          JOIN 
              G ON (G.ID = C.G_UID)
          JOIN 
              Y ON Y.B COLLATE Danish_Norwegian_BIN = C.UID COLLATE Danish_Norwegian_BIN
          GROUP BY 
              CAST(G.ID AS VARCHAR(900))) AS a1 
     WHERE 
         a1.COLUMN_A NOT IN (SELECT COLUMN_A FROM X))

BUT! It's faster if I split the query up (in code) into 2 sets:

Set 1:

SELECT  
    a1.COLUMN_A, a1.COLUMN_B, 'QWE' as COLUMN_C 
FROM 
    (SELECT 
         CAST(G.ID AS VARCHAR(900)) AS COLUMN_A, MAX(Y.B) AS COLUMN_B
     FROM 
         C
     JOIN 
         G ON (G.ID = C.G_UID)
     JOIN 
         Y ON Y.B COLLATE Danish_Norwegian_BIN = C.UID COLLATE Danish_Norwegian_BIN
     GROUP BY 
         CAST(G.ID AS VARCHAR(900))) AS a1

Set 2:

SELECT COLUMN_A FROM X

And then manually subtract the sets from each other, and inserting the result into the X-table.

Why is that? Can my slow query be written better?

@MartinSmith I've uploaded the execution plans here https://imgur.com/a/rH7nZzQ. I've crossed out the table names, I hope you can still decipher which is which (else just ask). I've rewritten the query using Gordon Linoff's suggestions and it should now be using the clustered index in the "NOT EXISTS" clause. I'm currently running a test to see if it has improved performance. — Jon, May 18 '20 at 21:24
Looks like a straightforward anti semi join without any extra apparatus to deal with NULL so I imagine NOT EXISTS may well give the same plan. Look at the estimated vs actual executions on the operators on the inside of the nested loops operators. Particularly that scan and the seek with high estimated cost. Any big discrepancies? — Martin Smith, May 18 '20 at 22:00
@MartinSmith I've done as you suggested. This image should show you the estimated vs. actual executions. https://imgur.com/dvNTU9d The query took 6 minutes and 45 seconds. It might not look exactly like the previous posted execution plans due to the joining tables can change. — Jon, May 19 '20 at 08:14
The image shows that there are 1,624,523 rows going into the anti semi join - so it is doing 1,624,523 clustered index scans on `X` (though the scans may be partial as it can stop if it finds a match - otherwise they are full scans). Can you add an index on `X(COLUMN_A )` so those scans become seeks? — Martin Smith, May 19 '20 at 10:03
@MartinSmith I've used the NOT EXISTS in the WHERE-clause and using a composite index (Column_A is actually multiple columns, I've just simplified it in my post). The Execution plan is now: https://imgur.com/NasXv7a — Jon, May 19 '20 at 12:24
@MartinSmith I added an non clustered index to `X(COLUMN_A, ...)` and it worked great. My queries are faster than my previous code, so I would love to accept your answer if you could post one. — Jon, May 20 '20 at 00:07

score 0 · Answer 1 · answered May 18 '20 at 19:28

0

NOT IN is likely to be really slow. I would recommend NOT EXISTS:

 where not exists (select 1 from x where x.column_a = a1.column_a)

This can also take advantage of an index on x(column_a).

answered May 18 '20 at 19:28

Gordon Linoff

1,242,037
58
646
786

I'll try it! Thanks - I'll give feedback after it's done. – Jon May 18 '20 at 19:37
1

Hi Gordon. Why is `NOT IN` likely to be slow and `NOT EXISTS` not? Except for the null trap with `NOT IN`, they do the same thing. Is SQL Server's optimizer notorious for having problems with `NOT IN`? – Thorsten Kettner May 18 '20 at 19:40
1

@ThorstenKettner because of the different semantics with NULL. SQL Server adds extra operators to NOT IN plans if either of the columns involved are nullable. Some examples of that here https://stackoverflow.com/a/11074428/73226. If the columns aren't nullable the plans should be the same but NOT EXISTS is the safe default choice not to have to worry about it – Martin Smith May 18 '20 at 20:21
But we don't know whether or not that is the issue here as no execution plans or DDL – Martin Smith May 18 '20 at 20:22
I've uploaded the execution plans here https://imgur.com/a/rH7nZzQ – Jon May 18 '20 at 21:59

Query optimization: Slow query is faster when split up

1 Answers1