Will a SQL DELETE with a sub query execute inefficiently if there are many rows in the source table?

Question

I am looking at an application and I found this SQL:

DELETE FROM Phrase
WHERE PhraseId NOT IN(SELECT Id FROM PhraseSource)

The intention of the SQL is to delete rows from Phrase that are not in the PhraseSource table.

The two tables are identical and have the following structure:

Id - GUID primary key
... 
... 
...
Modified int

The ... columns stand for about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows with a higher number in the Modified column and different text and numeric data.

Can someone tell me whether this query will execute the SELECT Id FROM PhraseSource for every row in the Phrase table? If so, is there a more efficient way to code it?

Do you need the PhraseSource rows that are not present in the Phrase table? — Mansoor, Nov 16 '16 at 07:38
If you want to know what the database *does*, you really ought to read up on `execution plans`. They tell you how the optimizer has actually decided to execute your query - and that's part of the point of SQL - you're telling the system *what to do*, not *how to do it*, and it has a lot of latitude in how it rearranges and re-writes parts of the query to try to achieve efficiency. — Damien_The_Unbeliever, Nov 16 '16 at 07:43
I see you have accepted an answer that conflicts with what I know to be a better way of doing the `DELETE` statement. You can find details in my answer. — TT., Nov 16 '16 at 18:19

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

1. Will this query execute the SELECT Id from PhraseSource for every row?

No.

In SQL you express what you want to do, not how you want it to be done¹. The engine will create an execution plan to do what you want in the most performant way it can.

For your query, evaluating the subquery once per row is not necessary. Instead, the engine will typically create an execution plan that reads PhraseSource once and then uses a left anti semi join to determine which IDs are not present in the PhraseSource table.

You can verify this by including the Actual Execution Plan in SQL Server Management Studio.

2. Is there a more efficient way that this could be coded?

Yes, slightly more efficient, as follows:

DELETE
    p 
FROM 
    Phrase AS p
WHERE 
    NOT EXISTS (
        SELECT
            1
        FROM
            PhraseSource AS ps
        WHERE
            ps.Id=p.PhraseId
    );

This has been shown in tests by Aaron Bertrand on sqlperformance.com, in the article "Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?":

Conclusion

[...] for the pattern of finding all rows in table A where some condition does not exist in table B, NOT EXISTS is typically going to be your best choice.

Another benefit of using NOT EXISTS with a correlated subquery is that it keeps working when PhraseSource.Id can be NULL. With NOT IN, a single NULL in the subquery result makes the predicate evaluate to UNKNOWN for every row that has no match, so nothing gets deleted. I suggest you read up on IN/NOT IN vs NULL values in the subquery, e.g. on sqlbadpractices.com: Using NOT IN operator with null values.

The PhraseSource.Id column is probably not nullable in your schema, but I prefer using a method that is resilient in all possible schemas.
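To make the NULL pitfall concrete, here is a minimal sketch. It is not SQL Server: it uses SQLite through Python's `sqlite3` module as a convenient stand-in, the sample rows are invented, and `PhraseSource.Id` is deliberately declared nullable (an assumption for the demo, unlike the primary key in the question). The queries are SELECTs that list the rows each predicate would pick for deletion.

```python
import sqlite3

# SQLite is only a stand-in here; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phrase (PhraseId TEXT PRIMARY KEY);
    CREATE TABLE PhraseSource (Id TEXT);  -- nullable on purpose for this demo
    INSERT INTO Phrase VALUES ('a'), ('b'), ('c');
    INSERT INTO PhraseSource VALUES ('a'), (NULL);
""")

# Rows the original NOT IN predicate would select for deletion.
not_in_rows = conn.execute("""
    SELECT PhraseId
    FROM Phrase
    WHERE PhraseId NOT IN (SELECT Id FROM PhraseSource)
    ORDER BY PhraseId
""").fetchall()

# Rows the NOT EXISTS predicate would select for deletion.
not_exists_rows = conn.execute("""
    SELECT p.PhraseId
    FROM Phrase AS p
    WHERE NOT EXISTS (
        SELECT 1 FROM PhraseSource AS ps WHERE ps.Id = p.PhraseId
    )
    ORDER BY p.PhraseId
""").fetchall()

print(not_in_rows)      # empty: the NULL turns NOT IN into UNKNOWN for 'b' and 'c'
print(not_exists_rows)  # 'b' and 'c' are still found

conn.close()
```

With the NULL present, the NOT IN version silently selects nothing, while NOT EXISTS still reports the two unmatched rows.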

¹ Exceptions exist when forcing the engine to use a specific path, e.g. with Table Hints or Query Hints. The engine doesn't always get things right.

score 1 · Accepted Answer · edited May 23 '17 at 12:13


In this case the subquery could be evaluated for each row if the database system is not smart enough, but MS SQL Server should be able to recognize that it does not need to evaluate the subquery more than once.

Still there is a better solution:

DELETE p 
FROM Phrase p
LEFT JOIN PhraseSource ps ON ps.Id = p.PhraseId
WHERE ps.Id IS NULL

This uses a LEFT JOIN, which matches the rows of both tables; where a Phrase row has no match, the ps columns are left NULL. You then check ps.Id (the right side of the join) for NULL to find the Phrase rows that have no match, and delete those.
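The way the join exposes unmatched rows can be sketched as follows. This is an illustration under assumptions, not SQL Server output: it runs on SQLite via Python's `sqlite3` module, the sample rows are invented, and it uses SELECT rather than DELETE because SQLite does not accept the `DELETE p FROM ... JOIN` form.

```python
import sqlite3

# SQLite is only a stand-in here; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phrase (PhraseId TEXT PRIMARY KEY);
    CREATE TABLE PhraseSource (Id TEXT PRIMARY KEY);
    INSERT INTO Phrase VALUES ('a'), ('b'), ('c');
    INSERT INTO PhraseSource VALUES ('a'), ('c');
""")

# Every Phrase row appears once; ps.Id is NULL (None) where nothing matched.
joined = conn.execute("""
    SELECT p.PhraseId, ps.Id
    FROM Phrase AS p
    LEFT JOIN PhraseSource AS ps ON ps.Id = p.PhraseId
    ORDER BY p.PhraseId
""").fetchall()

# Adding the IS NULL filter keeps only the unmatched Phrase rows.
orphans = conn.execute("""
    SELECT p.PhraseId
    FROM Phrase AS p
    LEFT JOIN PhraseSource AS ps ON ps.Id = p.PhraseId
    WHERE ps.Id IS NULL
    ORDER BY p.PhraseId
""").fetchall()

print(joined)
print(orphans)

conn.close()
```

Only the rows in `orphans` would be removed by the DELETE above. The filter is safe here because ps.Id is the join key, so it can only be NULL when the join found no match.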

All types of JOIN statements are very nicely described in this answer.

Here you can see three different approaches to a similar issue compared on MySQL. As @Drammy mentions, to see how a given approach actually performs, inspect the execution plan on your target database and run performance tests on the different approaches to the same problem.

answered Nov 16 '16 at 07:39

Martin Zikmund


Your "better solution" is actually worse than what the OP had (in terms of performance). If you have any references to back your claim I'd be interested to read them. – TT. Nov 16 '16 at 09:59
But why is it worse performance-wise? For example, in the article I linked it seems both solutions perform the same. – Martin Zikmund Nov 17 '16 at 00:02
From the [article](https://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join): in case a supporting index exists, the left outer join performs a lot worse compared to not exists: factor ~2.7 (!) for a cold cache, ~5.7 (!!) for a warm cache. The difference isn't as much when there is no supporting index, ~1.2 cold, ~1.1 warm. In any case, the OP is looking for an ID field not existing so I'm betting there's a supporting index. PS: Your article is about *MySQL*, not *SQL Server*. The question is tagged with *SQL Server*. – TT. Nov 17 '16 at 06:16
The why? In the [article](https://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join) it becomes clear from the execution plan. The execution plan for a left outer join shows a merge join which is best for typical left joins. The not exists shows a left anti semi join which is best for typical not exists queries. For queries like these the left anti semi join performs better, especially when a supporting index exists. – TT. Nov 17 '16 at 06:22

Drammy · Answer 3 · 2016-11-16T08:37:56.667

1

That query should optimise into a join. Have you looked at the execution plan?

If you're experiencing poor performance, it is likely because of the GUID primary keys.

A primary key is clustered by default. If the GUID primary key is clustered on your table, the data in the table is ordered by the primary key. The problem with random GUIDs as clustered keys is that rows end up scattered across the index: new rows land at arbitrary positions, which causes page splits and fragmentation, and the rows you delete are spread over many pages rather than sitting together.

This article is a good read on the topic:

https://blog.codinghorror.com/primary-keys-ids-versus-guids/

edited Nov 16 '16 at 08:37

answered Nov 16 '16 at 07:44

Drammy


"A primary key is clustered..." is rather incomplete. *By default*, the primary key will be clustered, unless another index has already been declared as clustered (and you're not required to accept the default - you can have a heap table and still have a PK). They are *not* tied 1-1 as your answer implies. – Damien_The_Unbeliever Nov 16 '16 at 07:50
