11

I have an insert-select statement that needs to only insert rows where a particular identifier of the row does not exist in either of two other tables. Which of the following would be faster?

INSERT INTO Table1 (...)
SELECT (...) FROM Table2 t2
WHERE ...
   AND NOT EXISTS (SELECT 'Y' from Table3 t3 where t2.SomeFK = t3.RefToSameFK)
   AND NOT EXISTS (SELECT 'Y' from Table4 t4 where t2.SomeFK = t4.RefToSameFK AND ...)

... or...

INSERT INTO Table1 (...)
SELECT (...) FROM Table2 t2
WHERE ...
   AND t2.SomeFK NOT IN (SELECT RefToSameFK from Table3)
   AND t2.SomeFK NOT IN (SELECT RefToSameFK from Table4 WHERE ...)

... or do they perform about the same? Additionally, is there any other way to structure this query that would be preferable? I generally dislike subqueries as they add another "dimension" to the query that increases runtime by polynomial factors.

KeithS
  • 70,210
  • 21
  • 112
  • 164
  • What is the nullability of all columns involved? Execution Plans? I'd expect `NOT EXISTS` to be at least as fast as `NOT IN` [as per my answer here](http://stackoverflow.com/questions/173041/not-in-vs-not-exists/11074428#11074428) – Martin Smith Sep 19 '12 at 18:26
  • The question should not be "which is faster", but "which is correct". If they are equivalent, there should not be a difference in performance anyway. If they are not equivalent, they could yield different results (at least one of them wrong) – wildplasser Sep 19 '12 at 18:54

4 Answers4

13

Usually it does not matter if NOT IN is slower / faster than NOT EXISTS, because they are NOT equivalent in presence of NULL. Read:

NOT IN vs NOT EXISTS

In these cases you almost always want NOT EXISTS, because it has the usually expected behaviour.

If they are equivalent, it is likely that your database already has figured that out and will generate the same execution plan for both.

In the few cases where both options are aquivalent and your database is not able to figure that out, it is better to analyze both execution plans and choose the best options for your specific case.

Community
  • 1
  • 1
gpeche
  • 21,974
  • 5
  • 38
  • 51
  • We can assume for the present that the identifiers being joined will never be null; they are required data for all records on tables that have them. – KeithS Sep 19 '12 at 19:49
  • 1
    Then, if you have the `NOT NULL` contraints defined, any decent DB should give you the same execution plan for both queries. If that does not happen, you will have to analyze and profile both queries and select the best, as plan execution is very dependant on size and distribution of your data set. – gpeche Sep 19 '12 at 20:00
1

You could use a LEFT OUTER JOIN and check if the value in the RIGHT table is NULL. If the value is NULL, the row doesn't exist. That is one way to avoid subqueries.

SELECT (...) FROM Table2 t2
LEFT OUTER JOIN t3 ON (t2.someFk = t3.ref)
WHERE t3.someField IS NULL
Mark Sherretta
  • 10,160
  • 4
  • 37
  • 42
  • Generally this is less efficient in SQL Server. If you are lucky it will convert it to an anti semi join (same as the `NOT EXISTS`). In the worst case SQL Server LEFT JOINs everything and filters the NULLs out after which can be much more inefficient (Example of that [here](http://bradsruminations.blogspot.co.uk/2011/10/t-sql-tuesday-023-flip-side-of-join.html)) – Martin Smith Sep 19 '12 at 18:29
  • Thanks for the information Martin. I was not aware of that, but wouldn't this be the only way to avoid the "subquery" nature of NOT IN or NOT EXISTS? – Mark Sherretta Sep 19 '12 at 18:31
  • Probably but `NOT EXISTS` is a sub query in syntax only, If there was different syntax for it like `LEFT ANTI SEMI JOIN` that would better reflect the execution plan (NB: I am commenting from a SQL Server perspective) – Martin Smith Sep 19 '12 at 18:33
  • I have some data with 800,000 in the primary table when check the FK for null there are only 12 returned. I get a slightly better response time with this compared to not in or except (1.1 seconds compared to 1.2). If I use with (no lock) both are 0.6 seconds. The join is to a PK. If not a PK I agree this could get bad. – paparazzo Sep 19 '12 at 19:12
1

It's dependent on the size of the tables, the available indices, and the cardinality of those indices.

If you don't get the same execution plan for both queries, and if neither query plans out to perform a JOIN instead of a sub query, then I would guess that version two is faster. Version one is correlated and therefore would produce many more sub queries, version two can be satisfied with three queries total.

(Also, note that different engines may be biased in one direction or another. Some engines may correctly determine that the queries are the same (if they really are the same) and resolve to the same execution plan.)

Larry Lustig
  • 49,320
  • 14
  • 110
  • 160
  • It's tagged TSQL so I assume the OP is specifically interested in SQL Server (though could mean Sybase I guess) – Martin Smith Sep 19 '12 at 18:35
  • Yes, noticed that just as I hit "post". I'll leave the statement there as it may be of interest to others who come across the comment, but I'll move it out of the body of the message. – Larry Lustig Sep 19 '12 at 18:36
0

For bigger tables, it's recomended to use NOT EXISTS/EXISTS, because the IN clause runs the subquery a lot of times depending of the architecture of the tables.

Based on cost optimizer:

There is no difference.

David Castro
  • 1,773
  • 21
  • 21