NOT EXISTS vs NOT IN

Question

I'm optimizing some SQL queries (this could be considered part 2 of a question i recently posted) and replacing some NOT IN with NOT EXISTS predicates

Am I right in thinking that the main benefit to doing so is that with NOT EXISTS you get the benefit that the statement will terminate when a single match is found, but NOT IN with a counting subquery would would have to do a full table scan?

It also seems that NOT IN would also require extra work if the data selected contained NULLs, is this correct?

I need to make sure that the second statement better than the first (and functionally equivalent) in these two cases before I implement them in the proc:

Case 1:

        --exclude sessions that were tracked as part of a conversion during the last response_time minutes
        -- AND session_id NOT IN (SELECT DISTINCT tracked_session_id    
        --                              FROM data.conversions WITH (NOLOCK)
        --                              WHERE client_id = @client_id
        --                              AND utc_date_completed >= DATEADD(minute, (-2) * cy.response_time, @date)
        --                              AND utc_date_completed <= @date     
        --                              AND utc_date_clicked <= @date)

        AND NOT EXISTS (SELECT 1
                            FROM data.conversions WITH (NOLOCK)
                            WHERE client_id = @client_id
                            AND utc_date_completed >= DATEADD(minute, (-2) * cy.response_time, @date)
                            AND utc_date_completed <= @date
                            AND utc_date_clicked <= @date
                            AND data.conversions.tracked_session_id = d.session_id
        )

Case 2:

        -- NOT EXISTS vs full table scan with COUNT(dashboard_id)                                   
        -- AND (SELECT COUNT(dashboard_id)
        --          FROM data.dashboard_responses WITH(NOLOCK)
        --          WHERE session_id = d.session_id
        --          AND cycle_id = cy.id
        --          AND client_id = @client_id) = 0

        AND NOT EXISTS(SELECT 1
                            FROM data.dashboard_responses
                            WHERE session_id = d.session_id
                            AND cycle_id = cy.id
                            AND client_id = @client_id)

Cheers

Does the execution plan not tell you if NOT IN executes a table scan? I would personally look at what the plan says regarding performance/IO statistics over anything else. — Darren, Jan 23 '13 at 11:51
Unfortunatly I can't (easily) run these stored procedures against the data-source to get a query plan — managedheap84, Jan 23 '13 at 11:53
Starting with Oracle 10g IN/NOT explicitly converted to EXISTS/NOT. The NOT IN is as efficient as NOT EXISTS. But do not use NOT IN if your subquery may return NULLS. — Art, Jan 23 '13 at 13:46

GarethD · Accepted Answer · 2013-01-23T12:44:01.050

As you have rightly said the two are different things. If the subquery of items to not be IN contains NULL no results will be returned because nothing equals NULL and nothing does not equal NULL (Not even NULL).

Assuming you are using the two to achieve the same result, there is no difference between the two as long as you handle NULL values in your IN statement. The optimiser is clever enough to know that with NULL values eliminated, or with non nullable columns the two are the same, so use the same ANTI SEMI JOIN.

Consider these two tables:

CREATE TABLE T (ID INT NOT NULL PRIMARY KEY);
CREATE TABLE T2 (ID INT NOT NULL PRIMARY KEY);

These two queries get exactly the same execution plan:

SELECT  *
FROM    T
WHERE   ID NOT IN (SELECT ID FROM T2);

SELECT  *
FROM    T
WHERE   NOT EXISTS (SELECT ID FROM T2 WHERE T.ID = T2.ID);

because the optimiser knows T2.ID is a non nullable column. With a third table:

CREATE TABLE T3 (ID INT);

where the ID column is neither indexed or nullable these two queries render very different execution plans:

SELECT  *
FROM    T
WHERE   ID NOT IN (SELECT ID FROM T3);

SELECT  *
FROM    T
WHERE   NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);

and NOT EXISTS will be much more efficient. However these two again yield (essentially) the same execution plan:

SELECT  *
FROM    T
WHERE   ID NOT IN (SELECT ID FROM T3 WHERE T3.ID IS NOT NULL);

SELECT  *
FROM    T
WHERE   NOT EXISTS (SELECT ID FROM T3 WHERE T.ID = T3.ID);

All these queries and sample data are on SQL Fiddle

EDIT

To actually answer your question:

Case 1 will be the same performance with NOT IN or NOT EXISTS if tracked_session_id is a non nullable column in data.conversions, or you add WHERE tracked_Session_id IS NOT NULL inside the In statement. If the column is not nullable and you don't exclude null values the performance won't be the same, and assuming there are no nulls NOT EXISTS will perform better, if there are no nulls the result won't be the same so the performances are not comparable.

Case 2 actually surprised me with sample data, I had assumed that this would not be optimised into an ANTI SEMI JOIN, and had already written an answer saying as much, but just before saving the edit I thought I'd better check, and was surprised to see that this:

SELECT  *
FROM    T
WHERE   (   SELECT  COUNT(*) 
            FROM    T3
            WHERE   T.ID = T3.ID
        ) = 0;

Is optimised exactly the same as NOT EXISTS. So it appears the optimiser is even more clever than I thought, it will only generate a different plan if you want the count to be something other than 0.

SQL Fiddle for Case 2

In mssql 2005 they added an optimisation for `count(*) = 0` and `count(*) > 0` due to many people using that syntax for IF statements. Though exists is the better choice due to being technically correct. http://blogs.technet.com/b/wardpond/archive/2007/08/27/if-exists-select-vs-if-select-count-1-0.aspx — Chris Chilvers, Jan 23 '13 at 16:33

score 2 · Answer 2 · answered Jan 23 '13 at 11:59

You are correct that there is a big difference with nulls. A NOT IN query checks that each element definitively does not match. A comparison with null doesn't produce a definitive result. Thus if your subquery contains a null, nothing will be considered "NOT IN" it.

See this SQL Fiddle example.

The unintuitive side effect of this behavior is that NOT IN is not actually the opposite of IN.

A NOT EXISTS query does not have this problem.

I would be hesitant to make any blanket statement about which performs better, because that often depends on what sort of optimizations occur. That is why being able to find out the execution plan is important if you care about performance.

NOT EXISTS vs NOT IN

2 Answers2