Not Exists vs Not In: efficiency

Question

I've always been under the assumption that not exists is the way to go instead of using a not in condition. However, I doing a comparison on a query I've been using, I noticed that the execution for the Not In condition actually appears to be faster. Any insight into why this could be the case, or if I've just made a horrible assumption up until this point, would be greatly appreciated!

QUERY 1:

SELECT DISTINCT 
a.SFAccountID, a.SLXID, a.Name FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN  _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
AND c.Primary__C = 0
AND NOT EXISTS
(
SELECT 1 FROM [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
WHERE a.SFAccountID = c2.SFAccountID
AND c2.Primary__c = 1
);

QUERY 2:

SELECT   
DISTINCT
a.SFAccountID FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN  _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK) 
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
AND c.Primary__C = 0
AND a.SFAccountID NOT IN (SELECT SFAccountID FROM [dbo].[Salesforce_Contacts] WHERE Primary__c = 1 AND SFAccountID IS NOT NULL);

Actual Execution plan for Query 1: Execution plan 1

Actual Execution plan for Query 2: Execution plan 2

TIME/IO STATISTICS:

Query #1 (using not exists):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 532 ms, elapsed time = 533 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Contacts'. Scan count 2, logical reads 3078, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'INFORMATION'. Scan count 1, logical reads 691, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ACCOUNT'. Scan count 4, logical reads 567, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Accounts'. Scan count 1, logical reads 680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 250 ms,  elapsed time = 271 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Query #2 (using Not In):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 500 ms, elapsed time = 500 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Contacts'. Scan count 2, logical reads 3079, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'INFORMATION'. Scan count 1, logical reads 691, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ACCOUNT'. Scan count 4, logical reads 567, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Accounts'. Scan count 1, logical reads 680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 157 ms,  elapsed time = 166 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

see if this helps http://stackoverflow.com/questions/173041/not-in-vs-not-exists — Vamsi Prabhala, Jul 09 '15 at 23:07
(1) The actual plans look pretty much the same to me. (2) You need to measure actual performance of the query, not estimated performance of the plan. — Gordon Linoff, Jul 09 '15 at 23:16
my experience with very large databases has made me prefer `IN` over `EXISTS`. I also stopped using `CTE` exclusively and use Temp tables more oftern — JamieD77, Jul 09 '15 at 23:25
Oh I'm sorry, that was the actual execution plan. If you zoom in, the plan looks the same, but the % are slightly different. — tchock, Jul 09 '15 at 23:38
Aaron Bertrand wrote an [**article**](http://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join) on this which he concludes that `NOT EXISTS` is **typically** your best choice. — Felix Pamittan, Jul 10 '15 at 02:37
Don't look at the estimates (yes those percentage are estimates even in the actual plan), measure the actual performance instead. Number of reads, duration etc. — Mikael Eriksson, Jul 10 '15 at 06:47
NOT EXISTS is better than NOT IN, in that it causes no issues regarding NULL values. The execution plans are otherwise the same, so NOT EXISTS should be preferred. — TT., Jul 10 '15 at 07:14
You can use [SET STATISTICS TIME](https://msdn.microsoft.com/en-us/library/ms190287.aspx?f=255&MSPPError=-2147217396) for duration and [SET STATISTICS IO](https://msdn.microsoft.com/en-us/library/ms184361.aspx) for reads. — Mikael Eriksson, Jul 10 '15 at 07:16
@MikaelEriksson, as I said in the answer, I really like [SQL Sentry Plan Explorer](http://www.sqlsentry.com/products/plan-explorer/sql-server-query-view). I personally use free version. It is good to know how it actually works under the hood (that it uses SET STATISTICS), but it presents information in a much more user-friendly way. — Vladimir Baranov, Jul 10 '15 at 08:01
@VladimirBaranov Totally agree. I use Plan Explorer as well. — Mikael Eriksson, Jul 10 '15 at 08:03
@MikaelEriksson I've included time/IO statistics in my original post. If you could give me some insight into which query seems to be performing better/faster/etc., I would be super grateful! Thanks for your input so far. :) — tchock, Jul 10 '15 at 22:48
They are the same. Same number of reads and about the same duration. The plans are identical for your queries. — Mikael Eriksson, Jul 12 '15 at 10:12

score 2 · Accepted Answer · answered Jul 09 '15 at 23:18

2

try

SELECT DISTINCT a.SFAccountID, a.SLXID, a.Name 
  FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
  JOIN _SLX_AccountChannel b WITH(NOLOCK)
    ON a.SLXID = b.ACCOUNTID
   AND b.STATUS IN ('Active','Customer', 'Current')
  JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
    ON a.SFAccountID = c.SFAccountID 
   AND c.Primary__C = 0
  LEFT JOIN [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK) 
    on c2.SFAccountID = a.SFAccountID
   AND c2.Primary__c = 1
 WHERE c2.SFAccountID is null

answered Jul 09 '15 at 23:18

paparazzo

44,497
23
105
176

this type of `join` vs `where` doesn't make a whole lot of difference in most cases – JamieD77 Jul 09 '15 at 23:31
2

@JamieD77 And when it does make a difference it is better. I do this for living. – paparazzo Jul 09 '15 at 23:39
This one was quicker. Could I get a brief explanation of why this would be quicker? Thanks! – tchock Jul 09 '15 at 23:43
doing it for a living you'd think you'd mention seperating out the joins into a `subquery` or and `IN` statement instead of doing a `DISTINCT` since all the fields are coming from the `[Salesforce_Accounts]` table – JamieD77 Jul 09 '15 at 23:47
@JamieD77 Maybe when you get to doing it for a living you will learn what a join can cause duplicates it may not even be unique in Salesforce_Accounts. If you have something better post it. You did get the comment for the OP it is faster? – paparazzo Jul 09 '15 at 23:55
`SFAccountID` doesn't sound like a primary key of `[Salesforce_Accounts]` to you? – JamieD77 Jul 10 '15 at 00:00
@parrotsnest Please post query plan. Just give the query optimizer every chance to be smart. You want to filter as early as possible. I do this a lot and it is as much and art as it is a science. – paparazzo Jul 10 '15 at 00:00
I know there are very strong contributors on this site that like not exists. And it does do better some times. But when not exists does better it is minor. When shit gets ugly this is the best approach from my experience. If the query optimizer gets really stupid then go with hash join hint and even a force order. – paparazzo Jul 10 '15 at 00:38

score 1 · Answer 2 · answered Jul 09 '15 at 23:33

1

as far as i understand it, a not in works in the same fashion as two nested for instructions would.

so, asuming you have two tables: table(1000 records) and tabla (2000 records),

select * from table where table.field not in (select field from tabla)

is like doing

for (int i = 0;  i < 1000; i++) {
   for (int j = 0;  j < 2000; j++) {
   }
}

that is 1000*2000 = 2 million operations.

the left join with tabla.field is null trick, again, as far as i understand it, makes only 2000 operations

Use left join.

answered Jul 09 '15 at 23:33

Prefijo Sustantivo

515
9
22

2

In a world without query optimizers where everything was in memory, sure. In the real world, not so much.... – Billy ONeal Jul 09 '15 at 23:36
1

do the experiment in a local database. fill it generously, i'd say 100.000 records for each table. measure times for both options (with not in and with left join), what times do you get? – Prefijo Sustantivo Jul 09 '15 at 23:42

Eric · Answer 3 · 2015-07-10T06:10:34.397

I think the missing index cause the difference for EXISTS() and IN operations.

Although the question do not ask for a better query, but for me I'll try to avoid the Distinct like this

SELECT
    a.SFAccountID, a.SLXID, a.Name 
FROM 
    [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
    CROSS APPLY 
    (
        SELECT SFAccountID 
        FROM [dbo].[Salesforce_Contacts] WITH(NOLOCK) 
        WHERE SFAccountID  = a.SFAccountID 
        GROUP BY SFAccountID
        HAVING MAX(Primary__C + 0) = 0 -- Assume Primary__C is a bit value
    ) b
WHERE
    -- Actually it is the filtering condition for account channel
    EXISTS
    (
        SELECT * FROM _SLX_AccountChannel WITH(NOLOCK) 
        WHERE ACCOUNTID = a.SLXID AND STATUS IN ('Active','Customer', 'Current')
    )

Vladimir Baranov · Answer 4 · 2015-07-10T03:00:54.273

The question was: "why NOT IN appears to be faster than NOT EXISTS".

My answer is: it only appears to be faster, but it is the same. (in this case)

Did you actually measure the time for both queries and confirm that there is a difference?

OR you just looked at the execution plans?

As far as I understand, the query cost that you see on the screenshots (53% vs 47%) is:

estimated query cost, even though the plans are actual;
it is the query cost, not time, which is combined from CPU and IO "costs".

It seems that in this particular case query optimizer generated almost identical plans for both queries. It is quite likely that the plans differ (slightly) in estimated number of rows for some operators in the plans, but actual performance is the same, because the plan shape is the same. If the estimated number of rows differ, it would lead to different estimated query cost that you see.

To see the difference in plans (if any), I would use a tool like SQL Sentry Plan Explorer. It shows more details and you can compare all aspects of the queries easier.

Rewriting the query to be faster is a different question and I don't attempt to answer it here.

JamieD77 · Answer 5 · 2015-07-10T01:47:41.380

0

This is assuming that you're trying to find Accounts that do not have a Primary Contact and there can only be one Primary Contact

SELECT  a.SFAccountID, a.SLXID, a.Name 
FROM    [dbo].[Salesforce_Accounts] a
        LEFT JOIN [dbo].[Salesforce_Contacts] c ON a.SFAccountID = c.SFAccountID AND c.Primary__C = 1
WHERE
        EXISTS (SELECT  * 
                FROM SLX_AccountChannel b 
                WHERE b.ACCOUNTID = a.SLXID 
                    AND b.STATUS IN ( 'Active', 'Customer', 'Current' ))
        AND c.SFContactID IS NULL

If you want accounts that have contacts but no primary contact you can use

SELECT 
    a.SFAccountID ,
    a.SLXID ,
    a.Name
FROM 
    [dbo].[Salesforce_Accounts] a
WHERE
    a.SFAccountID IN (SELECT SFAccountID 
                    FROM [Salesforce_Contacts] 
                    GROUP BY SFAccountID 
                    HAVING SUM(CAST(Primary__c AS INT) = 0))

    AND a.SLXID IN (SELECT ACCOUNTID 
                    FROM _SLX_AccountChannel 
                    WHERE [STATUS] IN ( 'Active', 'Customer', 'Current' ))

edited Jul 10 '15 at 01:47

answered Jul 10 '15 at 00:15

JamieD77

13,796
1
17
27

Hey trash talk. -1 You totally miss [dbo].[Salesforce_Contacts].Primary__C = 0 – paparazzo Jul 10 '15 at 00:26
how do you know that he only wants Accounts that have 1 contact that is not primary? – JamieD77 Jul 10 '15 at 00:27
1

There is the "AND c.Primary__C = 0". I don't know what he wants but I do know what the query does. – paparazzo Jul 10 '15 at 00:31
@Blam when you grow up a little, we can talk about why joining 3 tables together and then using distinct because you really only needed the information from 1 table is a bad practice – JamieD77 Jul 10 '15 at 00:36
moderators please don't hold this against @Blam I make a pretty good living rewriting queries like his to fix performance issues, so I feel pretty good about my job security as long as he's around to help others. – JamieD77 Jul 10 '15 at 01:17
Performance issues? Start with the correct answer. You don't get a join can eliminate rows or cause duplicates? Keep the job you have. For the third time what about c.Primary__C = 0 that you totally ignore? – paparazzo Jul 10 '15 at 01:29
no distinct and c.Primary__C = 0 and status in ( 'Active', 'Customer', 'Current' )) .. you going to keep creating more accounts so you can keep downvoting me? – JamieD77 Jul 10 '15 at 01:49
whats funny is i wasn't criticizing your query just pointing out that having things like `AND b.STATUS IN ('Active','Customer', 'Current')` and `AND c.Primary__C = 0` in the join condition as opposed to the `WHERE` clause never added any performance boosts for me. of course the `left join` works better than `not exists` most of the time so his query ran faster.. but probably would run just as fast if those conditions were still in the `Where`. Carry on with your PMS fit though – JamieD77 Jul 10 '15 at 01:56

score 0 · Answer 6 · answered Jul 10 '15 at 02:55

You can do without hitting/joining Salesforce_Contacts more than once. This is more compact and faster:

SELECT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN  _SLX_AccountChannel b WITH(NOLOCK)
    ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
    ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
GROUP BY a.SFAccountID, a.SLXID, a.Name
HAVING MAX(c.Primary__C) = 0

Difference between IN and EXISTS is neglectable.

Not Exists vs Not In: efficiency

6 Answers6