confusing performance gain with nested select * not exists

Question

The two tables that I am querying against both have ~150 million rows each.

The following statement I terminate after it does not return for 45 minutes, so I don't know how long it would run:

select * from Cats cat  
where not exists( select dog.foo,dog.bar from Dogs dog
                  where cat.foo = dog.foo   
                  and cat.bar = dog.bar);

however this query executes in about 3 minutes:

select * from Cats outside  
   where not exists(select * from Cats cat  
                     where exists( select dog.foo,dog.bar from Dogs dog
                      where cat.foo = dog.foo   
                      and cat.bar = dog.bar)));

My question is what is going on behind the scenes that I am seeing this performance gain?

Reasoning behind returning the same result set:

The first query (slow) states to give all elements that don't exist based on the Cats table.

The second query (fast) states to give all elements that dont exist from the subset of Cats that do exist.

I expect the following query:

select dog.foo,dog.bar from Dogs dog
                          where cat.foo = dog.foo   
                          and cat.bar = dog.bar

to return [A,B,C]

this is common to both functions.

My cat table has the following: [A,B,C,D,E]

I would expect the following query:

 select * from Cats cat  
                     where exists

to return [A,B,C] and the final piece:

select * from Cats outside  
       where not exists

to return [D,E]

UPDATE

Set notation to mathematically prove my claims (please correct me if I have used the wrong symbols):

∀ Cat (Ǝ cat ≠ Ǝdog)

For all elements in Cat, return the set containing each element of cat that does not equal an element in dog

∀ Cat (Ǝ cat = Ǝdog)

For all elements in Cat, return the set containing each element of cat that does equal an element in dog

∀ Cat (Ǝ innerCat ≠ Ǝcat)

For all elements in Cat, return the set containing each element of inner cat that does not equal an element in cat

Second update

I see that my math did not line up with my SQL.

@podiluska when I do an explain plan it does not generate anything. When I run the select statement to verify the plan that is. — Woot4Moo, Aug 03 '12 at 14:12
They do return different result sets, which may explain the performance gain.... — podiluska, Aug 03 '12 at 14:15
I tried with a small result set. The first returns entries which are in cats, but not dogs. The second, nothing at all, so far as i can see. — podiluska, Aug 03 '12 at 14:19
@podiluska yes correct, my math did not line up with my SQL :( — Woot4Moo, Aug 03 '12 at 15:30

score 3 · Answer 1 · answered Aug 03 '12 at 14:24

Apparently, NOT IN and NOT EXISTS are problematic for data engines to optimize. Technically, these are referred to as anti-joins (as distinguished from equi-joins, semi-joins, non-equijoins, and so on).

When a join is hard to optimize, engines resort to nested loop joins. These are usually the worst performing kind (although in SQL Server execution plans, these often look the same, because SQL Server calls index look ups "nested loop" in the execution plan).

What is the difference between these two queries? The first only has a NOT EXISTS, so it is probably doing something inefficient. The second is doing an EXISTS on the inner most subquery. This gets optimized first, basically as a join. If the keys have indexes all is well. SQL Server could also choose hash-based or merge-based algorithms for these.

The "not exists" in the second version is based on the same table. This might give SQL Server more room for optimization.

Finally, the second version might be reducing the set of data considerably. If so, even a nested loop join on the outside might go much faster.

score 2 · Answer 2 · answered Aug 03 '12 at 14:28

2

The second query is far more optimal when executed, and this is why:

You alias the outer query's Cats table to outside, but you don't refer to outside in the clause of your where not exists. So, SQL can do the following:

find any single cat where cat.foo = dog.foo and cat.bar = dog.bar (from your innermost query)
this means that there does exist a cat that meets your outer where not exists for all cats in outside
therefore the where not exists clause is false for all rows in outside
therefore the result for the query is empty

Your first query has to re-execute the nested query for every cat in the table, and is therefore slower.

answered Aug 03 '12 at 14:28

Dan Puzey

33,626
4
73
96

I saw my issue was that my math (stated above) did not identically match my query. This helped me to see what I was doing wrong. – Woot4Moo Aug 03 '12 at 15:31
Is there a way to avoid re-executing the inner query? It would be beneficial if at most I have to do that query one time. – Woot4Moo Aug 03 '12 at 16:27

score 1 · Answer 3 · edited May 23 '17 at 12:12

1

The answer to your question would be to check the execution plans.

As a sidenote, you should try this equivalent query (also see https://stackoverflow.com/a/1069467/44522):

SELECT * FROM Cats cat LEFT OUTER JOIN Dogs dog 
    ON cat.foo = dog.foo and cat.bar = dog.bar
WHERE dog.foo IS NULL and dog.bar IS NULL

I bet it will execute way faster (assuming you got the right indexes).

edited May 23 '17 at 12:12

Community

1
1

answered Aug 03 '12 at 14:22

MicSim

26,265
16
90
133

Do you need the (and dog.bar is null) ? – podiluska Aug 03 '12 at 14:28
Yes. You could have dog.foo = NULL and dog.bar = 'xyz', which would be in the result set, if you don't explicitly exclude it. Admitted, it's a very unlikely situation if you are joining over primary/foreign keys. – MicSim Aug 03 '12 at 14:32

score 0 · Accepted Answer · answered Aug 07 '12 at 12:58

I have discovered through testing that this is the most efficient way of performing the queries in the initial question:

Select cat.foo,cat.bar from cats cat

MINUS

Select dog.foo,dog.bar from dogs dog

This works because none of my columns are nullable.

podiluska · Answer 5 · 2012-08-03T14:39:42.877

-1

They're different queries with different results. To make the second return the same as the first, it needs to be something like...

   select * from cats outside   
   where not exists(select * from Cats cat   
                     where exists( select dog.foo,dog.bar from Dogs dog 
                      where cat.foo = dog.foo    
                      and cat.bar = dog.bar)
                      and outside.foo = cat.foo
                      and outside.bar=cat.bar
                      )

edited Aug 03 '12 at 14:39

answered Aug 03 '12 at 14:17

podiluska

50,950
7
98
104

The second query checks for any matches in the dogs table - so either returns all cats, or none. The first checks for those which exist. BTW - @MicSim 's LEFT JOIN should give you what you want – podiluska Aug 03 '12 at 14:33

confusing performance gain with nested select * not exists

5 Answers5