4

The two tables that I am querying against both have ~150 million rows each.

The following statement I terminate after it does not return for 45 minutes, so I don't know how long it would run:

select * from Cats cat  
where not exists( select dog.foo,dog.bar from Dogs dog
                  where cat.foo = dog.foo   
                  and cat.bar = dog.bar);

however this query executes in about 3 minutes:

select * from Cats outside  
   where not exists(select * from Cats cat  
                     where exists( select dog.foo,dog.bar from Dogs dog
                      where cat.foo = dog.foo   
                      and cat.bar = dog.bar)));

My question is what is going on behind the scenes that I am seeing this performance gain?

Reasoning behind returning the same result set:

The first query (slow) states to give all elements that don't exist based on the Cats table.

The second query (fast) states to give all elements that dont exist from the subset of Cats that do exist.

I expect the following query:

select dog.foo,dog.bar from Dogs dog
                          where cat.foo = dog.foo   
                          and cat.bar = dog.bar  

to return [A,B,C]

this is common to both functions.

My cat table has the following: [A,B,C,D,E]

I would expect the following query:

 select * from Cats cat  
                     where exists

to return [A,B,C] and the final piece:

select * from Cats outside  
       where not exists

to return [D,E]

UPDATE

Set notation to mathematically prove my claims (please correct me if I have used the wrong symbols):

∀ Cat (Ǝ cat ≠ Ǝdog)    

For all elements in Cat, return the set containing each element of cat that does not equal an element in dog

∀ Cat (Ǝ cat = Ǝdog)   

For all elements in Cat, return the set containing each element of cat that does equal an element in dog

∀ Cat (Ǝ innerCat ≠ Ǝcat)  

For all elements in Cat, return the set containing each element of inner cat that does not equal an element in cat

Second update

I see that my math did not line up with my SQL.

Woot4Moo
  • 23,987
  • 16
  • 94
  • 151

5 Answers5

3

Apparently, NOT IN and NOT EXISTS are problematic for data engines to optimize. Technically, these are referred to as anti-joins (as distinguished from equi-joins, semi-joins, non-equijoins, and so on).

When a join is hard to optimize, engines resort to nested loop joins. These are usually the worst performing kind (although in SQL Server execution plans, these often look the same, because SQL Server calls index look ups "nested loop" in the execution plan).

What is the difference between these two queries? The first only has a NOT EXISTS, so it is probably doing something inefficient. The second is doing an EXISTS on the inner most subquery. This gets optimized first, basically as a join. If the keys have indexes all is well. SQL Server could also choose hash-based or merge-based algorithms for these.

The "not exists" in the second version is based on the same table. This might give SQL Server more room for optimization.

Finally, the second version might be reducing the set of data considerably. If so, even a nested loop join on the outside might go much faster.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

The second query is far more optimal when executed, and this is why:

You alias the outer query's Cats table to outside, but you don't refer to outside in the clause of your where not exists. So, SQL can do the following:

  • find any single cat where cat.foo = dog.foo and cat.bar = dog.bar (from your innermost query)
  • this means that there does exist a cat that meets your outer where not exists for all cats in outside
  • therefore the where not exists clause is false for all rows in outside
  • therefore the result for the query is empty

Your first query has to re-execute the nested query for every cat in the table, and is therefore slower.

Dan Puzey
  • 33,626
  • 4
  • 73
  • 96
  • I saw my issue was that my math (stated above) did not identically match my query. This helped me to see what I was doing wrong. – Woot4Moo Aug 03 '12 at 15:31
  • Is there a way to avoid re-executing the inner query? It would be beneficial if at most I have to do that query one time. – Woot4Moo Aug 03 '12 at 16:27
1

The answer to your question would be to check the execution plans.

As a sidenote, you should try this equivalent query (also see https://stackoverflow.com/a/1069467/44522):

SELECT * FROM Cats cat LEFT OUTER JOIN Dogs dog 
    ON cat.foo = dog.foo and cat.bar = dog.bar
WHERE dog.foo IS NULL and dog.bar IS NULL

I bet it will execute way faster (assuming you got the right indexes).

Community
  • 1
  • 1
MicSim
  • 26,265
  • 16
  • 90
  • 133
  • Do you need the (and dog.bar is null) ? – podiluska Aug 03 '12 at 14:28
  • Yes. You could have dog.foo = NULL and dog.bar = 'xyz', which would be in the result set, if you don't explicitly exclude it. Admitted, it's a very unlikely situation if you are joining over primary/foreign keys. – MicSim Aug 03 '12 at 14:32
0

I have discovered through testing that this is the most efficient way of performing the queries in the initial question:

Select cat.foo,cat.bar from cats cat

MINUS

Select dog.foo,dog.bar from dogs dog

This works because none of my columns are nullable.

Woot4Moo
  • 23,987
  • 16
  • 94
  • 151
-1

They're different queries with different results. To make the second return the same as the first, it needs to be something like...

   select * from cats outside   
   where not exists(select * from Cats cat   
                     where exists( select dog.foo,dog.bar from Dogs dog 
                      where cat.foo = dog.foo    
                      and cat.bar = dog.bar)
                      and outside.foo = cat.foo
                      and outside.bar=cat.bar
                      ) 
podiluska
  • 50,950
  • 7
  • 98
  • 104
  • The second query checks for any matches in the dogs table - so either returns all cats, or none. The first checks for those which exist. BTW - @MicSim 's LEFT JOIN should give you what you want – podiluska Aug 03 '12 at 14:33