The two tables that I am querying against both have ~150 million rows each.
The following statement I terminate after it does not return for 45 minutes, so I don't know how long it would run:
select * from Cats cat
where not exists( select dog.foo,dog.bar from Dogs dog
where cat.foo = dog.foo
and cat.bar = dog.bar);
however this query executes in about 3 minutes:
select * from Cats outside
where not exists(select * from Cats cat
where exists( select dog.foo,dog.bar from Dogs dog
where cat.foo = dog.foo
and cat.bar = dog.bar)));
My question is what is going on behind the scenes that I am seeing this performance gain?
Reasoning behind returning the same result set:
The first query (slow) states to give all elements that don't exist based on the Cats table.
The second query (fast) states to give all elements that dont exist from the subset of Cats that do exist.
I expect the following query:
select dog.foo,dog.bar from Dogs dog
where cat.foo = dog.foo
and cat.bar = dog.bar
to return [A,B,C]
this is common to both functions.
My cat table has the following: [A,B,C,D,E]
I would expect the following query:
select * from Cats cat
where exists
to return [A,B,C] and the final piece:
select * from Cats outside
where not exists
to return [D,E]
UPDATE
Set notation to mathematically prove my claims (please correct me if I have used the wrong symbols):
∀ Cat (Ǝ cat ≠ Ǝdog)
For all elements in Cat, return the set containing each element of cat that does not equal an element in dog
∀ Cat (Ǝ cat = Ǝdog)
For all elements in Cat, return the set containing each element of cat that does equal an element in dog
∀ Cat (Ǝ innerCat ≠ Ǝcat)
For all elements in Cat, return the set containing each element of inner cat that does not equal an element in cat
Second update
I see that my math did not line up with my SQL.