SQL Query optimization - distributive law of natural join and difference

Question

I am interested whether the following equivalence holds:

NaturalJoin (R,S-T) equivalence Difference(NaturalJoin(R,S),NaturalJoin(R,T))

If so, could you give a reason for the equivalence? And if you know what query could be more optimal in a sense of runtime, that would be really helpful.

P.S. I wanted to use LATEX but am fairly new to stackoverflow and I cannot seem to get my head around how to use it here--markup in math.stackexchange would be just \[...\].

I don't believe these are equivalent. You should also avoid `natural join`, because the keys are not explicitly defined -- and that can cause problems. — Gordon Linoff, Jul 15 '17 at 20:26
Well at first I thought they are not equivalent too but I wrote down two examples and they worked. Do you have an idea where it could break, or may it be that there are no explicitly defined keys as a potential reason?! — Abbraxas, Jul 15 '17 at 20:32

philipxy · Accepted Answer · 2017-07-16T23:32:24.957

NaturalJoin (R,S-T) equivalence Difference(NaturalJoin(R,S),NaturalJoin(R,T))

The general way to deal with this is to replace operator calls by their definitions.

Here is an outline assuming certain equivalences between relation expressions and the tuples they hold. One actually needs to use the equivalences to justify that one's queries return the tuples that one has been asked to get, but this is not usually explained. (One instead learns via a lot of examples and handwaving.)

S & T have the same set of attributes.
X holds rows (...) where X(...), ie (...) IN X.
NATURALJOIN(X,Y) holds rows where X(...) AND Y(...).
DIFFERENCE(X,Y) holds rows where X(...) AND NOT Y(...).

The left holds rows where:

R(...) AND (S(...) AND NOT T(...))
R(...) AND S(...) AND NOT T(...)

The right holds rows where:

(R(...) AND S(...)) AND NOT (R(...) AND T(...))
(R(...) AND S(...)) AND (NOT R(...) OR NOT T(...))
((R(...) AND S(...)) AND NOT R(...)) OR ((R(...) AND S(...)) AND NOT T(...))
(R(...) AND S(...) AND NOT R(...)) OR (R(...) AND S(...) AND NOT T(...))
R(...) AND S(...) AND NOT T(...)

So they are equivalent.

You can convert this to a proof by replacing X(...) by x IN X and using appropriate quantifications (FORALL & FORSOME/EXISTS) and set comprehensions ({variable|wff}).

Re using natural join in reasoning & SQL see this answer and its links.

And if you know what query could be more optimal in a sense of runtime, that would be really helpful.

That depends on your DMBS and its query implementation/optimization. There is no "optimal" without an execution model, cost/benefit function and that function's input arguments. Moreover "optimal" is chaotic--a tiny change in relational & physical DDL, database contents & statistics, query DML, query & update patterns and DBMS implementation can give completely different tradeoffs.

Thank you! The company I am working for uses Microsoft SQL Server and I use postgreSQL privately. I would be interested in the efficency on Microsoft's DBMS. Do you have any idea where one could find information on that? And again, thank you for your answer. — Abbraxas, Jul 16 '17 at 19:59
Database optimization/performance/efficiency/etc is a theoretical-empirical lifetime-learning topic. Google for entire books. You can begin by reading about those topics in the documentation for various implementations of various DBMSs. How about [Wikipedia](https://en.wikipedia.org/wiki/Query_optimization)? "Best" is chaotic--the tiniest change in requirements, relational & implementation DDL, DML & DBMS implementation can give completely different tradeoffs. PS Even before every comment, google. Big world out there. — philipxy, Jul 16 '17 at 22:49

SQL Query optimization - distributive law of natural join and difference

1 Answers1