2

This question comes from my readings of C.J Date's SQL and Relational Theory: How to Write Accurate SQL Code and looking up about joins on the internet (which includes coming across multiple posts here on NATURAL JOINs (and about SQL Server's lack of support for it))

So here is my problem...

On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).

On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).

Is the reconciliation of these that:

  • Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.

and / or

  • Good/Better table design should remove/minimise the problems that natural join creates.

?

nvogel
  • 24,981
  • 1
  • 44
  • 82
Portablejim
  • 824
  • 1
  • 12
  • 21
  • [Inner Join vs Natural Join vs USING clause: are there any advantages?](https://stackoverflow.com/a/35034568/3404097) – philipxy Dec 15 '19 at 03:11

4 Answers4

6

First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".

Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:

6.12. The Reliance on Attribute Names. Salient quote:

The operators of the relational algebra… all rely heavily on attribute naming.

3.9. Column Naming in SQL. Salient quote:

Strong recommendation: …if two columns in SQL represent "the same kind of information," give them the same name wherever possible. (That's why, for example, the two supplier number columns in the suppliers-and-parts database are both called SNO and not, say, SNO in one table and SNUM in the other.) Conversely, if two columns represent different kinds of information, it's usually a good idea to give them different names.

I'd like to address @kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above): quote:

  • For every base table, define a view identical to that base table except possibly for some column renaming.
  • Make sure the set of views so defined abides by the column naming discipline described above.
  • Operate in terms of those views instead of the underlying base tables.

Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.

Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect is currently closed as "won't fix" despite currently having a respectable +38 / -2 score has been reopened and gained a respectable 46 / -2 score (go vote for it now :)

onedaywhen
  • 55,269
  • 12
  • 100
  • 138
  • 1
    In theory, views may be good. But at least **as *implemented*** by TSQL (and I don't have enough in-depth knowledge in this area of other dialects, but maybe someone else can speak to that), views have some significant problems, particularly when it comes to joining them with other tables or views. As for how practical natural joins are in relational theory, if you have a steril environment (i.e. 100% control over your rdbms), then OK. But in a dirty (real-world) environment (i.e., one not considered by relational theory) -- you can't afford to just assume they're always a practical choice. – Chains Sep 14 '11 at 15:45
  • 1
    @卵が好き: in theory, views are relation variables and tables are relation variables and the difference between them isn't very interesting. Whether views are implemented well in a given historical version of SQL Server (or other product for that matter) isn't directly relevant to the question. What I have tried to do, however, is make my answer relevant to the book the question refers to. – onedaywhen Apr 10 '13 at 15:05
  • if you're going to use a view to ensure datatype compatibility, matching column names, write it well enough that it doesn't cause performance to suffer (indexing, etc.), and then do that each time a need arises to consume someone else's table, or otherwise some table you've never seen before, then I don't disagree with you that you *can* do that, and that it would work very well... But it seems like it takes more effort, so the question it leaves me with is: what's the payoff / what makes it worth the extra effort? I don't think natural joins are bad -- just expensive. – Chains Apr 10 '13 at 16:42
  • 1
    @卵が好き: again, my answer is relevant to the question by referring to the _book author_'s recommendation (which sounds good to me _in theory_) to use views. I urge you to consider the book's title, "SQL and Relational Theory". Performance considerations (indexing, etc) are out of scope. – onedaywhen Apr 11 '13 at 09:26
6

a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),

"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."

This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.

"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."

Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.

"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."

While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.

Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.

"Good/Better table design should remove/minimise the problems that natural join creates."

Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...

Erwin Smout
  • 18,113
  • 4
  • 33
  • 52
2

The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.

In Tutorial D syntax I can very simply write a natural join as:

R{a,b,c} JOIN S{a,c,d};

But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.

E.g. the above query can be written in SQL as:

SELECT DISTINCT a, b, c, d
FROM
(SELECT a,b,c FROM R) R
NATURAL JOIN
(SELECT a,c,d FROM S) S;

or:

SELECT DISTINCT R.a, R.b, R.c, S.d
FROM R,S
WHERE R.a = S.a AND R.c = S.c;

People will likely prefer the latter version because it is shorter and "simpler".

nvogel
  • 24,981
  • 1
  • 44
  • 82
  • We might be talking about two different things, but as I understand Natural Joins in SQL (which I think is the context the OP has in mind), column names are implicit -- i.e., you don't specify which column joins to which column. For example: `SELECT department_name, first_name||' '||last_name FROM departments NATURAL JOIN employees;` In this case, the naturally joined columns in each table will be those with identical column-names. As soon as you explicitly specify which column joins to which column (whether via JOIN clause, or WHERE clause), you're no longer using a ***Natural*** join. – Chains Sep 14 '11 at 17:28
  • So anyway, I agree with what you say about them being too verbose. Per your first select example, even though the subqueries aren't necessary, they'd probably the best way to systematically use natural joins, as they'd give you the opportunity to alias your column-names so you could control which matched, and which didn't, allowing you to control your join. But in doing that...you're actually doing more work than if you just specified the column joins explicitly and didn't use a natural join in the first place. Yeah...verbose. – Chains Sep 14 '11 at 17:39
  • 1
    A NJ is a join between two or more relations based on common column names. Of course it makes sense always to specify the column names of the relations concerned - doing that doesn't alter the fact that it's still a NJ. The question is why that isn't a technique preferred in SQL. The answer is the poor SQL syntax you have to use. It's nothing to do with "theory not being practical" and that's why I downvoted your answer. As evidence: it's perfectly practical when using a better syntax - such as the Tutorial D query I gave which is exactly equivalent to the needlessly complex SQL one. – nvogel Sep 14 '11 at 20:32
  • David, I'm not terribly familiar with Tutorial D but shouldn't your first query be `R{a,b,c} JOIN S{a,c,d};` i.e. the `JOIN` operator in place of the comma? – onedaywhen Sep 15 '11 at 07:19
  • Your point about SQL's inherent problem of projection coming after joins is a good one. The predefined order makes SQL less flexible and verbose but I'm not sure that makes it non-relational; of course, other features do, including column ordering that you mentioned. However, I see natural join as a feature than can mitigate verbosity in other cases e.g. where projection before join is not required and the equivalent `ON` clause of a non-natural join involves many columns. – onedaywhen Sep 15 '11 at 07:29
  • I prefer your first SQL query because I find it easier to read. As regards your second SQL query, I wager that most people on Stackoverflow would admonish you for not using `INNER JOIN`! – onedaywhen Sep 15 '11 at 07:33
  • 1
    Fixed the Tutorial D syntax. Thaks onedaywhen. – nvogel Sep 15 '11 at 07:55
-2

Theory versus reality...

Natural joins are not practical.
There is no such thing as a pure (i.e. practice is idetical to theory) RDBMS, as far as I know.

I think Oracle and a few others actually support support natural joins -- TSQL doesn't.

Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.

I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.

I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).

Chains
  • 12,541
  • 8
  • 45
  • 62
  • 3
    Natural join is extremely useful and practical when it gives the desired results. Using some alternative syntax to achieve the same result surely just reinforces that point. It is true that the *SQL-specific syntax* for the natural join is not very helpful but that's because of the absurd over-complexity of SELECT statement construction rather than anything fundamentally wrong with natural joins. – nvogel Sep 14 '11 at 12:30
  • @dportas -- I hear you, and in some limited / specific contexts (such as 'when it gives the desired results'), I'd agree with you. But finding some practical, efficient, best-practice, forward-looking sort of approach to determining if/when a natural join will give you the desired results in any given situation is easier said than done. If you want to use a systematic / general approach to programming sql, and you don't have 100% control over your environment (such as which datasources you use and which conventions they may or may not follow), I don't see a lot of room for natural joins. – Chains Sep 14 '11 at 15:26
  • 1
    It hasn't got anything to do with naming conventions or control over the environment. You just need to know (and specify) the set of columns you want to use from each of the joined tables. – nvogel Sep 14 '11 at 15:32
  • @dportas -- are you talking about natural JOINs, or natural KEYs? – Chains Sep 14 '11 at 15:50
  • 1
    "chances of two tables..." You talk like they simply happen independently. Do tables where you work get created without any reference to each other? If so, how will any query involving multiple tables ever be expected to return a result that makes any sense? Sure there is some coherent overall design? Assuming there is, isn't there a naming convention? If there is, either it is one which facilitates natural joins or it doesn't. If it doesn't, then ok, using them will involve a lot of column renaming and may not be worth the effort. But it's not a matter of chance, it is a matter of design. – EricS Oct 11 '13 at 22:08
  • @EricS - you may live/work in a more pure / homogeneous environment that I assume most folks in larger organizations do. Your assumption that we create our own tables is what my answer attempts to challenge -- the fact of the matter is, that is hardly ever the case. We're all consuming, more and more, *other* systems' data (*other* is the key word there -- pay attention now :-). We don't have control over *other* design, and thus, in a mixed environment which we find ourselves in more and more, natural joins work less and less. Hope that helps. – Chains Oct 14 '13 at 22:54