
Standard database theory on referential integrity and inclusion dependencies alludes to a PSPACE algorithm for taking a known list of, e.g., foreign key and primary key relationships plus a candidate relationship, and determining whether the candidate relationship is logically implied by the known list. The non-deterministic version of the algorithm is given as Algorithm 9.1.6 of this book [pdf], but no explicit deterministic version is given (and I'm guessing the idea isn't to somehow manually apply Savitch's Theorem to obtain one).

A related question would be how to programmatically ask a database system a question like this:

Given the result set from Query 1 and the result set from Query 2, is there a known-in-advance chain of primary/foreign key dependencies that would allow me to join the two result sets together (possibly requiring additional columns drawn from the derived chain of key relationships)?

I'm thinking specifically in the context of Postgres, MSSQL, or MySQL, how to actually ask the database for this information programmatically.

I can query the sp_help and sp_fkeys stored procedures, but it's unclear how I can provide a set of columns from a given table and a set of columns from a target table, and directly ask the database whether it can derive, from the output of things like sp_fkeys, a way to join the tables to each other.
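To make the kind of programmatic access I mean concrete, here is a rough sketch (assuming single-column foreign keys) that pulls the foreign key relationships out of Postgres via information_schema using psycopg2; on MSSQL the output of sp_fkeys could be consumed the same way. The name fetch_fk_edges is just illustrative.

import psycopg2

# List single-column foreign key relationships from information_schema.
# Multi-column foreign keys would additionally need
# key_column_usage.ordinal_position matched up between the two sides.
FK_SQL = """
select
    kcu.table_name  as fk_table,
    kcu.column_name as fk_column,
    ccu.table_name  as pk_table,
    ccu.column_name as pk_column
from information_schema.table_constraints tc
join information_schema.key_column_usage kcu
    on kcu.constraint_name = tc.constraint_name
   and kcu.table_schema    = tc.table_schema
join information_schema.constraint_column_usage ccu
    on ccu.constraint_name = tc.constraint_name
   and ccu.table_schema    = tc.table_schema
where tc.constraint_type = 'FOREIGN KEY'
"""

def fetch_fk_edges(dsn):
    """Return foreign keys as (fk_table, fk_column, pk_table, pk_column) tuples."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_SQL)
        return cur.fetchall()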

I'm more than willing (happy, even) to write code to do this myself if there are known algorithms for it (given the theorem mentioned above, the algorithm at least exists mathematically, even if it isn't built into the database systems directly).

My application is to ingest an assortment of tables with poorly maintained schemas, and do some metaprogramming during a schema discovery phase, so that there does not need to be a human-in-the-loop to try to manually search for chains of key relationships that would imply the way two given tables can be joined.

Suppose my database is just these three tables:

TableA

| a_key1  | a_key2  | a_value |
+---------+---------+---------+
|       1 |   'foo' |     300 |
|       2 |   'bar' |     400 |

TableB

| b_key1  | b_key2     | b_value |
+---------+------------+---------+
|   'foo' | 2012-12-01 |   'Bob' |
|   'bar' | 2012-12-02 |   'Joe' |


TableC

| c_key1     | c_key2  | c_value |
+------------+---------+---------+
| 2012-12-01 |     100 |     3.4 |
| 2012-12-02 |     200 |     2.7 |

And suppose that b_key1 is a foreign key into a_key2, that c_key1 is a foreign key into b_key2, and that the "key" columns of each table, taken together, form that table's primary key. It's not important whether the tables are normalized well (so, for instance, the fact that b_key1 alone can identify a row in TableA even though TableA ostensibly has a 2-tuple for its key is unimportant; this happens often with tables you inherit in the wild).

Then in this case, just by knowing the key relationships in advance, we know the following join is "supported" (in the sense that the data types and value ranges must make sense under the system of foreign keys):

select A.*, C.c_value
from TableA A
join TableB B
    on A.a_key2 = B.b_key1
join TableC C
    on B.b_key2 = C.c_key1

So if a person came to the database with two separate queries:

select *
from TableA

select c_key1, c_value
from TableC

then the database system ought to be able to deduce the joins required for the query that brings these two into alignment (as implied by foreign keys, not other arbitrary ways of making them into alignment).

My question is: does this search functionality to produce a series of required joins exist already in the major SQL-compliant RDBMSs?

One can imagine following the idea discussed in this answer, sucking that metadata into some other programming language, deconstructing it into a graph of the key relationships, and then using a known graph path-finding algorithm to solve the problem. But any homebrewed implementation of this will be fraught with errors, and it's the kind of problem where the algorithms can easily be written so as to be exponentially slow. By asking whether this exists in a database system already, I'm just trying to leverage a smart, database-aware implementation if that work has already been done.
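As a concrete illustration of that homebrewed approach (and only a sketch), suppose the foreign keys have already been pulled into a list of (fk_table, fk_column, pk_table, pk_column) tuples, e.g. via the information_schema query above; the table and column names below are the ones from the example, and join_path is just an illustrative name:

from collections import deque

# Hypothetical edge list pulled from the catalog for the example schema.
fk_edges = [
    ("tableb", "b_key1", "tablea", "a_key2"),
    ("tablec", "c_key1", "tableb", "b_key2"),
]

def join_path(start, goal, edges):
    """Breadth-first search over the FK graph, treating each foreign key as an
    undirected edge; returns hops as (left_table, left_col, right_table, right_col)."""
    adj = {}
    for ft, fc, pt, pc in edges:
        adj.setdefault(ft, []).append((fc, pt, pc))
        adj.setdefault(pt, []).append((pc, ft, fc))  # allow joining in either direction
    queue, seen = deque([(start, [])]), {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for col, nxt, nxt_col in adj.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(table, col, nxt, nxt_col)]))
    return None  # no chain of foreign keys connects the two tables

print(join_path("tablea", "tablec", fk_edges))
# [('tablea', 'a_key2', 'tableb', 'b_key1'), ('tableb', 'b_key2', 'tablec', 'c_key1')]

Each hop in the returned path corresponds to one "on left_column = right_column" clause, so the three-table join above can be reassembled mechanically; note that this says nothing about whether the discovered path is the only one.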

ely
  • It's very hard to understand exactly what you're trying to accomplish I'm afraid, and there were enough terms I'd never seen before on the first page of chapter 9 in the linked PDF (including "inclusion dependency") that I know I can't spare the time to get up to speed on them all. Maybe an example input and desired output would help, though. – j_random_hacker May 17 '14 at 00:19
  • I added some additional comments to try to make an example of what I am trying to do. – ely May 17 '14 at 11:45
  • I don't think this is something "the database" should do but rather the client tool where you *write* the SQL statements. –  May 17 '14 at 12:10
  • Why? It's one specific form of exact inference on the known set of foreign keys. Why not go even further with your claim and say that the programmer should have to write a user defined function to perform all join-like operations? What constitutes the line you draw between what the database "should" provide? Also, this is a question about what it *does* provide, not whether it should. – ely May 17 '14 at 14:16
  • I don't see what in your example would correspond to the "candidate relationship", but IIUC that's for a related question, not the question you're actually asking. I'm pretty sure PostgreSQL does not offer any such automatic capability, though an add-on package might. In any case, it seems what you need to do is find a subtree in the graph having V = {all tables} and E = {(u,v)| there is a FK relationship from u to v or v to u} that connects the set of tables you are interested in. The shortest such is a Steiner tree. Different trees will in general lead to different result sets though! – j_random_hacker May 17 '14 at 14:56
  • The two queries at the bottom could be the candidate -- if you know the tuple of columns that will result from query 1 and the tuple of columns from query 2, those two tuples (with metadata about where they came from via the query) constitute a question "Can these column sets be joined together in one kind of way -- a way that is dependent on known dependencies?" I'd even settle for restricting to just the subset of each tuple that comes from a table's primary keys and any foreign keys. I agree with you that this is going to be a graph traversing problem which could have multiple solutions. – ely May 18 '14 at 15:20
  • I hadn't realised that you want to consider only FKs that can be constructed from the columns that appear in the 2 queries. But it prompts the question: if there is a FK to a 3rd table, X, among the columns of query 1 (say), what columns of X are then made available for further joins? The "consistent" answer would seem to be "only the target columns of the FK", but then it would never be advantageous to join to an intermediate table, because the pool of columns would never expand. P.S. Write @j_random_hacker in comments, or I don't get notified. – j_random_hacker May 19 '14 at 02:40
  • If the answer is instead "All X's columns become available for future joins", then it potentially becomes useful to join to intermediate tables. I'll write an answer under this assumption. – j_random_hacker May 19 '14 at 02:47

3 Answers


Let the set of input queries be Q, and assume that each of these queries selects a subset of columns from a single table. I'll assume that, when joining from a table t in Q to a table outside Q, we want to consider only those FKs that can be formed from the columns actually present in the query for t; but that when joining from a table outside Q to another table outside Q, we can use any FK defined between these two tables.

In this case I would suggest that what you really want to know is whether there is a unique set of joins that will generate the combined column set of all input queries. This can be tested efficiently (a rough sketch in code follows the steps):

  1. Form a graph G having V = {all tables} and E = {(u,v) | there is a FK relationship from u to v or v to u}, with the added restriction that edges incident on tables in Q can only use columns present in the corresponding input queries.
  2. Build a spanning forest F on G.
  3. If any pair of tables in Q appear in different components of F then we know that there is no query that satisfies our goal, and we can stop.
  4. Otherwise, there is at least 1 such query, involving some or all of the tables in the unique component X containing all tables in Q.
  5. To get rid of pointless joins, delete all edges in X that are not on a path from some table in Q to some other table in Q (e.g. by repeatedly "nibbling away" all pendant edges to some table outside of Q, until this can no longer be done).
  6. For each edge e that remains in this new, smaller tree X:
    1. Create a copy X_e of the induced subgraph of V(X) in G, and delete the edge e from X_e.
    2. Create a spanning forest F_e from X_e.
    3. If F_e consists of a single component, this can only mean that there is a second (i.e. distinct) way of joining the tables in Q that avoids the edge e, so we can stop.
  7. If we get to this point, then every edge e in X is necessary, and we can conclude that the set of joins implied by X is the unique way to produce the set of columns from the input query set Q.
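A rough sketch of this procedure in Python, leaning on networkx only for the spanning-tree and connectivity bookkeeping: it assumes the FK graph has already been flattened to table-name pairs, with the column restriction for tables in Q applied while building that edge list, and unique_join_tree is an illustrative name rather than anything a database ships with.

import networkx as nx

def unique_join_tree(tables, fk_edges, query_tables):
    """fk_edges: (table_u, table_v) pairs, one per usable FK relationship.
    query_tables: the tables appearing in the input query set Q.
    Returns the list of join edges if it is unique, otherwise None."""
    q = set(query_tables)
    G = nx.Graph()
    G.add_nodes_from(tables)
    G.add_edges_from(fk_edges)

    # Steps 2-3: every table in Q must land in the same connected
    # component, otherwise no joining query exists.
    comp = next((c for c in nx.connected_components(G) if q <= c), None)
    if comp is None:
        return None

    # Step 4: X = a spanning tree of the component containing Q.
    X = nx.minimum_spanning_tree(G.subgraph(comp))

    # Step 5: nibble away pendant tables outside Q; joining to them is pointless.
    changed = True
    while changed:
        changed = False
        for n in list(X.nodes):
            if n not in q and X.degree(n) <= 1:
                X.remove_node(n)
                changed = True

    # Step 6: for each remaining tree edge e, drop it from the induced
    # subgraph of V(X) in G; if everything is still connected, a second,
    # distinct way of joining the tables in Q exists.
    induced = nx.Graph(G.subgraph(X.nodes))
    for u, v in list(X.edges):
        H = nx.Graph(induced)
        H.remove_edge(u, v)
        if nx.is_connected(H):
            return None

    # Step 7: every edge in X is necessary, so X is the unique join set.
    return list(X.edges)

# e.g. for the example schema:
# unique_join_tree({"tablea", "tableb", "tablec"},
#                  [("tableb", "tablea"), ("tablec", "tableb")],
#                  {"tablea", "tablec"})
# returns both FK edges, i.e. the unique join chain TableA-TableB-TableC.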
j_random_hacker

Any two tables can be joined. (The predicate of a join is the conjunction of its operands' predicates.)

So a lot of what you wrote doesn't make sense. I suggest you seek a characterization of your problem that doesn't involve whether or how ("the way") two tables can be joined. Maybe: whether there are likely constraints on expressions that are not implied by given constraints on operands, and/or what constraints definitely hold on expressions given constraints on operands. (Including likely/inferred FDs, MVDs, JDs and candidate keys of expressions, or inclusion dependencies or FKs between expressions.) (NB: you don't need to restrict your expressions to joins. E.g. maybe one table subtypes another.)

philipxy
  • Whether two things have a logical relationship supporting a join dependence that can be deduced from other known dependencies is very different than merely matching values of records. They are different concepts. I think you are thinking of "join" just in the sense of matching values of records, which surely can be done whenever values (or functions of values) are compatible across the tables. That's unrelated to my question. – ely May 17 '14 at 11:30
  • Your own reference's section 8.1, the introduction to chapters 8 & 9 on dependencies, explains that they are relevant to design, integrity and optimization; not query meanings. Hence your unclear 'join is "supported"' and 'joins bringing into alignment', 'joins required'; the former with unclearness-acknowledging scare quotes, and all followed by unclear parenthetical qualifications. One can always join, with meaning the AND of operands' meanings. Your goal apparently involves inferring dependencies/constraints, but your couching in terms of joinablility obscures your question. – philipxy May 18 '14 at 01:56
  • Thank you for your opinion. While I disagree, I can appreciate the general idea of dependencies from chapters 8 and 9 (which I read thoroughly before asking this question). Yes, dependencies are not joins. Yet in real life you often want to join via subsets of columns that share known dependencies -- so it is very practical. As someone who has used SQL almost every day for years to solve business problems, I am very sure that this describes legitimate, extremely common use cases. – ely May 18 '14 at 15:16
  • If you change 'relationship' to 'constraint' (qv [link](http://stackoverflow.com/a/23690085/3404097)), refer rigourously to (candidate- and foreign-) key-constraints and (FD-,JD- and IND-) dependency-constraints and remove all references to joinability then we (you included) might see your goal. (Perhaps simultaneously imagine in prolog/datalog syntax to avoid SQL-associated misconceptions.) – philipxy May 18 '14 at 21:24
  • Thank you for suggesting these ways in which my use of the domain vocabulary might have failed to meet certain useful levels of rigour or guidelines. – ely May 19 '14 at 14:30

It seems to me like you're asking about the mechanics of (/theoretical rules underpinning) the inference of inclusion dependency satisfaction between the results of arbitrary queries.

In the years I've spent studying everything I could find about relational theory, I've never seen anything matching that description, so if my understanding of the question is correct, my guess is that your chances are slim.

Erwin Smout