In standard database theory on referential integrity and inclusion dependencies, it is noted that there is a PSPACE algorithm that takes a known list of, e.g., foreign key and primary key relationships plus a candidate relationship, and decides whether the candidate is logically implied by the known list. The non-deterministic version of the algorithm is given as Algorithm 9.1.6 of this book [pdf], but no explicit deterministic version is given (and I'm guessing the idea isn't to somehow manually apply Savitch's Theorem to derive one).
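(If I'm reading the theory right, the PSPACE bound is for general inclusion dependencies; in the special case of *unary* inclusion dependencies — which is what a chain of single-column foreign keys gives you — implication is axiomatized by just reflexivity and transitivity, so deciding it reduces to graph reachability. A minimal sketch under that assumption, with hypothetical `Table.column` names:)

```python
from collections import defaultdict, deque

def implied(known, candidate):
    """Decide whether a candidate unary inclusion dependency (src, dst),
    meaning "values of src are contained in dst", follows from a list of
    known unary INDs.  For the unary case this is just reachability:
    reflexivity + transitivity over the edge set."""
    src, dst = candidate
    if src == dst:               # reflexivity: X <= X always holds
        return True
    graph = defaultdict(set)
    for a, b in known:           # each known IND a <= b is a directed edge
        graph[a].add(b)
    seen, queue = {src}, deque([src])
    while queue:                 # BFS: transitivity = path existence
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

# Hypothetical chain: C.c1 is contained in B.b1, which is contained in A.a1.
known = [("C.c1", "B.b1"), ("B.b1", "A.a1")]
print(implied(known, ("C.c1", "A.a1")))  # True  (follows by transitivity)
print(implied(known, ("A.a1", "C.c1")))  # False (inclusion is not symmetric)
```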
A related question is how to programmatically ask a database system a question like this:
Given the result set from Query 1 and the result set from Query 2, is there a known-in-advance chain of primary/foreign key dependencies that would allow me to join the two result sets together (possibly requiring additional columns that come from the derived chain of key relationships)?
I'm thinking specifically in the context of Postgres, MSSQL, or MySQL, how to actually ask the database for this information programmatically.
I can query the `sp_help` and `sp_fkeys` stored procedures, but it's unclear how I can provide a set of columns from a given table and a set of columns from a target table, and directly ask the database whether it can derive, from the outputs of things like `sp_fkeys`, a way to join the tables to each other.
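(For what it's worth, every system will at least hand you the raw edge list of (table, column) → (table, column) foreign key references. The sketch below uses SQLite's `PRAGMA foreign_key_list` purely as an in-process stand-in — on Postgres the same edges come from `information_schema.table_constraints` joined to `key_column_usage`, and on MSSQL from `sp_fkeys`; the DDL here just mirrors the three-table example further down.)

```python
import sqlite3

# In-memory stand-in schema mirroring the TableA/TableB/TableC example.
# (UNIQUE on the referenced columns is needed because the FKs point at
# non-primary-key columns.)
ddl = """
CREATE TABLE TableA (a_key1 INT, a_key2 TEXT UNIQUE, a_value INT,
                     PRIMARY KEY (a_key1, a_key2));
CREATE TABLE TableB (b_key1 TEXT REFERENCES TableA(a_key2),
                     b_key2 TEXT UNIQUE, b_value TEXT,
                     PRIMARY KEY (b_key1, b_key2));
CREATE TABLE TableC (c_key1 TEXT REFERENCES TableB(b_key2),
                     c_key2 INT, c_value REAL,
                     PRIMARY KEY (c_key1, c_key2));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# Collect (from_table, from_col) -> (to_table, to_col) foreign key edges.
edges = []
for (table,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    # PRAGMA foreign_key_list row layout: id, seq, table, from, to, ...
    for row in conn.execute(f"PRAGMA foreign_key_list({table})"):
        edges.append(((table, row[3]), (row[2], row[4])))

print(edges)
# -> [(('TableB', 'b_key1'), ('TableA', 'a_key2')),
#     (('TableC', 'c_key1'), ('TableB', 'b_key2'))]
```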
I'm more than willing (happy even) to write code to do this myself if there are known algorithms for it (based on the theorem mentioned above, at least the math algorithm exists even if it's not part of the database systems directly).
My application is to ingest an assortment of tables with poorly maintained schemas, and do some metaprogramming during a schema discovery phase, so that there does not need to be a human-in-the-loop to try to manually search for chains of key relationships that would imply the way two given tables can be joined.
Suppose my database is just these three tables:
TableA
| a_key1 | a_key2 | a_value |
+--------+--------+---------+
| 1      | 'foo'  | 300     |
| 2      | 'bar'  | 400     |
TableB
| b_key1 | b_key2     | b_value |
+--------+------------+---------+
| 'foo'  | 2012-12-01 | 'Bob'   |
| 'bar'  | 2012-12-02 | 'Joe'   |
TableC
| c_key1     | c_key2 | c_value |
+------------+--------+---------+
| 2012-12-01 | 100    | 3.4     |
| 2012-12-02 | 200    | 2.7     |
And suppose that `b_key1` is a foreign key into `a_key2`, that `c_key1` is a foreign key into `b_key2`, and that the "key" columns of each table, taken together, form that table's primary key. It's not important whether the tables are well normalized (so, for instance, the fact that `b_key1` can identify a row in `TableA` even though `TableA` ostensibly has a 2-tuple for its key -- this is unimportant and happens often with tables you inherit in the wild).
Then in this case, just by knowing the key relationships in advance, we know the following join is "supported" (in the sense that the data types and value ranges must make sense under the system of foreign keys):
select A.*, C.c_value
from TableA A
join TableB B
on A.a_key2 = B.b_key1
join TableC C
on B.b_key2 = C.c_key1
So if a person came to the database with two separate queries:
select *
from TableA
select c_key1, c_value
from TableC
then the database system ought to be able to deduce the joins required to bring these two into alignment (as implied by the foreign keys, not by other arbitrary ways of aligning them).
My question is: does this search functionality to produce a series of required joins exist already in the major SQL-compliant RDBMSs?
One can imagine following the idea discussed in this answer, pulling that metadata into some other programming language, deconstructing it into a graph of the key relationships, and then using a known graph path-finding algorithm to solve the problem. But any homebrewed implementation of this will be fraught with errors, and it's the kind of problem where a naive algorithm can easily be exponentially slow. By asking whether this already exists in a database system, I'm just trying to leverage a smart, database-aware implementation if that work has already been done.
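(Absent a built-in, the homebrew core is at least small. Treating each table as a node and each foreign key as an undirected edge — undirected because a join can traverse a key in either direction — a breadth-first search between the two queries' base tables yields a shortest join chain. A sketch, hard-coding the key relationships from the example schema above; real schemas with composite keys or multiple distinct paths would need more care:)

```python
from collections import deque

# Foreign-key edges from the example:
#   TableB.b_key1 -> TableA.a_key2,  TableC.c_key1 -> TableB.b_key2.
fk_edges = [
    (("TableB", "b_key1"), ("TableA", "a_key2")),
    (("TableC", "c_key1"), ("TableB", "b_key2")),
]

def join_path(start, goal):
    """BFS over tables; returns the list of key links
    (table1, col1, table2, col2) needed to join start to goal,
    or None if no chain of keys connects them."""
    adjacency = {}
    for (t1, c1), (t2, c2) in fk_edges:
        # Store both directions: a join can follow a FK either way.
        adjacency.setdefault(t1, []).append((t2, c1, c2))
        adjacency.setdefault(t2, []).append((t1, c2, c1))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for nxt, col_here, col_there in adjacency.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(table, col_here, nxt, col_there)]))
    return None

path = join_path("TableA", "TableC")
for t1, c1, t2, c2 in path:
    print(f"join {t2} on {t1}.{c1} = {t2}.{c2}")
# join TableB on TableA.a_key2 = TableB.b_key1
# join TableC on TableB.b_key2 = TableC.c_key1
```

which reproduces exactly the join chain written out by hand above.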