4

Whenever there is any description of query in front of us, we try to apply heuristics and brainstorming to construct the query.

Is there any systematic step-by-step or mathematical way to construct SQL query from a given human-readable description?

For instance, how to determine that, whether a SQL query would need a join rather than a subquery, whether it would require a group by, whether it would require a IN clause, etc....

For example, whoever studied Digital Electronics would be aware of the methods like Karnaugh Map or Quin McClausky method. These, are some systematic approaches to simplify digital logic.

If there any method like these to analyze sql queries manually to avoid brainstorming each time?

philipxy
  • 14,867
  • 6
  • 39
  • 83
user366312
  • 16,949
  • 65
  • 235
  • 452
  • Check [Relational Algebra](https://en.wikipedia.org/wiki/Relational_algebra) – Lukasz Szozda Nov 26 '15 at 21:31
  • 1
    What format would that description have? Human readable text? – usr Nov 26 '15 at 21:32
  • 2
    There is somethings you can do in different ways. So the choose between subquery or join can do the same but depending of db or data can have different performance. – Juan Carlos Oropeza Nov 26 '15 at 21:32
  • I would say you will fail: https://technet.microsoft.com/en-us/library/aa174480%28v=sql.80%29.aspx `English Query translates a natural-language question about data in the database to a set of SQL SELECT statements that can then be executed against the SQL Server database to get the answer` – Lukasz Szozda Nov 26 '15 at 21:34
  • There have definitely been attempts at English-language interfaces to databases (for instance https://technet.microsoft.com/en-us/library/aa174480(v=sql.80).aspx and http://sqlmag.com/database-administration/amazing-english-query-tool. However, natural language processing is pretty complicated. – Gordon Linoff Nov 26 '15 at 21:49
  • 1
    The problem is English is ambiguous and the SQL must not be. Any english description of a query always requires the developer to go back to the user for clarification – Nick.Mc Nov 26 '15 at 22:26
  • 1
    I just edited a link into my answer to a system that allows queries to be expressed directly in their "human-readable" form. Per my answer, it is better to think about the "meaning" of a query rather than a "description" of it. @Nick.McDermaid Although natural language is unclear we ultimately *must* map precise formal expressions to original informal ones, so some ambiguity must remain about what even our formal expressions mean. (Although we can sure of the connections between formal statements.) But it helps to reason directly in the restricted natural language that is predicate logic. – philipxy Dec 07 '15 at 22:53
  • Reading through all these comments and answers, I haven't heard much about a data model. The same question results in a completely different SQL query when used against two different data models. – Nick.Mc Dec 08 '15 at 00:02
  • 1
    @Nick.McDermaid What do you mean, "data model"? Database schema? Suppose a "predicate" maps a row and a situation to a proposition. As a "question" a query can ask "what are the rows that make a true proposition from *query predicate*?" Two different databases, ie those that have different base table predicates & hence different rows, can give the same table in answer via different query expressions for our query predicate. Or we can think of a query as asking "What do those rows state about the situation?". Then results can be different tables with the same conjunction of row propositions. – philipxy Dec 11 '15 at 02:21
  • I think you're talking logical model and I'm talking physical model. As an example one physical model of an order system might have a 'version sequence number', requiring special SQL to pick only the current record version, then joining to another table to get the definition of that version. Another data model of an order might not contain versions so it's a straight predicate from a single table to filter it out. So the same question against two different data models produces different SQL. – Nick.Mc Dec 11 '15 at 02:44
  • ... or an old style financial data model with periods in the columns vs periods in the rows.... resultant SQL for the same quesiont is completely different. The very simple question being 'what's the balance of account X' and the resultant SQL being completely different dependent on the physical data model – Nick.Mc Dec 11 '15 at 02:47
  • I feel we are simply uncovering that the original question is not specific enough :) – Nick.Mc Dec 11 '15 at 02:48
  • @Nick.McDermaid My answer & comments apply to either. There is no need to distinguish specification of ER "logical" vs "physical" (or ANSI external vs conceptual) models. I used "database" for either. Predicates characterize relation(ship)s; tables/relations represent them. If two models overlap in application then their predicates allow asserting some of the same propositions. The SQL expresses exactly "the balance of acccount X is $B". I think the OP is looking for exactly what my answer explains. The OP & you have only to try it. Of course, this particular answer is not much of a tutorial. – philipxy Dec 11 '15 at 05:10
  • I acknowledge that there are systematic methods of constructing SQL from English. However allow me to provide a simplified example: In an older crosstab type financial data model the answer might be `SELECT OpeningBalance+Jan+Feb+Mar+Apr WHERE Year=2000 `. In a newer data model the answer might be `SELECT SUM(Amount) WHERE Period BETWEEN 200001 AND 20004`. That's much more than just a predicate change. The fact is there are ambiguous and complicated data models that keep people like me busy. – Nick.Mc Dec 11 '15 at 07:07
  • 1
    @Nick.McDermaid Again, you seem to use "data model" for "database schema". Also again, the predicates determine the model/schema tables, what goes in them & what they state. Your comment's pairs of predicates, table schemas, table values, query expressions & query values are different but the query proposition (what the table tells you) is the same (asuming a "Period" value OpeningBalance). Again, it is necessary and sufficient for updating tables & interpreting queries to have a clear natural language version of each base table predicate. (I don't mean a WHERE SQL "predicate" aka condition.) – philipxy Dec 19 '15 at 09:19

2 Answers2

3

Is there any systematic step-by-step or mathematical way to construct an SQL query from a given human-readable description?

Yes, there is.

It turns out that natural language expressions and logical expressions and relational algebra (and calculus) expressions and SQL expressions (a hybrid of the last two) correspond in a rather direct way. (What follows is for no duplicate rows & no nulls.)

Each table (base or query result) has an associated predicate--a natural language fill-in-the-(named-)blanks statement template parameterized by column names.

[liker] likes [liked]

A table holds every row that, using the row's column values to fill in the (named) blanks, makes a true statement aka proposition. Here's a table with that predicate & its rows' propositions:

liker  | liked
--------------
Bob    | Dex    /* Bob likes Dex */
Bob    | Alice  /* Bob likes Alice */
Alice  | Carol  /* Alice likes Carol */

Each proposition from filling a predicate with the values from a row in a table is true. And each proposition from filling a predicate with the values from a row not in a table is false. Here's what that table says:

/*
    Alice likes Carol
AND NOT Alice likes Alice
AND NOT Alice likes Bob
AND NOT Alice likes Dex
AND NOT Alice likes Ed
...
AND Bob likes Alice
AND Bob likes Dex
AND NOT Bob likes Bob
AND NOT Bob likes Carol
AND NOT Bob likes Ed
...
AND NOT Carol likes Alice
...
AND NOT Dex likes Alice
...
AND NOT Ed likes Alice
...
*/   

The DBA gives the predicate for each base table. The SQL syntax for a table declaration is a lot like the traditional logic shorthand for the natural language version of a given predicate. Here's a declaration of a base table to hold our value:

/* (person, liked) rows where [liker] likes [liked] */
/* (person, liked) rows where Likes(liker, liked) */
CREATE TABLE Likes (
    liker ...,
    liked ...
);

Say r is the predicate of R and s is the predicate of S.

An SQL query (sub)expression transforms argument table values to a new table value holding the rows that make a true statement from a new predicate. The new table predicate can be expressed in terms of the argument table predicate(s) according to the (sub)expression's relational/table operators. A query is an SQL expression whose predicate is the predicate for the table of rows we want.

When we give a table & (possibly implicit) alias A to be joined, the operator acts on a value & predicate like the table's but with columns renamed from C,... to A.C,.... Then

  • R , S & R CROSS JOIN S are rows where r AND s

  • R WHERE condition is rows where r AND condition

  • R INNER JOIN S ON condition is rows where r AND s AND condition

  • R LEFT JOIN S ON condition is rows where (for S-only columns S1,...)

         r AND s AND condition
     OR
             r
         AND NOT FOR SOME values for S1,... [s AND condition]
         AND S1 IS NULL AND ...
    
  • SELECT DISTINCT A.C AS D,... FROM R (maybe with implicit A. and/or implicit AS D) is rows where

    • FOR SOME values for A.*,... [A.C = D AND ... AND r] (This can be less compact but looks more like the SQL.)
    • if there are no dropped columns, r with A.C,... replaced by D,...
    • if there are dropped columns, FOR SOME values for the dropped columns [ r with A.C,... replaced by D,... ]
  • (X,...) IN (R) means

    • r with columns C,... replaced by X,...
    • (X,...) IN R

Example: Natural language for (person, liked) rows where [person] is Bob and Bob likes someone who likes [liked] but who doesn't like Ed:

/* (person, liked) rows where
FOR SOME value for x,
        [person] likes [x]
    and [x] likes [liked]
    and [person] = 'Bob'
    and not [x] likes 'Ed'
*/

Rewrite using shorthand predicates:

/* (person, liked) rows where
FOR SOME value for x,
        Likes(person, x)
    AND Likes(x, liked)
    AND person = 'Bob'
    AND NOT Likes(x, 'Ed')
*/

Rewrite using only shorthand predicates of base & aliased tables:

/* (person, liked) rows where
FOR SOME values for l1.*, l2.*,
        person = l1.liker AND liked = l2.liked
    AND Likes(l1.liker, l1.liked)
    AND Likes(l2.liker, l2.liked)
    AND l1.liked = l2.liker
    AND person = 'Bob'
    AND NOT (l1.liked, 'Ed') IN Likes
*/

Rewrite in SQL:

SELECT DISTINCT l1.liker AS person, l2.liked AS liked
    /* (l1.liker, l1.liked, l2.liker, l2.liked) rows where
        Likes(l1.liker, l1.liked)
    AND Likes(l2.liker, l2.liked)
    AND l1.liked = l2.liker
    AND l1.liker = 'Bob'
    AND NOT (l1.liked, 'Ed') IN Likes
    */
FROM Likes l1
INNER JOIN Likes l2
ON l1.liked = l2.liker
WHERE l1.liker = 'Bob'
AND NOT (l1.liked, 'Ed') IN (SELECT * FROM Likes)

Similarly,

  • R UNION CORRESPONDING S is rows where r OR s

  • R UNION S is rows where r OR s with the columns of S replaced by the columns of R

  • VALUES (X,...), ... with columns C,... is rows where C = X AND ... OR ...

Example:

/* (person) rows where
    FOR SOME value for liked, Likes(person, liked)
OR  person = 'Bob'
*/
    SELECT liker AS person
    FROM Likes
UNION
    VALUES ('Bob')

So if we express our desired rows in terms of given base table natural language statement templates that rows make true or false (to be returned or not) then we can translate to SQL queries that are nestings of logic shorthands & operators and/or table names & operators. And then the DBMS can convert totally to tables to calculate the rows making our predicate true.

See How to get matching data from another SQL table for two different columns: Inner Join and/or Union? re applying this to SQL. (Another self-join.)
See Relational algebra for banking scenario for more on natural language phrasings. (In a relational algebra context.)
See Null in Relational Algebra for another presentation of relational querying.

philipxy
  • 14,867
  • 6
  • 39
  • 83
1

Here's what I do in non-grouped queries:

I put into the FROM clause the table of which I expect to receive zero or one output row per row in the table. Often, you want something like "all customers with certain properties". Then, the customer table goes into the FROM clause.

Use joins to add columns and filter rows. Joins should not duplicate rows. A join should find zero or one rows, never more. That keeps it very intuitive because you can say that "a join adds columns and filters out some rows".

Subqueries are to be avoided if a join can replace them. Joins look nicer, are more general and often are more efficient (due to common query optimizer weaknesses).

How to use WHERE and projections is easy.

usr
  • 168,620
  • 35
  • 240
  • 369
  • For grouped questions I would first build the non-grouped one. Then, adding the grouping is rather easy because the grouping columns are those that you want to have one row per unique combination of. I think the hard part is combining the tables through from and join. The rest is ore intuitive. – usr Nov 27 '15 at 08:35