
My question is supported by this dbfiddle demo.

I have a table defined like so:

CREATE TABLE status_table 
(
    base_name  text      NOT NULL
    , version    smallint  NOT NULL
    , ref_time   int       NOT NULL
    , processed  bool      NOT NULL
    , processing bool      NOT NULL
    , updated    int       NOT NULL  DEFAULT (extract(epoch from now()) / 60)
    , PRIMARY KEY (base_name, version)
);

And a SELECT query as follows:

SELECT 
    ref_time
    , MAX(updated) AS max_updated
    , COUNT(*) AS total
    , COUNT(*) FILTER (WHERE processed) AS proc
    , ROUND(COUNT(*) FILTER (WHERE processed) * 100.0 / COUNT(*), 1) AS percent
    , ROUND(ROUND(COUNT(*) FILTER (WHERE processed) * 1.0 / COUNT(*), 1) * 100) AS rounded
    , COUNT(*) FILTER (WHERE processed) = COUNT(*) AS complete
    , MAX(updated) < (ROUND(EXTRACT(epoch from now()) / 60) - 200) AS settled
    , (COUNT(*) FILTER (WHERE processed) = COUNT(*)) AND (MAX(updated) < (ROUND(EXTRACT(epoch from now()) / 60) - 200)) AS ready
FROM 
    status_table
GROUP BY 
    ref_time
ORDER BY 
    ready DESC, rounded DESC, ref_time DESC

There is a lot of repetition in the above query, with certain expressions (e.g. `count(*) FILTER (WHERE processed)`) appearing in more than one place. I understand that, when this query is run, it is optimised automatically so that it doesn't matter from a performance point of view, i.e. each common expression is evaluated only once even if it appears more than once.

However, from a readability point of view, I would really like something along the following lines, with each repeated expression being defined just once and then re-used:

SELECT ref_time
  , max(updated) AS max_updated
  , count(*) AS total
  , count(*) FILTER (WHERE processed) AS proc
  , round(proc * 100.0 / total, 1) AS percent
  , round(round(proc * 1.0 / total, 1) * 100) AS rounded
  , proc = total AS complete
  , round(extract(epoch from now()) / 60) AS now_mins
  , 200 AS interval_mins
  , max_updated < now_mins - interval_mins AS settled
  , complete AND settled AS ready
FROM status_table
GROUP BY ref_time
ORDER BY ready DESC, rounded DESC, ref_time DESC

Of course, this throws an error (`column "proc" does not exist`), so it's clearly not possible. But what is the nearest equivalent of the above layout? I had a go with Common Table Expressions, defining the common expressions in one CTE and trying to refer back to them in the next, but got in a complete muddle, partly due to the presence of `GROUP BY` I think... and I soon got the feeling that it would be more verbose and less readable despite my valiant attempts.

Is there a way to achieve better readability of the query in this context?

  • NB: Your `now_mins` row is missing a close bracket (both above and also in SqlFiddle). – JohnLBevan Nov 15 '22 at 16:45
  • Thanks, I already noticed and corrected it :-) – drmrbrewer Nov 15 '22 at 16:47
  • You can use a `WITH` common table expression or a subquery to first set up your base fields, then refer to them repeatedly in your outer query. – Zegarek Nov 15 '22 at 16:56
  • @Zegarek could you give an example in an answer? – drmrbrewer Nov 15 '22 at 17:29
  • @marc_s any reason why you edited the question to change capitalisation... my understanding is that there isn't a formal syntax and plenty of people use the capitalisation as per my original post... see e.g. [this comment](https://stackoverflow.com/questions/74329588/join-two-select-queries-with-count-and-derive-additional-column#comment131320838_74386503). – drmrbrewer Nov 15 '22 at 17:32

1 Answer

SELECT *
  , round(proc * 100.0 / total, 1)            AS percent
  , round(round(proc * 1.0 / total, 1) * 100) AS rounded
  , proc = total                              AS complete
  , max_updated < now_mins - interval_mins    AS settled
  --, complete AND settled                    AS ready
FROM (
    SELECT ref_time                           AS ref_time
      , max(updated)                          AS max_updated
      , count(*)                              AS total
      , count(*) FILTER (WHERE processed)     AS proc
      , round(extract(epoch from now()) / 60) AS now_mins
      , 200                                   AS interval_mins
    FROM status_table
    GROUP BY ref_time) a
ORDER BY --ready DESC, 
  rounded DESC, ref_time DESC

Since `complete` and `settled` are defined in the outermost query, you'd have to go down one more subquery level to be able to do `complete AND settled AS ready` - it'll always end up being a tradeoff between how nested/indented your query gets vs. how repetitive it is.
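
For illustration, a sketch of what that extra level of nesting could look like (reusing the same aliases as the subquery above), if you did want `ready` computed this way:

SELECT *
  , complete AND settled                          AS ready
FROM (
    SELECT *
      , round(proc * 100.0 / total, 1)            AS percent
      , round(round(proc * 1.0 / total, 1) * 100) AS rounded
      , proc = total                              AS complete
      , max_updated < now_mins - interval_mins    AS settled
    FROM (
        SELECT ref_time                           AS ref_time
          , max(updated)                          AS max_updated
          , count(*)                              AS total
          , count(*) FILTER (WHERE processed)     AS proc
          , round(extract(epoch from now()) / 60) AS now_mins
          , 200                                   AS interval_mins
        FROM status_table
        GROUP BY ref_time) a) b
ORDER BY ready DESC, rounded DESC, ref_time DESC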

You can do the same using `WITH`:

WITH 
 base_values as 
 (  SELECT ref_time                             AS ref_time
      , max(updated)                            AS max_updated
      , count(*)                                AS total
      , count(*) FILTER (WHERE processed)       AS proc
      , round(extract(epoch from now()) / 60)   AS now_mins
      , 200                                     AS interval_mins
    FROM status_table
    GROUP BY ref_time )
,level1_values as 
 (  SELECT *
       , proc = total                           AS complete
       , max_updated < now_mins - interval_mins AS settled
    FROM base_values )
SELECT *
  , round(proc * 100.0 / total, 1)              AS percent
  , round(round(proc * 1.0 / total, 1) * 100)   AS rounded
  , complete AND settled                        AS ready
FROM level1_values
ORDER BY ready DESC, rounded DESC, ref_time DESC

Note that `*` saves a lot of repetition at the price of altering the order of columns compared to your example. A good rule of thumb is to allow it in each intermediate query, then explicitly set the desired order of columns once in the final/outermost statement - if it doesn't follow the order in which they were generated.
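
For instance (a sketch against the `WITH` version above), replacing the final `SELECT *` with an explicit column list would reproduce the column order of your original query:

SELECT ref_time
  , max_updated
  , total
  , proc
  , round(proc * 100.0 / total, 1)              AS percent
  , round(round(proc * 1.0 / total, 1) * 100)   AS rounded
  , complete
  , settled
  , complete AND settled                        AS ready
FROM level1_values
ORDER BY ready DESC, rounded DESC, ref_time DESC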

`*` is also discouraged at the bottom/deepest/base level because at some point the objects you're selecting from can be altered, which can break your `GROUP BY` clauses and introduce column name ambiguities. Above that level you've already guaranteed what `*` expands to.

Unless you use `DISTINCT ON`, `OFFSET`, `LIMIT` or another order-dependent operation somewhere in the middle, you only need an `ORDER BY` in the final query.
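
As a hypothetical illustration (not something your query needs), an `ORDER BY` does belong inside an intermediate step when something like `DISTINCT ON` depends on it:

WITH latest AS (
    SELECT DISTINCT ON (ref_time) *   -- keep only the most recently updated row per ref_time
    FROM status_table
    ORDER BY ref_time, updated DESC ) -- this ORDER BY is required for DISTINCT ON to pick that row
SELECT *
FROM latest
ORDER BY ref_time DESC                -- presentation order is still set only here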

Examples added to your demo

Zegarek
  • Thanks! Just for my own future record really, [here is an updated demo](https://dbfiddle.uk/uMV0oJsW) with columns named explicitly in the final `SELECT` to recreate the result of the original query exactly. I'm undecided whether it's more or less readable overall than the original... maybe a bit more readable, but less compact... it's a shame it's not possible to do it without a series of CTEs, i.e. like the (failing) query in my question. Thanks for the input! – drmrbrewer Nov 15 '22 at 18:01
  • @drmrbrewer It's easier to maintain: if you make a mistake, you only need to fix it once without having to find each place you copied it to and fix it there as well. It's also easier to decipher, because you don't need to read into each line to decide if it's copied and referring to the same thing, or if there's a single character difference that would make it something else. – Zegarek Nov 15 '22 at 18:12
  • @drmrbrewer As to the readability: there's always some bias, as you just witnessed when your question got edited likely because someone decided your version was less readable, "less correctly structured, indented and capitalized". You may find those subqueries and CTEs less readable simply because you didn't write them, and you'd write them slightly differently. You might find them less clear if you don't usually use CTEs and subqueries just for aliasing. That I'd say is more about individual habits, one's usual first choice of tools, tried, tested and reliable. – Zegarek Nov 15 '22 at 18:28
  • At the end of the day, however you decide to avoid repeating code, it'll always improve its readability and maintainability :) – Zegarek Nov 15 '22 at 18:30