JOIN two SELECT queries with COUNT and derive additional column

Question

I have a table with the following structure:

 name | version | processed | processing | updated  | ref_time 
------+---------+-----------+------------+----------+----------
 abc  |       1 | t         | f          | 27794395 | 27794160
 def  |       1 | t         | f          | 27794395 | 27793440
 ghi  |       1 | t         | f          | 27794395 | 27793440
 jkl  |       1 | f         | f          | 27794395 | 27794160
 mno  |       1 | t         | f          | 27794395 | 27793440
 pqr  |       1 | f         | t          | 27794395 | 27794160

I can use the following query to count the total number within each ref_time:

SELECT ref_time, COUNT (*) AS total
FROM (SELECT * FROM status_table) AS _
GROUP BY ref_time;

 ref_time | total 
----------+-------
 27794160 |  2259
 27793440 |  2259

And the following query to count the total number within each ref_time where processed=true:

SELECT ref_time, COUNT (*) AS processed FROM (SELECT * FROM status_table WHERE processed=true) AS _ GROUP BY ref_time;

 ref_time | processed 
----------+-----------
 27794160 |      1057
 27793440 |      2259

I then try to merge the information using an INNER JOIN on ref_time:

SELECT * FROM
(SELECT ref_time, COUNT (*) AS total
 FROM (SELECT * FROM status_table) AS _ 
 GROUP BY ref_time) result_total
INNER JOIN
(SELECT ref_time, COUNT (*) AS processed
 FROM (SELECT * FROM status_table WHERE processed=true) AS _
 GROUP BY ref_time) result_processed
ON result_total.ref_time = result_processed.ref_time;

 ref_time | total | ref_time | processed 
----------+-------+----------+-----------
 27794160 |  2259 | 27794160 |      1057
 27793440 |  2259 | 27793440 |      2259

First question: how do I avoid the duplicated ref_time column?

Second question: how do I add an additional percent column derived as (100 * processed / total) (to one d.p.), i.e. to give:

 ref_time | total | processed | percent 
----------+-------+-----------+---------
 27794160 |  2259 |      1057 |    46.8
 27793440 |  2259 |      2259 |   100.0

Third question: is there a more efficient way to do this? Can I avoid making two separate SELECT queries?

score 2 · Accepted Answer · answered Nov 05 '22 at 22:40

Postgres has expressive aggregate functions.

To do the conditional count, we can use the standard filter clause directly against column processed, which is a boolean. As for the percentage (or ratio), we can cast the boolean to an integer (which yields 0 or 1, as you would expect), and take the average of that.

So:

select ref_time, 
    count(*) cnt_total,
    count(*) filter(where processed) cnt_processed,
    avg(processed::int) ratio_processed
from mytable
group by ref_time

Here is a demo with your sample data, which returns:

ref_time	cnt_total	cnt_processed	ratio_processed
27794160	3	1	0.33333333333333333333
27793440	3	3	1.00000000000000000000

score 1 · Answer 2 · answered Nov 05 '22 at 16:50

1

SELECT ref_time,count(*)as total,
SUM
(
  CASE
   WHEN processed='t' then 1
   else 0
  END
)processed
FROM YOUR_TABLE
GROUP BY ref_time

answered Nov 05 '22 at 16:50

Sergey

4,719
1
6
11

Thanks! What about the second question, i.e. adding a derived `percent` column to the result? – drmrbrewer Nov 05 '22 at 16:55
1

The CASE Expression can be simplified to `count(*) filter (where processed)` – Nov 05 '22 at 17:46

trillion · Answer 3 · 2022-11-05T17:49:52.813

1

with main as (
    select
    ref_time,
    sum(case when processed = 'true' then 1 else 0 end ) as total_processed,
    count(*) as total
    
    from <table_name>
    group by 1
)
select *, round((total_processed::numeric / nullif(total::numeric,0))  * 100),2) as percent from main

edited Nov 05 '22 at 17:49

answered Nov 05 '22 at 17:01

trillion

1,207
1
5
15

Thanks! I get `ERROR: aggregate functions are not allowed in GROUP BY` – drmrbrewer Nov 05 '22 at 17:04
I get the same error as before :-/ – drmrbrewer Nov 05 '22 at 17:10
@drmrbrewer it should work now, i changed the issue just now :) try again – trillion Nov 05 '22 at 17:12
@drmrbrewer can you confirm if its working now ? – trillion Nov 05 '22 at 17:16
It doesn't give an error now, but the percent is `0` even when `total_processed=1057` and `total=2259` – drmrbrewer Nov 05 '22 at 17:18
@drmrbrewer can you share the code with me which you are running ? all the percent values are 0 or just one case ? – trillion Nov 05 '22 at 17:37
@drmrbrewer: try `total_processed::numeric / ....` – Nov 05 '22 at 17:47
@drmrbrewer yess thats the type issue, should be fine now – trillion Nov 05 '22 at 17:50
i added the `round()` so you don't have many decimal places, is it fine for you now ?, you can also reduce the length of code by moving everything to one select statement – trillion Nov 05 '22 at 17:52

score 1 · Answer 4 · answered Nov 10 '22 at 09:07

1

SELECT ref_time
     , count(*) AS total
     , count(*) FILTER (WHERE processed) AS processed
     , round(count(*) FILTER (WHERE processed) * 100.0 / count(*), 2) AS percent
FROM   status_table
GROUP  BY 1;

fiddle

Returning actual, rounded percentage values.

avg(processed::int), like the currently accepted answer does, is smart but the additional aggregate function plus involved casting adds cost. (The repeated count(*) is only computed once.) This is typically faster.

About the aggregate FILTER clause:

Aggregate columns with additional (distinct) filters

Aside: "name" is not a good name. Reconsider.

answered Nov 10 '22 at 09:07

Erwin Brandstetter

605,456
145
1,078
1,228

Thanks for the additional solution and insight. +1 for "name is not a good name" :-) in my actual table, this column is called `base_name` which I hope is more acceptable! Regarding "actual, rounded percentages" I was already using `ROUND(AVG(processed::int) * 100, 1) percent` based on the currently accepted answer. Regarding "the repeated `count(*)` is only computed once... of course I believe you but is it possible to demonstrate this somehow... just curious. Thanks. – drmrbrewer Nov 10 '22 at 09:49
And of course it's not just `count(*)` that is repeated, but also `count(*) FILTER (WHERE processed)`. – drmrbrewer Nov 10 '22 at 09:53
Incidentally (I'm trying to settle on a consistent convention to use for myself)... why do you not capitalise `count()` even though you capitalise other keywords like `FILTER` and `WHERE`? – drmrbrewer Nov 10 '22 at 10:06
1

To demonstrate, compare `EXPLAIN ANALYZE SELECT count(*) FROM big_table` with `EXPLAIN ANALYZE SELECT count(*), count(*), count(*), count(*), count(*), count(*) FROM big_table` and `EXPLAIN ANALYZE SELECT count(*), avg(some_column) FROM big_table`. Same applies to `count(*) FILTER (WHERE ...)`. – Erwin Brandstetter Nov 10 '22 at 10:06
1

I capitalize syntax elements and key words of SQL, but not functions (incl. aggregate and window functions). That turned out to work best for me, visually. The distinction gets a bit blurry sometimes, as some "functions" like `COALESCE()` or `CURRENT_TIMESTAMP` are somewhat in between. – Erwin Brandstetter Nov 10 '22 at 10:08
Thanks re `EXPLAIN ANALYZE`. And re capitalisation... I can see your logic, because I'm finding that by capitalising everything (except column and table names) it all looks a bit shouty and difficult to read. – drmrbrewer Nov 10 '22 at 10:17
1

Also, identifiers (including function names), unlike syntax elements, can be double-quoted, and then capitalization makes an actual difference. Unquoted function names are lower-cased by Postgres before proceeding. So my format is "natural". – Erwin Brandstetter Nov 10 '22 at 10:20
I see, it's nice to have some underlying logic, rather than just being arbitrary/stylistic. So a function name can be *considered to be* an identifier... or *actually is* an identifier? I can see that it can be double quoted (like an identifier) but is it *actually* an identifier? – drmrbrewer Nov 10 '22 at 10:54
1

A function name *is* an identifier. Related: https://stackoverflow.com/a/20880247/939860 – Erwin Brandstetter Nov 10 '22 at 11:25
Thanks... I'd already upvoted that related post of yours, so I must have stumbled across it (and found it useful) in the dim and distant past :-) – drmrbrewer Nov 10 '22 at 11:31
Regarding repeated functions only being computed once, I suppose that you can re-write the query along the following lines so that the function is not repeated as such: https://dbfiddle.uk/DblaQCZU ... is that a valid approach? Performance is probably the same, but maybe more readable to define stuff and then refer back to those? – drmrbrewer Nov 10 '22 at 15:44
Compared to: https://dbfiddle.uk/83vb0HSe – drmrbrewer Nov 10 '22 at 15:49
2

Certainly valid, especially for multiple repetitions. Or just with a simpler subquery. My query without either is typically slighly faster. – Erwin Brandstetter Nov 10 '22 at 17:09

JOIN two SELECT queries with COUNT and derive additional column

4 Answers4

Linked