GROUP BY and COUNT in PostgreSQL

Question

The query:

SELECT COUNT(*) as count_all, 
       posts.id as post_id 
FROM posts 
  INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id;

Returns n records in PostgreSQL:

 count_all | post_id
-----------+---------
 1         | 6
 3         | 4
 3         | 5
 3         | 1
 1         | 9
 1         | 10
(6 rows)

I just want to retrieve the number of records returned: 6.

I used a subquery to achieve what I want, but this doesn't seem optimal:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id
) as x;

How would I get the number of records in this context right in PostgreSQL?

This would seem like an operation so common there would be an easier way. — skinkelynet, Aug 04 '12 at 09:25

Steve Jorgensen · Accepted Answer · 2012-08-04T20:43:06.477

80

I think you just need COUNT(DISTINCT post_id) FROM votes.

See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.

EDIT: Corrected my careless mistake per Erwin's comment.

edited Aug 04 '12 at 20:43

answered Aug 04 '12 at 09:25

Steve Jorgensen


1

PG::Error: ERROR: column "posts.id" must appear in the GROUP BY clause or be used in an aggregate function – skinkelynet Aug 04 '12 at 14:53
2

@skinkelynet: that's because the answer is subtly wrong - it has to be `FROM votes`. I added the correct form to my answer. – Erwin Brandstetter Aug 04 '12 at 15:03
@LostCrotchet It turns out you can do that in PostgreSQL. You need to put the list of fields in parentheses, so for example… `SELECT COUNT(DISTINCT (firstname, lastname)) FROM people`. – Steve Jorgensen Oct 14 '20 at 22:23

Erwin Brandstetter · Answer 2 · 2019-09-03T12:36:04.283

There is also EXISTS:

SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);

In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):

SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;

The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.

count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.
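To compare the two plans on your own data, each query can be prefixed with EXPLAIN ANALYZE. The following is only a sketch that reuses the table and column names from the question; the plans chosen and the timings will depend on data distribution, available indexes, and Postgres version.

```sql
-- Semi-join variant: can stop probing votes at the first match per post.
-- (An empty SELECT list needs Postgres 9.4 or later; use SELECT 1 on older versions.)
EXPLAIN ANALYZE
SELECT count(*) AS post_ct
FROM   posts p
WHERE  EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);

-- Join + DISTINCT variant: joins every vote row, then de-duplicates.
EXPLAIN ANALYZE
SELECT count(DISTINCT p.id) AS post_ct
FROM   posts p
JOIN   votes v ON v.post_id = p.id;
```

EXPLAIN ANALYZE actually executes the statement, so the reported runtime is a real measurement. Run each query several times and compare the best results, since caching tends to distort the first run.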

If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:

SELECT count(DISTINCT post_id) AS post_ct
FROM   votes;

It may actually be faster than the EXISTS query when there are no or few votes per post.
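Whether the short form is safe depends on the referential integrity guarantee. If no foreign key is declared, one way to check is to look for votes whose post_id has no matching post. A minimal sketch, assuming the posts and votes tables from the question:

```sql
-- Count votes that reference a non-existent post.
-- Rows with a NULL post_id are skipped on purpose: count(DISTINCT post_id)
-- ignores NULLs and the join form drops them too, so they cannot cause a mismatch.
SELECT count(*) AS orphaned_votes
FROM   votes v
WHERE  v.post_id IS NOT NULL
AND    NOT EXISTS (SELECT FROM posts p WHERE p.id = v.post_id);
```

If this returns 0, the short and long forms give the same count. Otherwise the short form can over-count, because it also includes the post_id values of orphaned votes.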

The query you had works in simpler form, too:

SELECT count(*) AS post_ct
FROM  (
    SELECT FROM posts 
    JOIN   votes ON votes.post_id = posts.id 
    GROUP  BY posts.id
    ) sub;

Benchmark

To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:

Test setup

Fake a typical post / vote situation:

CREATE SCHEMA y;
SET search_path = y;

CREATE TABLE posts (
  id   int PRIMARY KEY
, post text
);

INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int)  -- random text
FROM   generate_series(1,10000) g;

DELETE FROM posts WHERE random() > 0.9;  -- create ~ 10 % dead tuples

CREATE TABLE votes (
  vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);

INSERT INTO votes (post_id, up_down)
SELECT g.* 
FROM  (
   SELECT ((random()* 21)^3)::int + 1111 AS post_id  -- uneven distribution
        , random()::int::bool AS up_down
   FROM   generate_series(1,70000)
   ) g
JOIN   posts p ON p.id = g.post_id;

All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE on Postgres 9.1.4, took the best of five runs for each of the three queries, and list the resulting total runtimes below, one per test:

1. As is.
2. After `ANALYZE posts; ANALYZE votes;`
3. After `CREATE INDEX foo ON votes(post_id);`
4. After `VACUUM FULL ANALYZE posts; CLUSTER votes USING foo;`

`count(*) ... WHERE EXISTS`

253 ms
220 ms
85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
85 ms

`count(DISTINCT x)` - long form with join

354 ms
358 ms
373 ms -- (index scan on posts, index scan on votes, merge join)
330 ms

`count(DISTINCT x)` - short form without join

164 ms
164 ms
164 ms -- (always seq scan)
142 ms

Best time for original query in question:

353 ms

For simplified version:

348 ms

@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:

366 ms

Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.

Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):

Select first row in each GROUP BY group?

@a_horse_with_no_name: "more portable" was nonsense, really. Removed that bit, thanks for pointing out. I was under the wrong impression that SQLite would not support `DISTINCT` in aggregate functions. [Turns out, it does](http://www.sqlite.org/lang_aggfunc.html) - just as all other major RDBMS. As compensation (and because I wanted to clarify that for myself) I elaborate on the performance angle with a benchmark. — Erwin Brandstetter, Aug 04 '12 at 18:57
If I read correctly, you missed my CTE-version. It should be equivalent to a subquery, though. — wildplasser, Aug 05 '12 at 00:23
@wildplasser: Sorry, recreated the scenario (not identical, but close as can be seen from the setup) and added the result for the CTE version. As expected, a CTE doesn't help performance here. — Erwin Brandstetter, Aug 05 '12 at 00:52

score 13 · Answer 3 · answered Dec 15 '16 at 07:48

Using OVER() and LIMIT 1:

SELECT COUNT(1) OVER()
FROM posts 
   INNER JOIN votes ON votes.post_id = posts.id 
GROUP BY posts.id
LIMIT 1;
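This works because window functions are evaluated after GROUP BY and HAVING, so the count runs over the grouped rows rather than over the underlying votes. The sketch below adds a HAVING filter to illustrate; the threshold of 1 is only an example.

```sql
-- Number of posts that have more than one vote.
-- count(*) OVER () sees one row per surviving group; LIMIT 1 keeps a single copy.
SELECT count(*) OVER () AS post_ct
FROM   posts
JOIN   votes ON votes.post_id = posts.id
GROUP  BY posts.id
HAVING count(*) > 1
LIMIT  1;
```

With the sample output shown in the question (vote counts 1, 3, 3, 3, 1, 1), this would return 3. Note that, unlike the subquery form, it returns no row at all rather than 0 when no group qualifies.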

Nick


1

This is what worked for my case since I wanted to filter out things with a `HAVING SUM(..) > 5` clause that summed values across rows. – stefansundin Oct 19 '21 at 21:47

score 2 · Answer 4 · answered Aug 04 '12 at 15:13

WITH uniq AS (
        SELECT DISTINCT posts.id as post_id
        FROM posts
        JOIN votes ON votes.post_id = posts.id
        -- GROUP BY not needed anymore
        -- GROUP BY posts.id
        )
SELECT COUNT(*)
FROM uniq;

wildplasser


score 1 · Answer 5 · answered Nov 03 '21 at 18:07

For followers, I like the OP's inner query method:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id
) as x;

Since then you can use HAVING in there as well:

SELECT COUNT(*) FROM (
    SELECT COUNT(*) as count_all, posts.id as post_id 
    FROM posts 
    INNER JOIN votes ON votes.post_id = posts.id 
    GROUP BY posts.id HAVING count(*) > 1
) as x;

Or the equivalent CTE (shown here without the HAVING clause):

with posts_coalesced as (
     SELECT COUNT(*) as count_all, posts.id as post_id 
        FROM posts 
        INNER JOIN votes ON votes.post_id = posts.id 
        GROUP BY posts.id )

select count(*) from posts_coalesced;
