
The problem: query taking too long

I have a new table that looks like this, with 3e6 rows:

CREATE TABLE everything_crowberry (
    id             SERIAL  PRIMARY KEY,
    group_id       INTEGER,
    group_type     group_type_name,
    epub_id        TEXT,
    reg_user_id    INTEGER,
    device_id      TEXT,
    campaign_id    INTEGER,
    category_name  TEXT,
    instance_name  TEXT,
    protobuf       TEXT,
    UNIQUE (group_id, group_type, reg_user_id, category_name, instance_name)
);

This generally makes sense for my context, and most queries are acceptably fast.

But what's not fast are queries like this:

analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=392177.29..392177.30 rows=1 width=4) (actual time=8909.698..8909.699 rows=1 loops=1)
   ->  Seq Scan on everything_crowberry  (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.461..6347.272 rows=3198583 loops=1)
 Planning time: 0.063 ms
 Execution time: 8909.730 ms
(4 rows)

Time: 8910.110 ms
analytics_staging=> select count(distinct group_id) from everything_crowberry;
 count 
-------
   481

Time: 8736.364 ms

I did create an index on group_id, and while that index is used when there's a WHERE clause, it is not used in the query above. So I conclude I've misunderstood something about how Postgres uses indexes. Note (see the query result above) that there are under 500 distinct group_id values.

CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);
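
For reference, a filtered query does pick up the index. A made-up example (the literal 42 is arbitrary):

EXPLAIN ANALYZE
SELECT count(*)
FROM everything_crowberry
WHERE group_id = 42;  -- on my data this goes through everything_crowberry_group_id, not a seq scan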

Any pointers on what I've misunderstood, or on how to make this particular query go faster?

Updates

To address the questions posed in the comments, I've added the suggested changes here. For future readers, I've kept the detail so it's easier to follow how this got debugged.

Playing around, I notice that most of the time is taken up by the initial Aggregate node rather than the scan.

seqscan

Turning off seqscan makes this much worse:

analytics_staging=> set enable_seqscan = false;

analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=444062.28..444062.29 rows=1 width=4) (actual time=38927.323..38927.323 rows=1 loops=1)
   ->  Bitmap Heap Scan on everything_crowberry  (cost=51884.99..436065.82 rows=3198583 width=4) (actual time=458.252..36167.789 rows=3198583 loops=1)
         Heap Blocks: exact=35734 lossy=316446
         ->  Bitmap Index Scan on everything_crowberry_group  (cost=0.00..51085.35 rows=3198583 width=0) (actual time=448.537..448.537 rows=3198583 loops=1)
 Planning time: 0.064 ms
 Execution time: 38927.971 ms

Time: 38930.328 ms

WHERE can make it worse

Restricting to a small set of group_ids makes it even worse, whereas I'd have thought it would be easier to count a smaller set of things.

analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry WHERE group_id > 380;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=385954.43..385954.44 rows=1 width=4) (actual time=13438.422..13438.422 rows=1 loops=1)
   ->  Bitmap Heap Scan on everything_crowberry  (cost=18742.95..383451.68 rows=1001099 width=4) (actual time=132.571..12673.233 rows=986572 loops=1)
         Recheck Cond: (group_id > 380)
         Rows Removed by Index Recheck: 70816
         Heap Blocks: exact=49632 lossy=79167
         ->  Bitmap Index Scan on everything_crowberry_group  (cost=0.00..18492.67 rows=1001099 width=0) (actual time=120.816..120.816 rows=986572 loops=1)
               Index Cond: (group_id > 380)
 Planning time: 1.294 ms
 Execution time: 13439.017 ms
(9 rows)

Time: 13442.603 ms

explain(analyze, buffers)

analytics_staging=> explain(analyze, buffers) select count(distinct group_id) from everything_crowberry;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=392177.29..392177.30 rows=1 width=4) (actual time=7329.775..7329.775 rows=1 loops=1)
   Buffers: shared hit=16283 read=335912, temp read=4693 written=4693
   ->  Seq Scan on everything_crowberry  (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.224..4615.015 rows=3198583 loops=1)
         Buffers: shared hit=16283 read=335912
 Planning time: 0.089 ms
 Execution time: 7329.818 ms

Time: 7331.084 ms

work_mem is too small (cf. explain(analyze, buffers) above)

Increasing it from the default 4 MB to 10 MB provides a bit of improvement, from 7300 ms to 5500 ms or so.
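
For reference, the session-level change is just this (10 MB is simply the value I tried, not a tuned setting):

SET work_mem = '10MB';  -- default is 4MB; aims to keep the aggregation in memory (cf. the temp read/written lines above)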

Changing SQL helps a bit, too.

analytics_staging=> EXPLAIN(analyze, buffers) SELECT group_id FROM everything_crowberry GROUP BY group_id;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=392177.29..392181.56 rows=427 width=4) (actual time=4686.525..4686.612 rows=481 loops=1)
   Group Key: group_id
   Buffers: shared hit=96 read=352099
   ->  Seq Scan on everything_crowberry  (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.034..4017.122 rows=3198583 loops=1)
         Buffers: shared hit=96 read=352099
 Planning time: 0.094 ms
 Execution time: 4686.686 ms

Time: 4687.461 ms

analytics_staging=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=392177.29..392181.56 rows=427 width=4) (actual time=5536.151..5536.262 rows=481 loops=1)
   Group Key: group_id
   Buffers: shared hit=128 read=352067
   ->  Seq Scan on everything_crowberry  (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.030..4946.024 rows=3198583 loops=1)
         Buffers: shared hit=128 read=352067
 Planning time: 0.074 ms
 Execution time: 5536.321 ms

Time: 5537.380 ms

analytics_staging=> SELECT count(*) FROM (SELECT 1 FROM everything_crowberry GROUP BY group_id) ec;
 count 
-------
   481

Time: 4927.671 ms

Creating a view is a major win, though it may create performance issues elsewhere.

analytics_production=> CREATE VIEW everything_crowberry_group_view AS select distinct group_id, group_type FROM everything_crowberry;
CREATE VIEW
analytics_production=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry_group_view;
                                                                                                           QUERY PLAN                                                                                                            
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.56..357898.89 rows=200 width=4) (actual time=0.046..1976.882 rows=447 loops=1)
   Buffers: shared hit=667230 read=109291 dirtied=108 written=988
   ->  Subquery Scan on everything_crowberry_group_view  (cost=0.56..357897.19 rows=680 width=4) (actual time=0.046..1976.616 rows=475 loops=1)
         Buffers: shared hit=667230 read=109291 dirtied=108 written=988
         ->  Unique  (cost=0.56..357890.39 rows=680 width=8) (actual time=0.044..1976.378 rows=475 loops=1)
               Buffers: shared hit=667230 read=109291 dirtied=108 written=988
               ->  Index Only Scan using everything_crowberry_group_id_group_type_reg_user_id_catego_key on everything_crowberry  (cost=0.56..343330.63 rows=2911953 width=8) (actual time=0.043..1656.409 rows=2912005 loops=1)
                     Heap Fetches: 290488
                     Buffers: shared hit=667230 read=109291 dirtied=108 written=988
 Planning time: 1.842 ms
 Execution time: 1977.086 ms
  • Can you run an `explain (analyze, buffers)` instead of just `explain (analyze)`? Maybe use `set track_io_timing=on` before that, so that we can see the I/O timings as well –  Aug 19 '19 at 15:58
  • Done, question updated. I see now that the bulk of the time is taking place in the aggregate, not the seqscan, so the usage or not of an index is perhaps a red herring. – jma Aug 19 '19 at 16:08
  • "*temp read=4693 written=4693*" indicates that there isn't enough `work_mem` to do the aggregation in memory. –  Aug 19 '19 at 16:09
  • What is your Postgres version? I would have expected a Parallel Seq Scan and a Parallel Hash Aggregate for that size of the table –  Aug 19 '19 at 16:09
  • Version is 9.5. – jma Aug 19 '19 at 16:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198141/discussion-between-jma-and-a-horse-with-no-name). – jma Aug 19 '19 at 16:12

2 Answers


For relatively few distinct values in group_id (many rows per group) - as seems to be your case:

3e6 rows / under 500 distinct group_id's

To make this fast, you would need an index skip scan (a.k.a. loose index scan). That's not implemented up to Postgres 12. But you can work around the limitation with a recursive query:

Replace:

select count(distinct group_id) from everything_crowberry;

With:

WITH RECURSIVE cte AS (
   (SELECT group_id FROM everything_crowberry ORDER BY group_id LIMIT 1)
   UNION ALL
   SELECT (SELECT group_id FROM everything_crowberry
           WHERE  group_id > t.group_id ORDER BY group_id LIMIT 1)
   FROM   cte t
   WHERE  t.group_id IS NOT NULL
   )
SELECT count(group_id) FROM cte;

I use count(group_id) instead of the slightly faster count(*) to conveniently eliminate a NULL value from the final recursion - as count(<expression>) only counts nonnull values.

Also, it does not matter whether group_id can be NULL, as NULL is not counted by your original query anyway.
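
A quick demo of that count(<expression>) behavior, separate from the table at hand:

SELECT count(x) AS cnt_x, count(*) AS cnt_all
FROM  (VALUES (1), (2), (NULL)) v(x);  -- cnt_x = 2 (the NULL is ignored), cnt_all = 3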

Can use the index you already have:

CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);

For relatively many distinct values in group_id (few rows per group) - or for small tables - plain DISTINCT will be faster. Typically fastest when done in a subquery, as opposed to an added clause to the count():

SELECT count(group_id) -- or just count(*) to include possible NULL value
FROM (SELECT DISTINCT group_id FROM everything_crowberry) sub;
  • good explanation and neat solution! I edited the query slightly to remove a reference to `col` that I assume was left over from editing – Sam Mason Aug 19 '19 at 17:13
  • Wow, that's really fast! Thanks! – jma Aug 19 '19 at 17:17
  • Should be several orders of magnitude faster for your case. Oh, and for the query at hand it does not matter whether `group_id` can be NULL. (For other queries it might.) – Erwin Brandstetter Aug 19 '19 at 17:21
  • Yes, it's 6 ms. – jma Aug 19 '19 at 17:29
  • group_id is not NULL (the constraint was missing when the table was created, I should fix that), but it matters enormously whether I have that `WHERE t.group_id IS NOT NULL` in the query. – jma Aug 19 '19 at 17:30
  • @jma: The `WHERE t.group_id IS NOT NULL` in my rCTE is actually unrelated to NULL values in the table (though it deceivingly looks like that at first glance). It's the break condition that stops the recursion once the correlated subquery finds no next greater `group_id`, which results in a NULL value in `t.group_id`. – Erwin Brandstetter Aug 19 '19 at 17:35

I have seen issues with count(distinct) in Postgres at times.

How does this work?

select count(*)
from (select distinct group_id
      from everything_crowberry
     ) ec;

Or:

select count(*)
from (select distinct on (group_id) ec.*
      from everything_crowberry ec
     ) ec;

Note that the NULL handling is slightly different, but the queries can easily be adjusted for that.
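
For example, if group_id could be NULL, plain DISTINCT keeps the NULL as one extra row while count(distinct group_id) ignores it. A sketch of one way to adjust the first query to match:

select count(*)
from (select distinct group_id
      from everything_crowberry
      where group_id is not null
     ) ec;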

  • Nice suggestion! With work_mem increased a bit (thanks @a_horse_with_no_name) the query takes 6000 ms. Your first suggestion is 4600 ms. My GROUP BY (above) is 4900. Creating a VIEW is 2000. – jma Aug 19 '19 at 16:47