
I have a very simple SQL query:

SELECT COUNT(DISTINCT x) FROM table;

My table has about 1.5 million rows. This query is running pretty slowly; it takes about 7.5s, compared to

SELECT COUNT(x) FROM table;

which takes about 435ms. Is there any way to change my query to improve performance? I've tried grouping and doing a regular count, as well as putting an index on x; both have the same 7.5s execution time.
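
For reference, the variants I tried looked roughly like this (a sketch; table and index names simplified):

-- grouping and doing a regular count:
SELECT COUNT(*) FROM (SELECT x FROM table GROUP BY x) AS grouped;

-- the index:
CREATE INDEX table_x_idx ON table (x);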

ferson2020
  • I don't think so. Getting the distinct values of 1.5 million rows is just going to be slow. – Ry- Jun 28 '12 at 17:54
  • I just tried it in C#, getting the distinct values of 1.5 million *integers from memory* takes over one second on my computer. So I think you're probably out of luck. – Ry- Jun 28 '12 at 18:01
  • The query plan will very much depend on the table structure (indexes) and the setting of the tuning constants (work_mem, effective_cache_size, random_page_cost). With reasonable tuning the query could possibly be executed in less than a second. – wildplasser Jun 28 '12 at 18:10
  • Could you be more specific? What indexes and tuning constants would be required to get it under a second? For simplicity, assume this is a two-column table with a primary key on the first column y, and I'm doing this 'distinct' query on a second column x of type int, with 1.5 million rows. – ferson2020 Jun 28 '12 at 18:14
  • I am just experimenting: your query costs me 1.7 sec; `distinct val` with `count(*)` costs about 400 ms. A CTE will probably help the planner. BRB. – wildplasser Jun 28 '12 at 18:26
  • Again, could you be specific with what kind of CTE would help the planner? – ferson2020 Jun 28 '12 at 18:31
  • Two thoughts... it could be possible to get an approximation by doing `explain select distinct val from table` and seeing how many rows the planner believes there to be. The other thought: it should probably be possible, one way or another, to find the number of distinct entries in the index itself. Unfortunately I don't have time to investigate at the moment. Ah, a third suggestion: using a redundant stats table with a counter, updated through a trigger (see the sketch after these comments). None of the suggestions are very nice, however. Having an index, it really ought to be possible to do the count relatively fast... – tobixen Jun 28 '12 at 18:39
  • The CTE is more or less a trick to keep the count+distinct in different layers (and cause the "hash" plan to be used). The hashed plan needs some work_mem; setting `work_mem=64;` will force an index (or table) scan, which is about twice as slow. LOL, I just *proved* that postgres is faster than C# ;-) – wildplasser Jun 28 '12 at 18:40
  • With the same query, I am not using the "hash" plan; I'm getting "unique", "group", "sort". – ferson2020 Jun 28 '12 at 19:03
  • What is your postgres version? – wildplasser Jun 28 '12 at 19:05
  • Please include the table definition with all the indexes (the `\d` output of `psql` is a good one) and specify the column that you have a problem with. It'd be good to see `EXPLAIN ANALYZE` of both queries. – vyegorov Jun 28 '12 at 20:53
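
A rough sketch of the approximation ideas tobixen mentions above (names are the question's placeholders; the `pg_stats` figure is an estimate maintained by `ANALYZE`, not an exact count):

-- planner's estimate of the distinct row count:
EXPLAIN SELECT DISTINCT x FROM table;

-- statistics-based estimate; a negative n_distinct is a fraction of the row count:
SELECT n_distinct FROM pg_stats WHERE tablename = 'table' AND attname = 'x';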

5 Answers


You can use this:

SELECT COUNT(*) FROM (SELECT DISTINCT column_name FROM table_name) AS temp;

This is much faster than:

COUNT(DISTINCT column_name)
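
One caveat raised in the comments below: `COUNT(*)` over the subquery also counts a NULL group as one row, whereas `COUNT(DISTINCT column_name)` ignores NULLs. A sketch of a variant that matches the original's NULL handling:

-- counts only non-null distinct values, like COUNT(DISTINCT column_name):
SELECT COUNT(column_name)
FROM (SELECT DISTINCT column_name FROM table_name) AS temp;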
Ankur
  • Holy queries, Batman! This sped up my Postgres count distinct from 190s to 4.5s, whoa! – rogerdpack Nov 17 '14 at 22:25
  • I found this thread on [www.postgresql.org](http://www.postgresql.org) which discusses the same thing: [link](http://www.postgresql.org/message-id/CAONnt+72Mtg6kyAFDTHXFWyPPY-QRbAtuREak+64Lm1KN1c-wg@mail.gmail.com). One of the replies (by Jeff Janes) says that COUNT(DISTINCT()) sorts the table to do its work instead of using a hash. – Ankur Dec 09 '14 at 10:58
  • @Ankur May I ask you a question? Since `COUNT(DISTINCT())` performs sorting, it will definitely be helpful to have an index on `column_name`, especially with a relatively small amount of `work_mem` (where hashing would produce a relatively large number of batches). Given that, it's not always bad to use `COUNT(DISTINCT())`, is it? – St.Antario Oct 12 '15 at 10:39
  • Does anyone know why these 2 queries could return different results if there is a NULL value in column_name? – musmahn Sep 05 '18 at 12:21
  • @musmahn `Count(column)` only counts non-null values. `count(*)` counts rows. So the first/longer one will also count the null row (once). Change to `count(column_name)` to make them behave the same. – GolezTrol Nov 19 '18 at 19:39
  • @Ankur This wasn't very useful for me; I didn't get any remarkable improvement. – Channa Sep 09 '19 at 08:18
  • With multiple (2) columns this made almost no difference. I guess I'll have to try creating a new column that merges those two. – b15 Apr 22 '21 at 14:29
  • I checked it. No difference in PostgreSQL 13. – guogangj Aug 04 '21 at 07:17
  • If this is true, then the people who wrote the internal code for PostgreSQL must not be very bright. Why wouldn't they implement `count(distinct)` exactly as this alternative query, if it's so much faster? How ***else*** did they implement it? This makes no sense to me. –  Sep 29 '21 at 17:44
  • Using Postgres 14, the suggestion is beneficial when working with 40-60 char text. However, it is actually slower when using int8. – Achilles Dec 20 '22 at 10:05
  • In SQL Server both ways are equally slow. – skan Apr 26 '23 at 13:48
-- My default settings (this is basically a single-session machine, so work_mem is pretty high)
SET effective_cache_size='2048MB';
SET work_mem='16MB';

\echo original
EXPLAIN ANALYZE
SELECT
        COUNT (distinct val) as aantal
FROM one
        ;

\echo group by+count(*)
EXPLAIN ANALYZE
SELECT
        distinct val
       -- , COUNT(*)
FROM one
GROUP BY val;

\echo with CTE
EXPLAIN ANALYZE
WITH agg AS (
    SELECT distinct val
    FROM one
    GROUP BY val
    )
SELECT COUNT (*) as aantal
FROM agg
        ;

Results:

original
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36448.06..36448.07 rows=1 width=4) (actual time=1766.472..1766.472 rows=1 loops=1)
   ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=31.371..185.914 rows=1499845 loops=1)
 Total runtime: 1766.642 ms
(3 rows)

group by+count(*)
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=412.470..412.598 rows=1300 loops=1)
   ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=412.066..412.203 rows=1300 loops=1)
         ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=26.134..166.846 rows=1499845 loops=1)
 Total runtime: 412.686 ms
(4 rows)

with CTE
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36506.56..36506.57 rows=1 width=0) (actual time=408.239..408.239 rows=1 loops=1)
   CTE agg
     ->  HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=407.704..407.847 rows=1300 loops=1)
           ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=407.320..407.467 rows=1300 loops=1)
                 ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=24.321..165.256 rows=1499845 loops=1)
       ->  CTE Scan on agg  (cost=0.00..26.00 rows=1300 width=0) (actual time=407.707..408.154 rows=1300 loops=1)
 Total runtime: 408.300 ms
(7 rows)

The same plan as for the CTE could probably also be produced by other methods (window functions).
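
For instance, a window-function variant might look like this (a sketch, not from the original benchmark: `dense_rank` numbers the distinct `val`s in order, so its maximum is the distinct count; note that, unlike `COUNT(DISTINCT val)`, it counts NULL as one extra value if present):

EXPLAIN ANALYZE
SELECT max(rnk) AS aantal
FROM (
    SELECT dense_rank() OVER (ORDER BY val) AS rnk
    FROM one
) AS ranked;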

wildplasser
  • Have you considered the effect of caching? If doing three "explain analyze" runs in succession, the first one may be slow, fetching things from disk, while the latter two may be fast, fetching from memory. – tobixen Jun 28 '12 at 18:45
  • Indeed: effective_cache_size is the first setting to tweak. Mine is 2GB, IIRC. – wildplasser Jun 28 '12 at 18:46
  • I set my effective_cache_size to 2GB, with no change in performance. Any other settings you'd suggest tweaking? If so, to what? – ferson2020 Jun 28 '12 at 18:59
  • 1) *how* did you set it? (did you HUP it?) 2) Do you actually have that much memory available? 3) show us your plan. 4) maybe my machine is faster, or yours has more concurrent load to deal with. @ferson2020: Ok – wildplasser Jun 28 '12 at 19:03
  • I set it with the statement: SET effective_cache_size='2GB'; I do have that much memory available. I tried including my query plan, but it won't fit in the comment box. – ferson2020 Jun 28 '12 at 19:05
  • 1) what is your estimated row_width? 2) do you have a usable index on the `distinct x` column? 3) you could put the query somewhere else (github?), I could edit it into the original question. – wildplasser Jun 28 '12 at 20:09

If your count(distinct(x)) is significantly slower than count(x), then you can speed up this query by maintaining x value counts in a different table, for example table_name_x_counts (x integer not null, x_count int not null), using triggers (see the sketch below). But your write performance will suffer, and if you update multiple x values in a single transaction then you'd need to do this in some explicit order to avoid possible deadlock.
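
A minimal sketch of the trigger approach (assuming PostgreSQL 9.5+ for `INSERT ... ON CONFLICT` and 11+ for `EXECUTE FUNCTION`; the table and column names follow the hypothetical example above):

CREATE TABLE table_name_x_counts (
    x       integer NOT NULL PRIMARY KEY,
    x_count integer NOT NULL
);

CREATE FUNCTION maintain_x_counts() RETURNS trigger AS $$
BEGIN
    -- bump the counter for the new value
    -- (note: a NULL x would need special handling, it can't be a primary key value)
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO table_name_x_counts AS c (x, x_count)
        VALUES (NEW.x, 1)
        ON CONFLICT (x) DO UPDATE SET x_count = c.x_count + 1;
    END IF;
    -- decrement the counter for the old value, dropping emptied rows
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE table_name_x_counts SET x_count = x_count - 1 WHERE x = OLD.x;
        DELETE FROM table_name_x_counts WHERE x = OLD.x AND x_count = 0;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_name_x_counts_trg
    AFTER INSERT OR UPDATE OF x OR DELETE ON table_name
    FOR EACH ROW EXECUTE FUNCTION maintain_x_counts();

-- the distinct count is then a cheap lookup:
SELECT COUNT(*) FROM table_name_x_counts;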

Tometzky

I was also searching for the same answer, because at some point I needed the total_count with distinct values along with limit/offset.

It's a little tricky to do, because getting a total count together with distinct values and limit/offset is usually hard. Finally I found a way to do it:

SELECT DISTINCT COUNT(*) OVER() as total_count, * FROM table_name limit 2 offset 0;

Query performance is also high.

Rana Pratap Singh

I had a similar problem, but I had multiple columns I wanted to count. So I tried these 2 queries.

Count Distinct:

SELECT
       to_char(action_date, 'YYYY-MM') as "Month",
       count(*) as "Count",
       count(distinct batch_id)
FROM transactions t
         JOIN batches b on t.batch_id = b.id
GROUP BY to_char(action_date, 'YYYY-MM')
ORDER BY to_char(action_date, 'YYYY-MM');

Sub-Query:

WITH batch_counts AS (
    SELECT to_char(action_date, 'YYYY-MM') as "Month",
           COUNT(*)                        as t_count
    FROM transactions t
             JOIN batches b on t.batch_id = b.id
    GROUP BY to_char(action_date, 'YYYY-MM'), b.id
)
SELECT "Month",
       SUM(t_count) as "Transactions",
       COUNT(*)     as "Batches"
FROM batch_counts
GROUP BY "Month"
ORDER BY "Month";

I ran both of these queries multiple times on my test data of about 100k rows; the sub-query approach ran in ~90ms on average, but the count distinct approach took about ~200ms on average.

nobody