
My goal is to create a query that returns a count of unique customers who have purchased within a 365-day window. I created the query below in Postgres, and the resulting queries are very slow. My table is 812,024 rows of just order dates and customer ids. When I remove the DISTINCT, the query returns a result in approximately 60 seconds; with it, I have yet to see it finish. I created an index on (order_date, id). I am a complete newbie to SQL, this is really the first time I have ever done anything with it, and after trying to find a solution to this problem all day, I could not find anything I could get to work, even though I have seen a lot about the slow performance of DISTINCT.

SELECT
    (d1.Ordered) AS Ordered,
    COUNT(distinct d2.ID) Users
FROM
(
    SELECT order_date AS Ordered
    FROM orders
    GROUP BY order_date
) d1 
INNER JOIN
(
    SELECT order_date AS Ordered, id
    FROM orders
) d2
ON d2.Ordered BETWEEN d1.Ordered - 364 AND d1.Ordered
GROUP BY d1.Ordered
ORDER BY d1.Ordered

"Sort  (cost=3541596.30..3541596.80 rows=200 width=29)"
"  Sort Key: orders_1.order_date"
"  ->  HashAggregate  (cost=3541586.66..3541588.66 rows=200 width=29)"
"        ->  Nested Loop  (cost=16121.73..3040838.52 rows=100149627 width=29)"
"              ->  HashAggregate  (cost=16121.30..16132.40 rows=1110 width=4)"
"                    ->  Seq Scan on orders orders_1  (cost=0.00..14091.24 rows=812024 width=4)"
"              ->  Index Only Scan using x on orders  (cost=0.43..1822.70 rows=90225 width=29)"
"                    Index Cond: ((order_date >= (orders_1.order_date - 364)) AND (order_date <= orders_1.order_date))"
lbollar
  • Not quite sure if I understand your setup here...mind giving us the table create statement on Orders and let us know how the order relation with users works? I don't see anything to do with users in the existing query...you are joining dates to dates and I don't quite get how there is a distinct user in there – Twelfth Jul 29 '14 at 22:37
  • What is the purpose of the self-join? – jjanes Jul 29 '14 at 22:51
  • 1
  • I don't know if I made it clear, but each row will have a date and the number of unique customers who purchased in the 365 days prior to that day. So the number for today would be the count of unique customers for the last year, and this would be a sliding range, with an entry for every day since there have been orders. – lbollar Jul 29 '14 at 23:03
  • How commonly does the same customer have multiple purchases on the same day? – Erwin Brandstetter Jul 29 '14 at 23:35
  • It needs to count customers who have purchased in the last year only once. If they have purchased one or more times in the last year, they fall under the definition of an active customer, and therefore I only want to count them once. I want to track this metric daily to see what the active customer total is on a daily basis. – lbollar Jul 29 '14 at 23:48
  • I think I have understood that. My question is: how many daily duplicates are in your table? How commonly do you have rows for the same user on the same day? – Erwin Brandstetter Jul 29 '14 at 23:53
  • Sorry Erwin, very rarely, I would say less than 10 times per month. – lbollar Jul 29 '14 at 23:57
  • Check my fixed window function answer – Clodoaldo Neto Jul 30 '14 at 11:58
  • What's your exact version of Postgres? `SELECT version();` – Erwin Brandstetter Jul 30 '14 at 13:15
  • I am using version 9.3 – lbollar Jul 30 '14 at 14:47
  • 9.3.? Are you on the latest point release? To rule out possible bugs in early versions. – Erwin Brandstetter Jul 30 '14 at 14:59
  • Sorry, version 9.3.4. – lbollar Jul 30 '14 at 15:14

3 Answers


No need for the self-join; use generate_series:

select
    g.order_date as "Ordered",
    count(distinct o.id) as "Users"
from
    generate_series(
        (select min(order_date) from orders),
        (select max(order_date) from orders),
        '1 day'
    ) g (order_date)
    left join
    orders o on o.order_date between g.order_date - interval '364 days' and g.order_date
group by 1
order by 1
Clodoaldo Neto

You haven't shown your schema, so some guesswork here. Change column names etc as appropriate.

SELECT 
  count(DISTINCT users.user_id)
FROM users
INNER JOIN orders ON (users.user_id = orders.user_id)
WHERE orders.order_date > current_date - INTERVAL '1' YEAR;

or

SELECT count(*)
FROM (
  SELECT users.user_id
  FROM users
  INNER JOIN orders ON (users.user_id = orders.user_id)
  WHERE orders.order_date > current_date - INTERVAL '1' YEAR
  GROUP BY users.user_id
) AS active_users;
Craig Ringer

Assuming actual date types.

SELECT d.day, count(distinct o.id) AS users_past_year
FROM  (
   SELECT generate_series(min(order_date), max(order_date), '1 day')::date AS day
   FROM   orders         -- single query
   ) d
LEFT JOIN (              -- fold duplicates on same day right away
   SELECT id, order_date
   FROM   orders
   GROUP  BY 1,2
   ) o ON o.order_date >  d.day - interval '1 year' -- exclude
      AND o.order_date <= d.day                     -- include
GROUP  BY 1
ORDER  BY 1;

Folding multiple purchases from the same user on the same day first only makes sense if that's a common thing. Else it will be faster to omit that step and simply left-join to the table orders instead.
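A sketch of that simplified variant, still assuming the table holds nothing but orders(order_date, id):

SELECT d.day, count(DISTINCT o.id) AS users_past_year
FROM  (
   SELECT generate_series(min(order_date), max(order_date), '1 day')::date AS day
   FROM   orders
   ) d
LEFT   JOIN orders o ON o.order_date >  d.day - interval '1 year'
                    AND o.order_date <= d.day
GROUP  BY 1
ORDER  BY 1;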

It's rather odd that orders.id would be the ID of the user. Should be named something like user_id.

If you are not comfortable with generate_series() in the SELECT list (which works just fine), you can replace that with a LATERAL JOIN in Postgres 9.3+.

FROM  (SELECT min(order_date) AS a
            , max(order_date) AS z FROM orders) x
    , generate_series(x.a, x.z, '1 day') AS d(day)
LEFT JOIN ...

Note that day is type timestamp in this case. Works the same. You may want to cast.
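For instance, a cast in the SELECT list keeps the output as plain dates (sketch; the rest of the query stays unchanged):

SELECT d.day::date AS day, count(distinct o.id) AS users_past_year
FROM   ...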

General performance tips

I understand this is a read-only table for a single user. This simplifies things.
You already seem to have an index:

CREATE INDEX orders_mult_idx ON orders (order_date, id);

That's good.

Some things to try:

Basics

Of course, the usual performance advice applies:
https://wiki.postgresql.org/wiki/Slow_Query_Questions
https://wiki.postgresql.org/wiki/Performance_Optimization

Streamline table

Cluster your table using this index once:

CLUSTER orders USING orders_mult_idx;

This should help a bit. It also effectively runs VACUUM FULL on the table, which removes any dead rows and compacts the table if applicable.

Better statistics

ALTER TABLE orders ALTER COLUMN order_date SET STATISTICS 1000;
ANALYZE orders;


Allocate more RAM

Make sure you have ample resources allocated, in particular for shared_buffers and work_mem. work_mem can be raised temporarily for your session; shared_buffers can only be changed in postgresql.conf and requires a server restart.
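For example, for the current session only (the value is just a placeholder; size it to the RAM you can spare):

SET work_mem = '256MB';   -- per session; helps sorts and hash aggregates
-- run the query ...
RESET work_mem;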

Experiment with planner methods

Try disabling nested loops (enable_nestloop) (in your session only). Maybe hash joins are faster. (I would be surprised, though.)

SET enable_nestloop = off;
-- test ...

RESET enable_nestloop;

Temporary table

Since this seems to be a "temporary table" by nature, you could try making it an actual temporary table kept in RAM only. You need enough RAM to allocate enough temp_buffers.

Be sure to run ANALYZE manually. Temp tables are not covered by autovacuum.
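A rough sketch of how that could look (table and index names are only examples, and temp_buffers must be set before the first use of temporary tables in the session):

SET temp_buffers = '500MB';              -- before any temp table is touched in this session

CREATE TEMP TABLE orders_tmp AS
SELECT order_date, id
FROM   orders;

CREATE INDEX orders_tmp_idx ON orders_tmp (order_date, id);
ANALYZE orders_tmp;                      -- temp tables are not covered by autovacuum

-- then run the big query against orders_tmp instead of orders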

Erwin Brandstetter
  • Sorry, it is an odd table. I was just trying to produce a one-off metric of our active customer history and mess around with SQL for the first time, so I just dumped a csv that contained order dates in one column and the customer id that ordered on that date in the other column. I did not set it up like you traditionally would with separate customers and orders tables. I was just going to delete the table once I got the data I needed. Sorry for that confusion. – lbollar Jul 30 '14 at 00:06
  • I was able to return a result with this query, but it ended up taking 44 minutes, which is probably the same as the queries I had been able to come up with earlier. I'm sure there is probably something wrong with the way I have created the db or something, as this is the first time I have ever created one, but would you have any idea off the top of your head why the query is taking so long? – lbollar Jul 30 '14 at 02:08
  • @lbollar: It *is* an expensive query. The `DISTINCT` forces Postgres to compute each daily value over the time frame of a year, which covers much of your 800k rows for each day. 44 minutes still seems *excessive*, though. I added some hints that might help improve performance. – Erwin Brandstetter Jul 30 '14 at 13:35
  • 2
  • By just boosting work_mem, I gained over 15 minutes on the query. I tried the rest of the suggestions other than disabling nested loops and creating a temporary table, although I will try these as well, but other than boosting the memory, nothing had any effect. I am considering this solved; your query was much better than my original query, with great detail and lots of extra reading that will help me out. This is my first SO question, and I will try to provide better details next time. Thank you. – lbollar Jul 31 '14 at 01:33