3

I have a Postgres table with anonymous_id(string) and timestamp (datetime) columns created by Segment.com when users visit our website.

There are ~5M rows, ~1M distinct anonymous_id

I'd like to query the number of distinct anonymous_ids found per month.

I have this so far, which works, but timesout in PSequel (I can run it several times and restrict the date)

SELECT count(1), "month"
FROM (
    SELECT DISTINCT anonymous_id, 
    date_trunc('month', "timestamp") as "month"
    FROM pages
    -- WHERE "timestamp" between '2018-01-01' and '2018-02-01'
) as dt
GROUP BY 2
ORDER BY 2

I have an index on both anonymous_id and timestamp

Results of EXPLAIN ANALYSE

                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1667977.72..1667978.22 rows=200 width=8) (actual time=115861.803..115861.807 rows=27 loops=1)
   Sort Key: (date_trunc('month'::text, pages."timestamp"))
   Sort Method: quicksort  Memory: 26kB
   ->  HashAggregate  (cost=1667968.07..1667970.07 rows=200 width=8) (actual time=115861.763..115861.766 rows=27 loops=1)
         Group Key: (date_trunc('month'::text, pages."timestamp"))
         ->  Unique  (cost=1554502.82..1592324.57 rows=5042900 width=45) (actual time=97492.062..115468.396 rows=1158934 loops=1)
               ->  Sort  (cost=1554502.82..1567110.07 rows=5042900 width=45) (actual time=97492.060..113983.496 rows=5042900 loops=1)
                     Sort Key: pages.anonymous_id, (date_trunc('month'::text, pages."timestamp"))
                     Sort Method: external merge  Disk: 285936kB
                     ->  Seq Scan on pages  (cost=0.00..682820.25 rows=5042900 width=45) (actual time=0.088..25601.944 rows=5042900 loops=1)
 Planning time: 10.335 ms
 Execution time: 115910.353 ms
(12 rows)

Current Indexes (including combined index as suggested by Thorsten Kettner below)

Indexes:
    "pages_pkey" PRIMARY KEY, btree (id)
    "idx_anonymous_id" btree (anonymous_id)
    "idx_date_trunc_anon_id" btree (date_trunc('month'::text, timezone('UTC'::text, "timestamp")), anonymous_id)
    "idx_path" btree (path)
    "idx_timestamp" btree ("timestamp")
    "idx_url" btree (url)
    "idx_user_id" btree (user_id)
    "pages_activity_type_idx" btree (activity_type)
Guy Bowden
  • 4,997
  • 5
  • 38
  • 58

2 Answers2

1

Only thing I can think of is get rid of the derived table, since you don't need it:

SELECT count(distinct anonymous_id), date_trunc('month', "timestamp") AS "month"
FROM pages
GROUP BY date_trunc('month', "timestamp")
ORDER BY date_trunc('month', "timestamp");
Greg Viers
  • 3,473
  • 3
  • 18
  • 36
  • Increasing `work_mem` in addition to that should also help –  Feb 02 '18 at 14:25
  • I actually get a 3x speed up when using the derived table, if I restrict the set to a single month's entries. (4s vs 13s) – Guy Bowden Feb 02 '18 at 15:32
  • @Guy Bowden: You apply the same `WHERE` clause to both queries and yours is much faster? That is surprising Postgre's optimizer should do much better than this. Anyway, if it's a single month you want to aggregate, then you don't need a `GROUP BY` clause anymore, which simplifies the query again. – Thorsten Kettner Feb 02 '18 at 17:29
  • As Thorsten says: if you restrict it to one month you can easily do : `SELECT COUNT(DISTINCT anonymous_id) FROM pages WHERE "timestamp" between '2018-01-01' and '2018-02-01'` (wanted to write it out for clarity) (PS: you probably want to use '<=' and '<' instead of between, although in the real world the difference will be small) – deroby Feb 05 '18 at 13:23
0

You want an index that starts with your group by expression. Ideally it also contains the anonymous_id:

CREATE INDEX idx1 ON pages( date_trunc('month', "timestamp") , anonymous_id);

If you want a WHERE clause then you need its criteria first:

CREATE INDEX idx2 ON pages( "timestamp" , date_trunc('month', "timestamp") , anonymous_id);

As has been shown by Greg Viers your query is more complicated than needed. I'd go with his one. The index needed is the same for both queries.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73