
I have a Messages table with a user_id field (string). A query for the total number of unique users is extremely slow with over a million records.

Message.where(created_at: start_date..end_date).select(:user_id).distinct(:user_id).count
=> (120145.6ms)  SELECT DISTINCT COUNT(DISTINCT "messages"."user_id") FROM "messages" WHERE ("messages"."created_at" BETWEEN '2016-05-14 04:00:00.000000' AND '2016-06-13 03:59:59.999000')

I have indexes on user_id and created_at, but Postgres doesn't seem to use them:

Schema

add_index "messages", ["user_id"], name: "index_messages_on_user_id", using: :btree
add_index "messages", ["created_at"], name: "index_messages_on_created_at", using: :btree

PG Explain

                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=56111.04..56291.89 rows=18085 width=29)
   Group Key: user_id
   ->  Seq Scan on messages  (cost=0.00..52215.65 rows=1558153 width=29)
         Filter: ((created_at >= '2016-05-14 04:00:00'::timestamp without time zone) AND (created_at <= '2016-06-13 03:59:59.999'::timestamp without time zone))
(4 rows)

Why are the indexes not used? Any tips for speeding up the query?

mnort9

1 Answer


The date range is probably not selective enough: it matches roughly 1.5 million rows (see the Seq Scan estimate in the plan), so Postgres decides that reading the entire table is cheaper than using either single-column index. For your query, I would recommend a composite index on messages(created_at, user_id). Because both columns the query needs are in the index, it is a covering index, and Postgres can answer the query from the index alone without touching the table rows.
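A sketch of the corresponding migration (assuming a standard Rails setup; the index name is just illustrative):

```ruby
class AddCreatedAtUserIdIndexToMessages < ActiveRecord::Migration
  def change
    # created_at comes first so the range filter can use the index;
    # user_id second so the distinct count can be served from the index alone.
    add_index :messages, [:created_at, :user_id],
              name: "index_messages_on_created_at_and_user_id"
  end
end
```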

Another idea would be to cluster the table on created_at (or on created_at, user_id). Note that in Postgres this is a one-time physical reordering via the CLUSTER command, not a continuously maintained clustered index. This should not hurt insert performance, because new records have higher created_at values and would land at the end of the table anyway. Keeping rows in date order reduces I/O for range scans like this one.
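For reference, the clustering step would look like this (assuming the index from the question's schema; CLUSTER takes an exclusive lock on the table while it runs):

```sql
-- Physically rewrite the table in index order. Postgres does not keep the
-- table clustered afterwards; re-run CLUSTER periodically if needed.
CLUSTER messages USING index_messages_on_created_at;
```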

Gordon Linoff
  • I tried `messages(created_at, user_id)` and it seems to ignore that index as well :( – mnort9 Jun 12 '16 at 15:02
  • @mnort9 . . . Try removing the `select distinct`. It makes no sense for this query and it might be confusing the optimizer. – Gordon Linoff Jun 12 '16 at 23:17
  • Without `select`, I can build the AR query using `Message.distinct.count(:user_id)`, but it produces the same `SELECT DISTINCT` sql – mnort9 Jun 13 '16 at 00:19
  • I think I've got it! This query performs much better `(585.5ms) SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM messages)`. ref: http://stackoverflow.com/questions/11250253/postgresql-countdistinct-very-slow – mnort9 Jun 13 '16 at 00:46
  • Oh, right . . . Postgres and Hive both have the same problem in their optimizers. – Gordon Linoff Jun 13 '16 at 01:53
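The faster subquery form from the comments can be built in ActiveRecord roughly like this (a sketch, assuming Rails 4+ where `from` accepts a relation; not runnable outside a Rails app with this schema):

```ruby
# Mirrors: SELECT COUNT(*) FROM (SELECT DISTINCT user_id FROM messages
#          WHERE created_at BETWEEN ... ) messages
subquery = Message.where(created_at: start_date..end_date)
                  .select(:user_id)
                  .distinct
Message.from(subquery, :messages).count
```

This avoids the slow `COUNT(DISTINCT ...)` aggregate by deduplicating in the subquery first and counting the rows that remain.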