4

In a Postgres database, I am querying distinct values of MY_DATE in a large table with 300 million rows. There are about 400 of them and the column MY_DATE is indexed.

Select distinct  MY_DATE from MY_TABLE;

The query runs for 22 min.

The same query on my Oracle DB with the exact same data-set and the same index definition runs 11 seconds.

The query plan shows that the query is using the index:

EXPLAIN Select distinct  MY_DATE from MY_TABLE LIMIT 200;

gives:

QUERY PLAN
Limit  (cost=0.57..7171644.14 rows=200 width=8)
  ->  Unique  (cost=0.57..15419034.24 rows=430 width=8)
        ->  Index Only Scan using idx_obsdate on my_table  (cost=0.57..14672064.14 rows=298788038 width=8)

When I limit the results, the query can become much faster. Ee.g.

Select distinct  MY_DATE from MY_TABLE LIMIT 5;

runs in sub-seconds.

but:

Select distinct  MY_DATE from MY_TABLE LIMIT 50;

already takes minutes. Time seems to increase exponentially with the LIMIT clause.

I expect the Postgres query to run in seconds, as my OracleDB does. 20 minutes for an index scan - even for a large table - seems way off the mark.

Any suggestions what causes the issue and what I can do?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
tomanizer
  • 851
  • 6
  • 16

1 Answers1

6

distinct values ... 300 million rows ... about 400 of them ... column ... indexed.

There are much faster techniques for this. Emulating a loose index scan (a.k.a. skip scan), and assuming my_date is defined NOT NULL (or we can ignore NULL values):

WITH RECURSIVE cte AS (
   SELECT min(my_date) AS my_date
   FROM   my_table

   UNION ALL
   SELECT (SELECT my_date
           FROM   my_table 
           WHERE  my_date > cte.my_date
           ORDER  BY my_date
           LIMIT  1)
   FROM   cte
   WHERE  my_date IS NOT NULL
   )
TABLE  cte;

Related:

Using the index you mentioned it should finish in milliseconds.

Oracle DB ... 11 seconds.

Because Oracle has native index skip scans and Postgres does not. There are ongoing efforts to implement similar functionality in Postgres 12.

Currently (Postgres 11), while the index is used to good effect, even in an index-only scan, Postgres cannot skip ahead and has to read index tuples in sequence. Without LIMIT, the complete index has to be scanned. Hence we see in your EXPLAIN output:

Index Only Scan ... rows=298788038

The suggested new query achieves the same with reading 400 index tuples (one per distinct value). Big difference.

With LIMIT (and no ORDER BY!) like you tested, Postgres stops as soon as enough rows are retrieved. Increasing the limit has a linear effect. But if the number of rows per distinct value can vary, so does the added cost.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks Erwin for the good explanation. Unfortunately your query also seems to run slow: CTE Scan on cte (cost=1061000068.83..1061000070.85 rows=101 width=8) CTE cte -> Recursive Union (cost=7872574.05..1061000068.83 rows=101 width=8) -> WorkTable Scan on cte cte_1 (cost=0.00..105312749.28 rows=10 width=8) -> Finalize Aggregate (cost=7872574.05..7872574.06 rows=1 width=8) Filter: (observation_date IS NOT NULL) -> Gather (cost=7872573.83..7872574.04 rows=2 width=8) – tomanizer Apr 13 '19 at 13:02
  • Workers Planned: 2 -> Partial Aggregate (cost=7871573.83..7871573.84 rows=1 width=8) -> Parallel Seq Scan on observations_2018feed_prod (cost=0.00..7562306.27 rows=123707027 width=8) – tomanizer Apr 13 '19 at 13:04
  • is this suggesting it will do a parallel scan on 123 million rows? – tomanizer Apr 13 '19 at 13:05
  • @tomsnizer: that shouldn't be. you shouldn't see a seq scan. if you provide postgres version, table and index definition (`CREATE ... ` scripts) *in the question*, I may able to find why ... – Erwin Brandstetter Apr 13 '19 at 22:21
  • 1
    I dropped the index and recreated the index. And now your query does run in 45ms! Thank you so much. – tomanizer Apr 15 '19 at 09:24
  • @ErwinBrandstetter can you provide an example of using this recursive CTE method with multiple columns from the table? If I have, say, four columns that are not null that I want to select distinct on, how do I do that with a recursive CTE? – jkeys Aug 19 '21 at 02:58
  • 1
    @jkeys: Can be implemented efficiently. I suggest you start a new *question* for your question. – Erwin Brandstetter Aug 23 '21 at 21:41