1

I would like to know if performing a SELECT DISTINCT query implies a sequential scan, and how I can optimize it.

I created a dummy table and confirmed that when there is no index, SELECT DISTINCT does a Seq Scan.

test=# create table test2 (id SERIAL, t1 text);
CREATE TABLE
test=# insert into test2 select generate_series(0, 100000) AS id, md5(random()::text) AS t1;
INSERT 0 100001
test=# explain analyze select distinct t1 from test2;

Results in:

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2157.97..2159.97 rows=200 width=32) (actual time=54.086..77.352 rows=100000 loops=1)
   Group Key: t1
   ->  Seq Scan on test2  (cost=0.00..1893.18 rows=105918 width=32) (actual time=0.012..12.232 rows=100001 loops=1)
 Planning time: 0.079 ms
 Execution time: 86.345 ms
(5 rows)

When we create index:

test=# create index test2_idx_t1 on test2 (t1);
CREATE INDEX
test=# explain analyze select distinct t1 from test2;

Results in:

first time:

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2084.01..2086.01 rows=200 width=32) (actual time=48.871..74.617 rows=100000 loops=1)
   Group Key: t1
   ->  Seq Scan on test2  (cost=0.00..1834.01 rows=100001 width=32) (actual time=0.009..9.891 rows=100001 loops=1)
 Planning time: 0.145 ms
 Execution time: 83.564 ms
(5 rows)

second time and onwards:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.42..7982.42 rows=100001 width=33) (actual time=0.016..80.949 rows=100000 loops=1)
   ->  Index Only Scan using test2_idx_t1 on test2  (cost=0.42..7732.42 rows=100001 width=33) (actual time=0.015..53.396 rows=100001 loops=1)
         Heap Fetches: 100001
 Planning time: 0.053 ms
 Execution time: 87.552 ms
(5 rows)
  1. Why is it doing a Seq Scan when queried the first time after the index was created?
  2. Why is the index scan more expensive than the seq scan in this case, and why is the query planner choosing it?
Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
mc9
  • 6,121
  • 13
  • 49
  • 87
  • https://stackoverflow.com/questions/52638557/postgresql-huge-performance-difference-when-using-in-vs-not-in/52638582#52638582 – dwir182 Oct 05 '18 at 01:32
  • @dwir182 I don't see how that link relates to this question. Would you care to elaborate? – mc9 Oct 05 '18 at 01:35
  • I am sorry wrong link.. I am not good enough to tell this but if you are asking `index scan` and `seq scan` and `cost index so expensive` hope this link will help you [Perform Seq Scan Not Index Scan](https://stackoverflow.com/questions/5203755/why-does-postgresql-perform-sequential-scan-on-indexed-column) and [Index Scan Lower than Seq Scan](https://dba.stackexchange.com/questions/180271/why-index-scan-is-slower-then-seq-scan) – dwir182 Oct 05 '18 at 01:57
  • Is this for a particular practical example, or are you just researching it? Your data set is such that the entire index may be in ram for starters. When querying the full dataset, it's not likely that a btree index will be used in general, but obviously it can be. If you have a practical example of what you're trying to do, it's highly likely that normalization is the best answer via a separate table and foreign key. – Joe Love Oct 05 '18 at 06:15

1 Answers1

0

To get the result of a query that is about all the rows in a table, the whole table has to be scanned.

The only way you can avoid a sequential table scan is to have an index on t1 and have a recently vacuumed table so that most blocks are "all visible". Then an "index only scan" can be used, which is usually cheaper.

Why is the index only scan not used right away? I cannot answer that with absolute certainty, but a good guess is that autovacuum was still busy on the table when you ran the query the first time.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263