I've got a table users
with fields id
and email
. id
is the primary key and email
is indexed as well.
database> \d users
+-----------------------------+-----------------------------+-----------------------------------------------------+
| Column | Type | Modifiers |
|-----------------------------+-----------------------------+-----------------------------------------------------|
| id | integer | not null default nextval('users_id_seq'::regclass) |
| email | character varying | |
+-----------------------------+-----------------------------+-----------------------------------------------------+
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
"index_users_on_email" UNIQUE, btree (email)
If I query the table with a distinct on (email)
clause in a subquery I get a significant performance penalty.
database> explain (analyze, buffers)
select
id
from (
select distinct on (email)
id
from
users
) as t
where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------------|
| Subquery Scan on t (cost=8898.69..10077.84 rows=337 width=4) (actual time=221.133..250.782 rows=1 loops=1) |
| Filter: (t.id = 123) |
| Rows Removed by Filter: 67379 |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Unique (cost=8898.69..9235.59 rows=67380 width=24) (actual time=221.121..247.582 rows=67380 loops=1) |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Sort (cost=8898.69..9067.14 rows=67380 width=24) (actual time=221.120..239.573 rows=67380 loops=1) |
| Sort Key: users.email |
| Sort Method: external merge Disk: 2304kB |
| Buffers: shared hit=2824, temp read=288 written=289 |
| -> Seq Scan on users (cost=0.00..3494.80 rows=67380 width=24) (actual time=0.009..9.714 rows=67380 loops=1) |
| Buffers: shared hit=2821 |
| Planning Time: 0.243 ms |
| Execution Time: 251.258 ms |
+-----------------------------------------------------------------------------------------------------------------------------+
Compare this with distinct on (id)
whose cost is less than one thousandth of the previous query.
database> explain (analyze, buffers)
select
id
from (
select distinct on (id)
id
from
users
) as t
where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-----------------------------------------------------------------------------------------------------------------------------|
| Unique (cost=0.29..8.31 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) |
| Buffers: shared hit=3 |
| -> Index Only Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1) |
| Index Cond: (id = 123) |
| Heap Fetches: 1 |
| Buffers: shared hit=3 |
| Planning Time: 0.090 ms |
| Execution Time: 0.034 ms |
+-----------------------------------------------------------------------------------------------------------------------------+
Why is this?
The real problem I'm having is that I'm trying to create a view that does distinct on
an indexed column that isn't unique and the performance is very bad.