I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:
When trying to extract the max(timestamp)
from a data table based on some condition, using MAX()
is slower than ORDER BY timestamp LIMIT 1
if a matching row exists, but considerably faster if no matching row is found.
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)
Time: 1314.544 ms
SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)
Time: 10.890 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms
SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms
There are indexes on (timestamp)
and (sensor_id, timestamp)
, and I noticed that Postgres uses very different query plans and indexes for both cases:
QUERY PLAN (ORDER BY)
--------------------------------------------------------------------------------------------------------
Limit (cost=0.43..9.47 rows=1 width=8)
-> Nested Loop (cost=0.43..396254.63 rows=43823 width=8)
Join Filter: (data.sensor_id = sensors.id)
-> Index Scan using timestamp_ind on data (cost=0.43..254918.66 rows=4710976 width=12)
-> Materialize (cost=0.00..6.70 rows=2 width=4)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
(7 rows)
QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3680.59..3680.60 rows=1 width=8)
-> Nested Loop (cost=0.43..3571.03 rows=43823 width=8)
-> Seq Scan on sensors (cost=0.00..6.69 rows=2 width=4)
Filter: (station_id = 4)
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12)
Index Cond: (sensor_id = sensors.id)
(6 rows)
So my two questions are:
- Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
- Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an
EXISTS
check?
EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:
Table definitions:
Table "public.sensors"
Column | Type | Modifiers
----------------------+------------------------+-----------------------------------------------------------------
id | integer | not null default nextval('sensors_id_seq'::regclass)
station_id | integer | not null
....
Indexes:
"sensor_primary" PRIMARY KEY, btree (id)
"ind_station_id" btree (station_id, id)
"ind_station" btree (station_id)
Table "public.data"
Column | Type | Modifiers
-----------+--------------------------+------------------------------------------------------------------
id | integer | not null default nextval('data_id_seq'::regclass)
timestamp | timestamp with time zone | not null
sensor_id | integer | not null
avg | integer |
Indexes:
"timestamp_ind" btree ("timestamp" DESC)
"sensor_ind" btree (sensor_id)
"sensor_ind_timestamp" btree (sensor_id, "timestamp")
"sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)
Note that I added ind_station_id
on sensors
just now after @Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms
in the ORDER BY DESC + LIMIT 1
case and ~0.9ms
in the MAX
case.
Query Plans:
QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
Buffers: shared hit=3418066 read=47326
-> Nested Loop (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
Join Filter: (data.sensor_id = sensors.id)
Buffers: shared hit=3418066 read=47326
-> Index Scan using timestamp_ind on data (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
Buffers: shared hit=3418065 read=47326
-> Materialize (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)
QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
Buffers: shared hit=1
-> Nested Loop (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
Buffers: shared hit=1
-> Index Only Scan using ind_station_id on sensors (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (station_id = 4)
Heap Fetches: 0
Buffers: shared hit=1
-> Index Only Scan using sensor_ind_timestamp on data (cost=0.43..1389.59 rows=39258 width=12) (never executed)
Index Cond: (sensor_id = sensors.id)
Heap Fetches: 0
Planning time: 0.435 ms
Execution time: 0.048 ms
(13 rows)
So just like in the earlier explains, ORDER BY
does a Scan using timestamp_in on data
, which is not done in the MAX
case.
Postgres version:
Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit
Note that there are NOT NULL
constraints in place, so ORDER BY
won't have to sort over empty rows.
Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms)
and then SELECT (~11ms)
.