Performance of max() vs ORDER BY DESC + LIMIT 1

Question

I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:

When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms

There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for both cases:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

So my two questions are:

Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an EXISTS check?

EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:

Table definitions:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I added ind_station_id on sensors just now after @Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.

Query Plans:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

So just like in the earlier explains, ORDER BY does a Scan using timestamp_in on data, which is not done in the MAX case.

Postgres version: Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that there are NOT NULL constraints in place, so ORDER BY won't have to sort over empty rows.

Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).

Please show `explain (buffers, analyze)` not just plain `explain`, as well as the output of `SELECT version();` — Craig Ringer, Dec 13 '15 at 09:46
While the queries say `WHERE sensor.station_id = 4` you have `Filter: (installation_id = 4)` in the `EXPLAIN` output. Please clarify. Also, you say you have an index on `(sensor_id, timestamp)`, but the index name in the `EXPLAIN` output is `timestamp_sensor_ind`, indicating the reverse order. Please supply your exact index definitions *and* table definitions of `data` and `sensors` - including all constraints. What you get with `\d tbl` in psql or complete `CREATE TABLE` scripts. — Erwin Brandstetter, Dec 13 '15 at 11:07
Sorry guys, I tried to clean up the naming a bit but apparently copied the wrong version :( The index actually has a cryptic name I gave it in testing and *is* on `(sensor_id, timestamp)`. The filter is on `station_id`. I'll get the `explain (buffers, analyze)` output tonight. — Geotob, Dec 13 '15 at 19:58
Don't forget table and index definition and Postgres version. And we are not notified of your answer unless you @-reply. — Erwin Brandstetter, Dec 14 '15 at 03:56
So long edit above @ErwinBrandstetter, hope it helps to understand what's going on here. — Geotob, Dec 14 '15 at 07:19
And also ping @CraigRinger, the extra info is in the question above now — Geotob, Dec 14 '15 at 07:19

Erwin Brandstetter · Accepted Answer · 2023-03-15T03:37:38.973

For performance, a matching index on sensor.station_id makes all the difference.

There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. null sorts last in default ascending sort order. Consequently, it sorts first in descending order. ORDER BY timestamp DESC LIMIT 1 returns null if one exists, while the aggregate function max() ignores null and returns the latest not-null timestamp. ORDER BY timestamp DESC NULLS LAST LIMIT 1 would be equivalent.

For your case, since your column d.timestamp is defined NOT NULL (as your update reveals), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve best. I suggest these indexes: (My query below builds on the 2nd one.)

sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)

Drop the other indexes ~~sensor_ind_timestamp~~ and ~~sensor_ind_timestamp_desc~~ unless they are in use otherwise (unlikely, but possible).

More importantly, there is another difficulty: The filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) according to your added query plan.
The perfect technique would be an index-skip-scan (a.k.a. loose index scan) for the second table data. But that is not currently implemented (up to Postgres 15). There are various workarounds. See:

Optimize GROUP BY query to retrieve latest row per user

The best should be:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

The choice between max() and ORDER BY / LIMIT hardly matters. You might as well:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

Or:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even use a correlated subquery, shortest of all:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

Note the double parentheses!

The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of selected row(s), not just the latest timestamp (one value).

This is great, thanks for taking the time, the detailed reply and the links! In terms of performance your last suggestion turns out to be the fastest: Timings for (matching row)/(no match) from first to last suggestion: 95ms/0.67ms, 95ms/0.70ms, 0.95ms/0.80ms. Now I need to read about cross joins. — Geotob, Dec 15 '15 at 08:24
@Geotob: `CROSS JOIN` is a rather trivial join without condition. `LATERAL` is the key feature. I added another link above. And another simple query with a subquery instead of the `LATERAL` join. — Erwin Brandstetter, Dec 15 '15 at 11:02

score 2 · Answer 2 · answered Dec 13 '15 at 10:10

2

The query plan shows index names timestamp_ind and timestamp_sensor_ind. But indexes like that do not help with a search for a particular sensor.

To resolve an equals query (like sensor.id = data.sensor_id) the column has to be the first in the index. Try to add an index that allows searching on sensor_id and, within a sensor, is sorted by timestamp:

create index sensor_timestamp_ind on data(sensor_id, timestamp);

Does adding that index speed up the query?

answered Dec 13 '15 at 10:10

Andomar

232,371
49
380
404

Thanks, the `timestamp_sensor_ind` was just confusingly named but on `data(sensor_id, timestamp)`all along. – Geotob Dec 14 '15 at 07:21

Performance of max() vs ORDER BY DESC + LIMIT 1

2 Answers2

Linked