
I have the following query, which fetches the ids of the latest N observations for each station:

SELECT id
FROM (
  SELECT station_id, id, created_at,
         row_number() OVER(PARTITION BY station_id
                           ORDER BY created_at DESC) AS rn
  FROM (
      SELECT station_id, id, created_at
      FROM observations
  ) s
) s
WHERE rn <= #{n}
ORDER BY station_id, created_at DESC;

I have indexes on id, station_id, created_at.

This is the only solution I have come up with that can fetch more than a single record per station. However, it is quite slow (154.0 ms for a table of 81000 records).

How can I speed up the query?

Erwin Brandstetter
max
  • https://wiki.postgresql.org/wiki/Slow_Query_Questions –  Sep 21 '14 at 09:49
  • Partitioning won't help in this case. Your observations table is under 8MB; it will fit into the server's memory. Your query plan contains a seq scan on the observations table. Question: how important is it to query up-to-date live data from the database? Would it be a problem if you could only query data that is no newer than, say, 2 hours? Can you tell us how many rows are going to be in the observations table? (Just the magnitude.) – nagylzs Sep 21 '14 at 10:17
  • You may want to create an index on a separate column using hash: CREATE INDEX name ON table USING hash (column); – igauravsehrawat Sep 21 '14 at 19:07
  • You have 81000 records. Crucial questions: 1.) How many distinct stations? 2.) Do you have a table listing all stations? If not, any problem with creating and maintaining one? 3.) As *always*: your version of Postgres? 4.) Table definition of `observations` (the `CREATE` statement or `\d observations` in psql)? A *much* faster query should be possible, depending on the number of stations ... – Erwin Brandstetter Sep 21 '14 at 22:16
  • Some more details: It's an open source Rails app that collects wind data from cheap stations. Right now there are only about 3 stations that sample every 5 minutes (~288 observations a day, less when the 3G network is spotty). Live site: http://www.blast.nu/. https://github.com/remote-wind/remote-wind – max Sep 22 '14 at 11:51
  • For this kind of data distribution my presented queries should work wonders. – Erwin Brandstetter Sep 22 '14 at 12:24

2 Answers


Index

First, a multicolumn index will help:

CREATE INDEX observations_special_idx
ON observations (station_id, created_at DESC, id);

created_at DESC is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC.

This assumes created_at is defined NOT NULL; else consider DESC NULLS LAST in both index and query.

The last column id is only useful if you get an index-only scan out of it, which probably won't work if you add lots of new rows constantly. In this case, remove id from the index.
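If created_at is nullable after all, the matching variant would look like this (a sketch; the query's ORDER BY must then use the same ordering for the index to apply):

```sql
-- Hypothetical variant for a nullable created_at:
CREATE INDEX observations_special_idx
ON observations (station_id, created_at DESC NULLS LAST, id);

-- Queries must then sort the same way to use the index:
-- ORDER BY created_at DESC NULLS LAST
```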

Simpler query (still slow)

Simplify your query; the inner subselect doesn't help:

SELECT id
FROM  (
  SELECT station_id, id, created_at
       , row_number() OVER (PARTITION BY station_id
                            ORDER BY created_at DESC) AS rn
  FROM   observations
  ) s
WHERE  rn <= #{n}  -- your limit here
ORDER  BY station_id, created_at DESC;

Should be a bit faster, but still slow.
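To see why it stays slow, inspect the plan: a plain window function still forces Postgres to read and sort the whole table, so expect a sequential scan feeding a WindowAgg node. A quick check (a sketch, using n = 10 as an example limit):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM  (
   SELECT station_id, id, created_at
        , row_number() OVER (PARTITION BY station_id
                             ORDER BY created_at DESC) AS rn
   FROM   observations
   ) s
WHERE  rn <= 10
ORDER  BY station_id, created_at DESC;
```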

Fast query

  • Assuming you have relatively few stations and relatively many observations per station.
  • Also assuming station_id is defined NOT NULL.

To be really fast, you need the equivalent of a loose index scan (not yet implemented in Postgres).

If you have a separate table of stations (which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):

SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id
   FROM   observations o
   WHERE  o.station_id = s.station_id  -- lateral reference
   ORDER  BY o.created_at DESC
   LIMIT  #{n}  -- your limit here
   ) o
ORDER  BY s.station_id, o.created_at DESC;

If you don't have a table of stations, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.
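If you have to create that table first, a minimal one-time setup could look like this (a sketch; table and constraint names are assumptions, not part of the original schema):

```sql
-- Distill the existing stations once, then enforce relational integrity.
CREATE TABLE stations AS
SELECT DISTINCT station_id
FROM   observations;

ALTER TABLE stations ADD PRIMARY KEY (station_id);

ALTER TABLE observations
  ADD FOREIGN KEY (station_id) REFERENCES stations (station_id);
```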

If that's not an option, you can distill such a table on the fly. Simple options would be:

SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;

But either would need a sequential scan and be slow. Make Postgres use the above index (or any btree index with station_id as leading column) with a recursive CTE:

WITH RECURSIVE stations AS (
   (                  -- extra pair of parentheses ...
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )                  -- ... is required!
   UNION ALL
   SELECT (SELECT o.station_id
           FROM   observations o
           WHERE  o.station_id > s.station_id
           ORDER  BY o.station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL  -- serves as break condition
   )
SELECT station_id
FROM   stations
WHERE  station_id IS NOT NULL;      -- remove dangling row with NULL

Use that as drop-in replacement for the stations table in the above simple query:

WITH RECURSIVE stations AS (
   (
   SELECT station_id
   FROM   observations
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT o.station_id
           FROM   observations o
           WHERE  o.station_id > s.station_id
           ORDER  BY o.station_id
           LIMIT  1)
   FROM   stations s
   WHERE  s.station_id IS NOT NULL
   )
SELECT o.id
FROM   stations s
CROSS  JOIN LATERAL (
   SELECT o.id, o.created_at
   FROM   observations o
   WHERE  o.station_id = s.station_id
   ORDER  BY o.created_at DESC
   LIMIT  #{n}  -- your limit here
   ) o
WHERE  s.station_id IS NOT NULL
ORDER  BY s.station_id, o.created_at DESC;

This should still be faster than what you had by orders of magnitude.

db<>fiddle here
Old sqlfiddle

Erwin Brandstetter
  • Thanks for the very detailed answer, will try it out tonight. – max Sep 22 '14 at 15:24
  • runs at about 35 ms which is a huge improvement. Thanks! – max Sep 22 '14 at 18:48
  • @papirtiger: Which one? n = ? With or without `stations` table? Did you create the index (and run `ANALYZE checks`)? Do you see index-only scans in the `EXPLAIN ANALYZE` output? The improvement is nice, but I have seen better results. – Erwin Brandstetter Sep 22 '14 at 21:23
  • I have a `stations` table (o.stations_id is the foreign key) so I tried the first `JOIN LATERAL` query. Adding the `observations_special_idx` did not seem to have any major effect. I have only tried it on my local machine though, since I need to wait for my collaborator to update postgres on Heroku. – max Sep 22 '14 at 22:12
  • @papirtiger: Unless the table isn't *vacuumed* enough or heavily written, you should see `Index Only Scan using observations_special_idx on observations` like in the fiddle (check "View Execution Plan"). Do you see that in your `EXPLAIN` output? Note, I improved the index some more with `DESC`, but that shouldn't make much of a difference. The index should be used *in any case*, and have *major* effect. – Erwin Brandstetter Sep 22 '14 at 22:37
  • EXPLAIN output:
    Sort  (cost=9.04..9.16 rows=50 width=16)
      Sort Key: s.id, observations.created_at
      ->  Nested Loop  (cost=0.42..7.63 rows=50 width=16)
            ->  Seq Scan on stations s  (cost=0.00..1.05 rows=5 width=4)
            ->  Limit  (cost=0.42..1.12 rows=10 width=12)
                  ->  Index Scan Backward using observations_special_idx on observations  (cost=0.42..1422.69 rows=20382 width=12)
                        Index Cond: (station_id = s.id)
    – max Sep 22 '14 at 22:52
  • @papirtiger: OK, looks good. Use the updated index with `created_at DESC`. If you don't need sorted output, it gets a bit cheaper, yet. – Erwin Brandstetter Sep 22 '14 at 23:18

This is a good answer only if you are not required to query up-to-date live data.

Preparation (requires PostgreSQL 9.3)

drop materialized view if exists test;

create materialized view test as
select * from (
  SELECT station_id, id, created_at,
      row_number() OVER(
          PARTITION BY station_id
          ORDER BY created_at DESC
      ) as rn
  FROM observations
 ) q
WHERE q.rn <= 100 -- use a value that will be your max limit number for further queries
ORDER BY station_id, rn DESC;


create index idx_test on test(station_id,rn,created_at);

How to query data:

select * from test where rn < 10 order by station_id, created_at;

Your original query was 281 ms on my machine and this new one was 15 ms.

How to update the view with fresh data:

refresh materialized view test;
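If readers must not be blocked while the view is rebuilt, PostgreSQL 9.4+ adds REFRESH MATERIALIZED VIEW CONCURRENTLY, which requires a unique index on the view (an addition beyond this answer; the index name is an assumption):

```sql
-- Requires Postgres 9.4+ and a unique index on the materialized view:
CREATE UNIQUE INDEX idx_test_id ON test (id);

-- Readers can keep querying the old contents during the refresh:
REFRESH MATERIALIZED VIEW CONCURRENTLY test;
```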

I have another solution that does not require a materialized view and works with live, up-to-date data. But given that you don't need up-to-date data, this materialized view is much more efficient.

nagylzs
  • The *latest* records of a big table are hard (close to impossible) to cover with a materialized view, which is a better fit for read-only data. – Erwin Brandstetter Sep 22 '14 at 12:54
  • This is why I was asking questions like how much data there will be. His observations table is 8MB; it is very far from being big. It is also interesting whether he will be updating/deleting rows, or just adding new rows to that table. I have a lightweight solution that works if rows are only added, not updated or deleted. And another one that uses an index, but slows down the insertion of new observations. There will always be a tradeoff. – nagylzs Sep 22 '14 at 13:04
  • You are right about the tradeoff. The art is to get a good deal in these trades. A materialized view is just not the best tool for *latest* rows, since it only covers a snapshot from the past by definition - unless you refresh with every new entry automatically, which would be a high price to pay. – Erwin Brandstetter Sep 22 '14 at 13:10
  • The observation data is pretty much read-only, the table is still relatively small (80k rows) but if we get some sponsorship money to build/place more weather stations could grow exponentially. – max Sep 22 '14 at 15:23
  • It means that you will NEVER update or delete rows from this table. In this case, it would be better to update your "row number" from a trigger, when you insert the rows. Then you can create an index on the row number, and your whole query becomes a simple index scan... – nagylzs Sep 22 '14 at 16:59