I have the following table, stored as a TimescaleDB hypertable. The data rate is 1 row per second.

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY ("time", meter_id)
)

I would like to get the latest row in each time bucket over a longer period: for instance, the latest record of each month for the previous year. The following query works but is slow:

EXPLAIN ANALYZE
SELECT
DISTINCT ON (bucket)
time_bucket('1 month', "time", 'Europe/Amsterdam') AS bucket,
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-01-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY bucket DESC
Unique  (cost=0.42..542380.99 rows=200 width=40) (actual time=3654.263..59130.398 rows=12 loops=1)
  ->  Custom Scan (ChunkAppend) on electricity_data  (cost=0.42..514045.41 rows=11334231 width=40) (actual time=3654.260..58255.396 rows=11161474 loops=1)
        Order: time_bucket('1 mon'::interval, electricity_data."time", 'Europe/Amsterdam'::text, NULL::timestamp with time zone, NULL::interval) DESC
        ->  Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk  (cost=0.42..11530.51 rows=255951 width=40) (actual time=3654.253..3986.885 rows=255582 loops=1)
              Index Cond: ((""time"" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
              Rows Removed by Filter: 24330
        ->  Index Scan Backward using ""1529_1849_electricity_data_pkey"" on _hyper_12_1529_chunk  (cost=0.42..25777.81 rows=604553 width=40) (actual time=1.468..1810.493 rows=603808 loops=1)
              Index Cond: ((""time"" >= '2021-12-31 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
Planning Time: 57.424 ms
JIT:
  Functions: 217
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 43.496 ms, Inlining 18.805 ms, Optimization 2348.206 ms, Emission 1288.087 ms, Total 3698.594 ms
Execution Time: 59176.016 ms

Getting the latest row for a single month is instantaneous:

EXPLAIN ANALYZE
SELECT
"time",
import_low,
import_normal,
export_low,
export_normal
FROM electricity_data
WHERE meter_id = 1
AND "time" BETWEEN '2022-12-01T00:00:00 Europe/Amsterdam' AND '2023-01-01T00:00:00 Europe/Amsterdam'
ORDER BY "time" DESC
LIMIT 1
Limit  (cost=0.42..0.47 rows=1 width=40) (actual time=0.048..0.050 rows=1 loops=1)
  ->  Custom Scan (ChunkAppend) on electricity_data  (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.047..0.048 rows=1 loops=1)
        Order: electricity_data.""time"" DESC
        ->  Index Scan using _hyper_12_1533_chunk_electricity_data_time_idx on _hyper_12_1533_chunk  (cost=0.42..11530.51 rows=255951 width=40) (actual time=0.046..0.046 rows=1 loops=1)
              Index Cond: ((""time"" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
        ->  Index Scan Backward using ""1529_1849_electricity_data_pkey"" on _hyper_12_1529_chunk  (cost=0.42..25777.81 rows=604553 width=40) (never executed)
              Index Cond: ((""time"" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone) AND (meter_id = 1))
(...)
        ->  Index Scan using _hyper_12_1512_chunk_electricity_data_time_idx on _hyper_12_1512_chunk  (cost=0.42..8.94 rows=174 width=40) (never executed)
              Index Cond: ((""time"" >= '2022-11-30 23:00:00+00'::timestamp with time zone) AND (""time"" <= '2022-12-31 23:00:00+00'::timestamp with time zone))
              Filter: (meter_id = 1)
Planning Time: 2.162 ms
Execution Time: 0.152 ms

Is there a way to execute the query above for each month or custom time interval? Or is there a different way to speed up the first query?

Edit

The following query takes 10 seconds, which is much better, but still slower than the manual approach of running the single-month query above once per bucket. An index does not seem to make a difference.

EXPLAIN ANALYZE
SELECT MAX("time") AS "time"
FROM electricity_data
WHERE meter_id = 1
    AND "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
    AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
GROUP BY time_bucket('1 month', "time", 'Europe/Amsterdam');
(... plan removed)
Planning Time: 50.463 ms
JIT:
  Functions: 451
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 76.476 ms, Inlining 0.000 ms, Optimization 13.849 ms, Emission 416.718 ms, Total 507.043 ms
Execution Time: 9910.058 ms
mhvis
  • Wow, [time_bucket()](https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/) is a useful function. Where has it been all my report-generating life? – O. Jones Feb 09 '23 at 11:56
  • @O.Jones It works pretty well for my use case of visualizing with Grafana. Thank you for the edit. – mhvis Feb 09 '23 at 12:05
  • @O.Jones There's [`date_bin()`](https://www.postgresql.org/docs/14/functions-datetime.html#FUNCTIONS-DATETIME-BIN) in plain PostgreSQL 14.0+. Not sure why timescale doc compares `time_bucket()` to `date_trunc()` instead. – Zegarek Feb 10 '23 at 10:41
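
For reference, `date_bin()` takes a fixed-width stride and an origin, so it covers fixed intervals but not calendar units such as months; that limitation is where `time_bucket()` with a timezone comes in. A minimal sketch against the table above:

SELECT date_bin(INTERVAL '15 minutes', "time",
                TIMESTAMPTZ '2022-01-01 00:00:00 Europe/Amsterdam') AS bucket,
       max("time") AS latest_time
FROM electricity_data
WHERE meter_id = 1
GROUP BY bucket
ORDER BY bucket;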

3 Answers

I'd recommend using the `last()` aggregate and a continuous aggregate to solve this problem.

Like the previous poster, I'd also recommend an index on (meter_id, time) rather than the other way around. You can do this in your table definition by just changing the order of the columns in your primary key definition.

CREATE TABLE electricity_data
(
    "time" timestamptz NOT NULL,
    meter_id integer REFERENCES meters NOT NULL,
    import_low double precision,
    import_normal double precision,
    export_low double precision,
    export_normal double precision,
    PRIMARY KEY (meter_id, "time")
);

But that's a bit off topic. The basic query you'll want to do is something like:

SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'), 
    meter_id, 
    last(electricity_data, "time") 
FROM electricity_data 
GROUP BY 1, 2;

This is a bit confusing until you realize that the table itself is also a type in PostgreSQL, so this call to the `last` aggregate can ask for and return a composite value containing the latest row in the month, day, or whatever bucket you want.

Then you have to be able to treat that as a row again, so you expand it using parentheses and `.*`, which is how composite types are expanded in PostgreSQL.

SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
    meter_id, 
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1,2;

Now, in order to speed things up, you can turn that into a continuous aggregate which will make things much faster.

CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', "time", 'Europe/Amsterdam'),
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1, meter_id;

You'll note that I took meter_id out of the select list because it comes back from the composite type anyway; I don't need the redundant column, nor can I have two columns with the same name in a view. I did keep meter_id in the GROUP BY, though.

So that'll speed things up nicely, but, if I were you, I might actually think about doing this on a daily basis and creating a hierarchical continuous aggregate for this type of thing.

CREATE MATERIALIZED VIEW last_meter_day WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', "time", 'Europe/Amsterdam'),
    (last(electricity_data, "time")).*
FROM electricity_data 
GROUP BY 1, meter_id;

CREATE MATERIALIZED VIEW last_meter_month WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', time_bucket, 'Europe/Amsterdam') as month_bucket,
    (last(last_meter_day, time_bucket)).*
FROM last_meter_day 
GROUP BY 1, meter_id;

The reason for that is that we can't really refresh a monthly continuous aggregate very often; it's much easier to refresh a daily aggregate frequently and then roll that up into the monthly aggregate. You could also keep just the daily aggregate and roll up to months on the fly in your query, since that would be at most 31 days per meter, but of course that won't be as performant.
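
For example, rolling the daily aggregate up to months on the fly might look something like this (a sketch, assuming the daily view's bucket column keeps the default name time_bucket as above, and filtering to one meter as in the question):

SELECT time_bucket('1 month', time_bucket, 'Europe/Amsterdam') AS month_bucket,
    (last(last_meter_day, time_bucket)).*
FROM last_meter_day
WHERE meter_id = 1
GROUP BY 1, meter_id;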

You'll then have to create continuous aggregate policies for these based on what you want to have happen on refresh.
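
For example (a sketch; the offsets and schedule here are placeholders to adapt to how fresh you need the data to be):

SELECT add_continuous_aggregate_policy('last_meter_day',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

SELECT add_continuous_aggregate_policy('last_meter_month',
    start_offset      => INTERVAL '3 months',
    end_offset        => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 day');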

Depending on what you're trying to do with this data, I'd also suggest taking a look at counter_agg, as it might be useful for you. I also recently wrote a post in our forum about how to use it with electricity meters, which might be helpful depending on how you're processing this data.
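
As a rough illustration only (counter_agg and its delta accessor come from the separate timescaledb_toolkit extension; the column choice is just an example of treating a meter register as an ever-increasing counter):

SELECT time_bucket('1 day', "time", 'Europe/Amsterdam') AS day,
    meter_id,
    -- delta() = last reading minus first reading in the bucket, accounting for counter resets
    delta(counter_agg("time", import_normal)) AS import_normal_used
FROM electricity_data
GROUP BY 1, 2;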

davidk
  • Great forum post, very applicable. I have looked into continuous aggregates and have tried this. It works and is definitely a good solution. For me a perfect solution would be if it works for any given time interval. I know that you can roll up on the fly, but if there is a hierarchy you still need to determine from which table to query depending on the interval. Given that getting the latest row for a single month is instantaneous and the number of months is constant (max 100), I would expect that it should be possible to do under 1 second without a continuous aggregate. – mhvis Feb 09 '23 at 16:04
  • the problem is the time_bucket - an index including it might help, but I'm not sure it would meet your needs then... I think you could make a recursive query or use a few CTEs to do this nicely, but the cagg approach would probably work best, and just going with a day and rolling up from there will probably be enough for most queries? I dunno, I might be missing what you're asking; if you want, maybe ask a new question with the specifics and I can answer there? – davidk Feb 09 '23 at 16:08
  • Ah, okay, I see your question does briefly mention that you need more variable time frames. I still think that using daily caggs and re-aggregating is probably simplest in many cases, but for the more general case what you'd need is a recursive CTE that uses `WHERE "time" < [each_period] ORDER BY "time" DESC LIMIT 1`, and writing recursive CTEs is kinda annoying. I'd look at the ways that folks use recursive CTEs to emulate skip scan in Postgres. – davidk Feb 09 '23 at 16:20
  • Unfortunately, the best approach, which would be to use our skip scan optimization like this: https://stackoverflow.com/questions/75099438/sql-query-for-finding-latest-or-max-value-of-timestamp-from-table-corresponding/75114075#75114075, is not going to work because AFAIK it's not time_bucket aware; you could file a GitHub issue with that as a feature request though... – davidk Feb 09 '23 at 16:22
  • It's indeed what I'm asking. Probably using a daily aggregate will be good enough and I'm being too nitpicky/demanding here. I don't know yet about CTEs, I'm going to look into that. Thanks for all your suggestions. – mhvis Feb 09 '23 at 16:26

You can try an approach that uses a subquery to get the latest timestamp in each bucket. Then, join that to your detail table.

        SELECT meter_id, MAX("time") "time"
          FROM electricity_data
          WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
            AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
          GROUP BY meter_id, 
                   time_bucket('1 month', "time", 'Europe/Amsterdam')

That gets you a virtual table with the latest time for each meter in each time bucket (month in this case). It can be accelerated with this index: the same columns as your primary key, but in the opposite order. With the columns in that order the query can be satisfied with a relatively quick index scan.

CREATE INDEX meter_time ON electricity_data (meter_id, "time")

Then join that to your detail table. Like this.

SELECT d.meter_id,
       time_bucket('1 month', d."time", 'Europe/Amsterdam') AS bucket,
       d."time",
       d.import_low,
       d.import_normal,
       d.export_low,
       d.export_normal
  FROM electricity_data d
  JOIN (
        SELECT meter_id, MAX("time") "time"
          FROM electricity_data
          WHERE "time" >= '2022-01-01T00:00:00 Europe/Amsterdam'
            AND "time" < '2023-01-01T00:00:00 Europe/Amsterdam'
          GROUP BY meter_id, 
                   time_bucket('1 month', "time", 'Europe/Amsterdam')
       ) last ON d."time" = last."time" 
             AND d.meter_id = last.meter_id
 ORDER BY d.meter_id, bucket DESC

(I'm not completely sure of the syntax in TimescaleDB for columns that have the same name as reserved words like time, so this isn't tested.)

If you want just one meter, put a WHERE clause right before the last ORDER BY clause.

O. Jones
  • While this approach works, it doesn't really take advantage of some of the tools we have for this in TimescaleDB. I've posted an answer below that I think will be significantly more performant and easier to understand. This is actually one of the types of queries that we try to optimize, though I think sometimes we don't have enough examples for them! – davidk Feb 09 '23 at 15:31

The other answers are likely more useful in most cases. I wanted a solution that works for any interval, without the need for continuous aggregates.

I ended up with the following query, using a lateral join. I use the `lag()` window function to compute energy consumption/generation per bucket (omitted below; a sketch follows the query). The Grafana variables $__interval, $__timeFrom() and $__timeTo() specify the chosen bucket interval and time range.

SELECT bucket, import_low, import_normal, export_low, export_normal
FROM (
  SELECT
  tstzrange(
    -- Could also use date_trunc or date_bin
    time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
    time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
    '(]'  -- We use an inclusive upper bound, because a meter reading on the upper boundary applies to the previous period
  ) bucket
  FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
) buckets
LEFT JOIN LATERAL (
  SELECT *
  FROM electricity_data
  WHERE meter_id = $meterId AND "time" <@ bucket
  ORDER BY "time" DESC
  LIMIT 1
) elec ON true
ORDER BY bucket;
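
For completeness, a sketch of that omitted lag step (the *_used aliases are made up here): wrap the query above in a CTE and diff consecutive readings.

WITH readings AS (
  SELECT bucket, import_low, import_normal, export_low, export_normal
  FROM (
    SELECT tstzrange(
      time_bucket(INTERVAL '$__interval', d, 'Europe/Amsterdam'),
      time_bucket(INTERVAL '$__interval', d + INTERVAL '$__interval', 'Europe/Amsterdam'),
      '(]'
    ) bucket
    FROM generate_series($__timeFrom(), $__timeTo(), INTERVAL '$__interval') d
  ) buckets
  LEFT JOIN LATERAL (
    SELECT *
    FROM electricity_data
    WHERE meter_id = $meterId AND "time" <@ bucket
    ORDER BY "time" DESC
    LIMIT 1
  ) elec ON true
)
SELECT bucket,
  import_low    - lag(import_low)    OVER (ORDER BY bucket) AS import_low_used,
  import_normal - lag(import_normal) OVER (ORDER BY bucket) AS import_normal_used,
  export_low    - lag(export_low)    OVER (ORDER BY bucket) AS export_low_used,
  export_normal - lag(export_normal) OVER (ORDER BY bucket) AS export_normal_used
FROM readings
ORDER BY bucket;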
mhvis
  • You could combine a [`tstzrange`](https://www.postgresql.org/docs/current/rangetypes.html#RANGETYPES-BUILTIN) constructor with a [`generate_series()`](https://www.postgresql.org/docs/current/functions-srf.html#FUNCTIONS-SRF-SERIES). The range type will let you compare your timestamps with a single [`<@`](https://www.postgresql.org/docs/current/functions-range.html) instead of a `between` or a pair of `<` and `>=`. [Demo](https://dbfiddle.uk/aqiDuxiQ) – Zegarek Feb 10 '23 at 11:19