54

I have my measurement data stored in the following structure:

CREATE TABLE measurements(
measured_at TIMESTAMPTZ,
val INTEGER
);

I already know that using

(a) date_trunc('hour',measured_at)

AND

(b) generate_series

I would be able to aggregate my data by microseconds, milliseconds, and so on.

But is it possible to aggregate the data by 5 minutes, or, say, by an arbitrary multiple of seconds?

I need the data aggregated by different time resolutions to feed them into a FFT or an AR-Model in order to see possible seasonalities.

user1612798

10 Answers

59

You can generate a table of "buckets" by adding intervals created by generate_series(). This SQL statement generates a table of five-minute buckets for the first day in your data (the day of min(measured_at)).

select 
  (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
  (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n

Wrap that statement in a common table expression, and you can join and group on it as if it were a base table.

with five_min_intervals as (
  select 
    (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
    (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
  from generate_series(0, (24*60), 5) n
)
select f.start_time, f.end_time, avg(m.val) avg_val 
from measurements m
right join five_min_intervals f 
        on m.measured_at >= f.start_time and m.measured_at < f.end_time
group by f.start_time, f.end_time
order by f.start_time

Grouping by an arbitrary number of seconds is similar--use date_trunc().
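
For illustration, here is a sketch of the same generate_series() bucket approach with second-sized steps (the 90-second bucket size is just an assumed example value):

with ninety_second_intervals as (
  select
    (select min(measured_at)::date from measurements) + ( n     || ' seconds')::interval start_time,
    (select min(measured_at)::date from measurements) + ((n+90) || ' seconds')::interval end_time
  from generate_series(0, 24*60*60, 90) n
)
select i.start_time, i.end_time, avg(m.val) avg_val
from measurements m
right join ninety_second_intervals i
        on m.measured_at >= i.start_time and m.measured_at < i.end_time
group by i.start_time, i.end_time
order by i.start_time;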


A more general use of generate_series() lets you avoid guessing the upper limit for five-minute buckets. In practice, you'd probably build this as a view or a function. You might get better performance from a base table.

select 
  (select min(measured_at)::date from measurements) + ( n    || ' minutes')::interval start_time,
  (select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, ((select max(measured_at)::date - min(measured_at)::date from measurements) + 1)*24*60, 5) n;
Mike Sherrill 'Cat Recall'
  • `generate_series(timestamp, timestamp, interval)` suits this case better, see Bill's answer below for why. – Alexander Gonchiy Nov 23 '16 at 14:38
  • @AlexanderGonchiy: We don't know whether `generate_series(timestamp, timestamp, interval)` was a better choice for this 4-year-old question. There's no version information in the question. Timestamp arguments were introduced in 8.4; earlier versions were still widely used in 2012. Even today, some derived products like GreenPlum and Amazon Redshift don't support timestamp arguments in `generate_series`. – Mike Sherrill 'Cat Recall' Nov 24 '16 at 00:00
  • 1
    This solution makes interval boundaries move depending on the min and max timestamp available in the data. It may not work for a dynamic (aka "live") chart because the labels along the X axis will change. If this is the case, absolute intervals (not bound to min/max data) may work better. – Meglio May 01 '20 at 07:06
16

Catcall has a great answer. My example of using it demonstrates fixed buckets, in this case 30-minute intervals starting at midnight. It also shows that Catcall's first version can generate one extra bucket, and how to eliminate it. I wanted exactly 48 buckets in a day. In my problem, observations have separate date and time columns, and I want to average the observations within each 30-minute period across the month for a number of different services.

with intervals as (
    select
        (n||' minutes')::interval as start_time, 
        ((n+30)|| ' minutes')::interval as end_time
    from generate_series(0, (23*60+30), 30) n
)
select i.start_time, o.service, avg(o.o)
from observations o
right join intervals i
        on o.time >= i.start_time and o.time < i.end_time
where o.date between '2013-01-01' and '2013-01-31'
group by i.start_time, i.end_time, o.service
order by i.start_time
Julian
15

How about

SELECT MIN(val),
       floor(EXTRACT(epoch FROM measured_at) / EXTRACT(epoch FROM INTERVAL '5 min')) AS int
FROM measurements
GROUP BY int

where '5 min' can be any expression supported by INTERVAL, and floor() maps every timestamp in the same 5-minute window to the same bucket number.

grisha
  • To make time buckets: `ROUND(EXTRACT(epoch FROM "when") / EXTRACT(epoch FROM INTERVAL '5 min')) * EXTRACT(epoch FROM INTERVAL '5 min') AS "my_time_bucket"` – Gelldur Sep 27 '22 at 13:35
10

The following will give you buckets of any size, even if they don't align well with a nice minute/hour/whatever boundary. The value "300" is for a 5-minute grouping, but any value can be substituted:

select measured_at, 
       val, 
       (date_trunc('seconds', (measured_at - timestamptz 'epoch') / 300) * 300 + timestamptz 'epoch') as aligned_measured_at
from measurements;

You can then use whatever aggregate you need around "val", and use "group by aligned_measured_at" as required.
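
For instance, a sketch of the aggregated form (same assumed 300-second bucket size, averaging val):

select (date_trunc('seconds', (measured_at - timestamptz 'epoch') / 300) * 300
        + timestamptz 'epoch') as aligned_measured_at,
       avg(val) as avg_val
from measurements
group by aligned_measured_at
order by aligned_measured_at;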

Chris Cogdon
10

From PostgreSQL v14 on, you can use the date_bin function for that:

SELECT date_bin(
          INTERVAL '5 minutes',
          measured_at,
          TIMESTAMPTZ '2000-01-01'
       ),
       sum(val)
FROM measurements
GROUP BY 1;
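
The third argument is the origin the buckets are aligned to, so changing it shifts every bucket boundary. For example (the timestamps below are made-up illustrative values):

SELECT date_bin(INTERVAL '5 minutes',
                TIMESTAMP '2024-01-01 10:04:00',
                TIMESTAMP '2000-01-01 00:02:30');
-- returns 2024-01-01 10:02:30, because the buckets run :02:30, :07:30, :12:30, ...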
geisterfurz007
Laurenz Albe
7

This is based on Mike Sherrill's answer, except that it uses timestamp intervals instead of separate start/end columns.

with intervals as (
    select tstzrange(s, s + '5 minutes') das_interval
    from (select generate_series(min(lower(time_range)), max(upper(time_range)), '5 minutes') s
          from your_table) x)
select das_interval, your_table.*
from   your_table
right join intervals on time_range && das_interval
order by das_interval;
BillRobertson42
5

I wanted to look at the past 24 hours of data and count things in hourly increments. I started with Cat Recall's solution, which is pretty slick. It's bound to the data, though, rather than just to what's happened in the past 24 hours. So I refactored and ended up with something pretty close to Julian's solution, but with more CTEs. It's sort of a marriage of the two answers.

WITH interval_query AS (
    SELECT (ts ||' hour')::INTERVAL AS hour_interval
    FROM generate_series(0,23) AS ts
), time_series AS (
    SELECT date_trunc('hour', now()) + INTERVAL '60 min' * ROUND(date_part('minute', now()) / 60.0) - interval_query.hour_interval AS start_time
    FROM interval_query
), time_intervals AS (
    SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time
    FROM time_series ORDER BY start_time
), reading_counts AS (
    SELECT f.start_time, f.end_time, br.minor, count(br.id) readings
    FROM beacon_readings br
    RIGHT JOIN time_intervals f
                    ON br.reading_timestamp >= f.start_time AND br.reading_timestamp < f.end_time AND br.major = 4
    GROUP BY f.start_time, f.end_time, br.minor
    ORDER BY f.start_time, br.minor
)
SELECT * FROM reading_counts

Note that any additional filtering I wanted needed to be done in the RIGHT JOIN's ON clause rather than in a WHERE clause, otherwise the empty intervals would be dropped. I'm not suggesting this is necessarily the best (or even a good) approach, but it is something I'm running with (at least at the moment) in a dashboard.

Barrett Clark
4

The TimescaleDB extension for PostgreSQL gives you the ability to group by arbitrary time intervals. The function is called time_bucket() and has the same syntax as date_trunc(), but it takes an interval instead of a time precision as its first parameter. See its API docs for details. This is an example:

SELECT
  time_bucket('5 minutes', observation_time) as bucket,
  device_id,
  avg(metric) as metric_avg,
  max(metric) - min(metric) as metric_spread
FROM
  device_readings
GROUP BY bucket, device_id;

You may also want to look at continuous aggregates if you want these "grouped by an interval" views to be updated automatically as new data is ingested and you query them frequently. This can save you a lot of resources and make your queries much faster.
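
As a rough sketch (assuming device_readings has been made a TimescaleDB hypertable, with the same columns as the example above), a continuous aggregate for that 5-minute query could look like this:

CREATE MATERIALIZED VIEW device_readings_5min
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('5 minutes', observation_time) as bucket,
  device_id,
  avg(metric) as metric_avg,
  max(metric) - min(metric) as metric_spread
FROM device_readings
GROUP BY bucket, device_id;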

Tom Böttger
3

I've taken a synthesis of all the above to try and come up with something slightly easier to use:

create or replace function interval_generator(start_ts timestamp with TIME ZONE, end_ts timestamp with TIME ZONE, round_interval INTERVAL)
    returns TABLE(start_time timestamp with TIME ZONE, end_time timestamp with TIME ZONE) as $$
BEGIN
return query
        SELECT
            (n)       start_time,
            (n + round_interval) end_time
        FROM generate_series(date_trunc('minute', start_ts), end_ts, round_interval) n;
END
$$
    LANGUAGE 'plpgsql';

This function is a timestamp abstraction of Mike's answer, which (IMO) makes things a little cleaner, especially if you're generating queries on the client end.

Using an inner join also gets rid of the sea of NULLs that appeared previously.

with intervals as (select * from interval_generator(NOW() - INTERVAL '24 hours' , NOW(), '30 seconds'::INTERVAL))
select f.start_time, m.session_id, m.metric, min(m.value) min_val, avg(m.value) avg_val, max(m.value) max_val
from ts_combined as m
inner JOIN intervals f
    on m.time >= f.start_time and m.time < f.end_time
GROUP BY f.start_time, f.end_time, m.metric, m.session_id
ORDER BY f.start_time desc

(For my purposes I also added a few more aggregation fields.)

Bolster
2

Perhaps you can extract(epoch from measured_at) and go from there?
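
A minimal sketch of that idea, assuming the measurements table from the question and a 300-second (5-minute) bucket:

select to_timestamp(floor(extract(epoch from measured_at) / 300) * 300) as bucket_start,
       avg(val) as avg_val
from measurements
group by bucket_start
order by bucket_start;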

Michael Krelin - hacker