
Right now I am grouping data by the minute:

SELECT
    date_trunc('minute', ts) ts,
    ...
FROM binance_trades
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts

But I would like to group by 5 seconds instead.

I found this question: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)

with the answer:

SELECT date_bin(
          INTERVAL '5 minutes',
          measured_at,
          TIMESTAMPTZ '2000-01-01'
       ),
       sum(val)
FROM measurements
GROUP BY 1;

I am using v13.3, and date_bin() was only introduced in PostgreSQL 14, so it doesn't exist on my version of Postgres.
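
For reference, this is what I understand the PostgreSQL 14+ version would look like adapted to five seconds (a sketch only; I can't run date_bin() on 13.3, and SUM(price) stands in for my real aggregates):

SELECT date_bin(
          INTERVAL '5 seconds',   -- bin width
          ts,                     -- column to bin
          TIMESTAMP '2000-01-01'  -- origin the bins are aligned to
       ) AS ts_bin,
       instrument,
       SUM(price)
FROM binance_trades
GROUP BY 1, instrument
ORDER BY 1;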

I don't quite understand the other answers in that question either.


Output of EXPLAIN ANALYZE:

Limit  (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.072..76239.529 rows=13 loops=1)
  ->  GroupAggregate  (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.071..76239.526 rows=13 loops=1)
        Group Key: j.win_start
        ->  Sort  (cost=6379566.36..6380086.58 rows=208088 width=28) (actual time=76238.000..76238.427 rows=5335 loops=1)
              Sort Key: j.win_start
              Sort Method: quicksort  Memory: 609kB
              ->  Nested Loop  (cost=1000.00..6356204.58 rows=208088 width=28) (actual time=23971.722..76237.055 rows=5335 loops=1)
                    Join Filter: (j.ts_win @> b.ts)
                    Rows Removed by Join Filter: 208736185
                    ->  Seq Scan on binance_trades b  (cost=0.00..3026558.81 rows=16006783 width=28) (actual time=0.033..30328.674 rows=16057040 loops=1)
                          Filter: ((instrument)::text = 'ETHUSDT'::text)
                          Rows Removed by Filter: 126872903
                    ->  Materialize  (cost=1000.00..208323.11 rows=13 width=30) (actual time=0.000..0.001 rows=13 loops=16057040)
                          ->  Gather  (cost=1000.00..208323.05 rows=13 width=30) (actual time=3459.850..3461.076 rows=13 loops=1)
                                Workers Planned: 2
                                Workers Launched: 2
                                ->  Parallel Seq Scan on scalper_5s_intervals j  (cost=0.00..207321.75 rows=5 width=30) (actual time=2336.484..3458.397 rows=4 loops=3)
                                      Filter: ((win_start >= '2021-08-20 00:00:00'::timestamp without time zone) AND (win_start <= '2021-08-20 00:01:00'::timestamp without time zone))
                                      Rows Removed by Filter: 5080316
Planning Time: 0.169 ms
Execution Time: 76239.667 ms
Thomas

1 Answer


If you're only interested in a five-second distribution rather than exact windows around when the events really happened, you can round each timestamp down to the nearest five seconds using date_trunc() and mod(), then group by that. For example, 22:50:07 has second 7, and mod(7, 5) = 2, so subtracting 2 seconds puts it in the 22:50:05 bin.

SELECT
  date_trunc('second', ts)
    - MOD(EXTRACT(SECOND FROM date_trunc('second', ts))::int, 5) * interval '1 sec' AS ts_bin,
  SUM(price)
FROM binance_trades
WHERE instrument = 'ETHUSDT'
  AND ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY 1;
  • Here I assume that ts and instrument are properly indexed; a sketch of such an index follows.
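
A minimal sketch of what that index could look like (the name is illustrative); a composite index lets the planner serve both the equality filter on instrument and the range filter on ts:

-- Illustrative name; covers WHERE instrument = ... AND ts BETWEEN ...
CREATE INDEX idx_binance_trades_instrument_ts
    ON binance_trades (instrument, ts);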

However, if this analysis is sensitive to time accuracy and you cannot afford to round the timestamps, create the time windows as tsrange values within a CTE (or subquery), then in the outer query join binance_trades.ts against them using the containment operator @>.

WITH j AS (
  SELECT i AS win_start, tsrange(i, i + interval '5 sec') AS ts_win
  FROM generate_series(
    (SELECT min(date_trunc('second', ts)) FROM binance_trades
     WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
    (SELECT max(date_trunc('second', ts)) FROM binance_trades
     WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
    interval '5 sec') j (i))
SELECT
    j.win_start AS ts,
    SUM(price)
FROM j
JOIN binance_trades b ON j.ts_win @> b.ts
GROUP BY j.win_start
ORDER BY j.win_start
LIMIT 5;

There is a caveat, though: this approach gets pretty slow when you create a 5-second series over a large time window, because the newly generated records have to be joined against binance_trades without an index. To overcome this you can materialize the windows in their own table and index it:

CREATE UNLOGGED TABLE scalper_5s_intervals AS
SELECT i AS win_start, tsrange(i, i + interval '5 sec') AS ts_win
FROM generate_series(
  (SELECT min(date_trunc('second', ts)) FROM binance_trades
   WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
  (SELECT max(date_trunc('second', ts)) FROM binance_trades
   WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
  interval '5 sec') j (i);

CREATE INDEX idx_ts_5sec ON scalper_5s_intervals USING gist (ts_win);
CREATE INDEX idx_ts_5sec_winstart ON scalper_5s_intervals USING btree(win_start);

UNLOGGED tables are much faster than regular ones, but keep in mind that they're not crash-safe. See the documentation:

If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.

After that your query will become much faster than the CTE approach, but still significantly slower than the first query with the rounded timestamps.

SELECT
    j.win_start AS ts,
    SUM(price)
FROM scalper_5s_intervals j
JOIN binance_trades b ON j.ts_win @> b.ts
WHERE j.win_start BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY j.win_start
ORDER BY j.win_start;
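
Side note: the min()/max() subqueries above scan binance_trades just to find the series bounds. If the time window is already known, a sketch that passes the bounds to generate_series() as literals instead (same window as above):

-- Same table as above, but the series bounds are literals instead of
-- min()/max() subqueries over binance_trades
CREATE UNLOGGED TABLE scalper_5s_intervals AS
SELECT i AS win_start, tsrange(i, i + interval '5 sec') AS ts_win
FROM generate_series(timestamp '2021-08-19 22:50:00',
                     timestamp '2021-08-20 01:00:00',
                     interval '5 sec') j (i);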

Demo: db<>fiddle

Jim Jones
  • ok, so I get it now: you have to generate the time range manually, there is nothing built in for that. thanks! – Thomas Aug 26 '21 at 11:28
  • @Thomas you could also use the window functions `lag()` and `lead()` instead of creating a `tsrange`, but I don't think it would make things easier. I still haven't tested `date_bin` - will do as soon as pg14 is released ;). Cheers – Jim Jones Aug 26 '21 at 11:31
  • there is an issue with this though: the SELECT max(ts), min(ts) makes the query extremely slow since we're in the millions of rows in that situation. Since I have a where clause giving a time range, could the range be applied to the series? – Thomas Aug 26 '21 at 12:37
  • I added the TS range to the series, in order to limit how much is covered; but it looks like I also need to keep the time range in the WHERE in the main query; having it in the timestamp series doesn't seem to be enough – Thomas Aug 26 '21 at 12:43
  • @Thomas creating a 5 sec interval window for a large timespan can be indeed quite slow. Specially when you need to join them without an index. Would a temporary table be an option for you? – Jim Jones Aug 26 '21 at 12:48
  • @Thomas please check the last query in this fiddle: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=213bb3bf634e16deeb89e4f516d0d431 – Jim Jones Aug 26 '21 at 13:09
  • ah yes! that would definitely work. We do continuous capture but don't keep old data, I could have a process just generating the timestamps ahead, and removing old ones, and then all queries could refer to that table. Thanks Jim. – Thomas Aug 26 '21 at 13:29
  • the last version performs extremely slow as I am adding a time range. I made a large table of ranges (until 2023) to simulate how it may end up being; I put my test there: https://pastebin.com/Y4f1mHRA – Thomas Aug 26 '21 at 15:50
  • @Thomas please add the query plan (`EXPLAIN ANALYSE`), you're most likely missing an index, e.g. `win_start` -> `CREATE INDEX idx_win_start ON tmp_binance_trades_interval USING btree (win_start);` – Jim Jones Aug 26 '21 at 17:54
  • I edited the question with the output of explain – Thomas Aug 26 '21 at 18:38
  • @Thomas the query plan suggests you need an index on `instrument`: `Seq Scan on binance_trades b (cost=0.00..3026558.81 rows=16006783 width=28) (actual time=0.033..30328.674 rows=16057040 loops=1) Filter: ((instrument)::text = 'ETHUSDT'::text)`. I must say it has little to do with your original question. – Jim Jones Aug 26 '21 at 20:21
  • that's true, I guess we drifted as I was trying to make the solution work (I've an index on instrument btw). I think I should have mentioned that the data set is very large when we got started. Either I'm doing something completely wrong, or Postgres is just not a good fit for this type of queries because it is so slow that it's not remotely usable (I'm at 1m30 per query, I need to do this roughly 20x per minute). – Thomas Aug 26 '21 at 20:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236461/discussion-between-thomas-and-jim-jones). – Thomas Aug 26 '21 at 21:02
  • @Thomas at least in my pc with the time window you provided the query runs in ~ 0.2 ms (this value might change with "real" data). In the query plan you can see that there are several sequential scans, which in such a large table gets quite expensive. Could you create these indexes and try again? https://dbfiddle.uk/?rdbms=postgres_13&fiddle=1a07da33ef16d997cf8426705f1ee429 – Jim Jones Aug 27 '21 at 07:21