I'm using PostgreSQL 9.6. I have a table like this:

mac   sn         loc   time     date      vin1    vin2    vin3
1a34 4as11111111 aaaa  7:06:18  1/1/2018  447.42  472.32  682.59
1a34 4as11111111 aaaa  7:06:43  1/1/2018  455.97  476.25  682.59
1a34 4as11111111 aaaa  7:07:35  1/1/2018  470.88  484.2   682.5

I need to calculate the averages of vin1, vin2, and vin3 within time intervals of 300 seconds (5 minutes), starting from the first timestamp (e.g. 7:06:18 - 7:11:18), for the dates in a given range. I can select the data I need with this query:

select * from tbl
where sn = '4as11111111' and date between '2018-01-01' and '2018-01-02';

But I don't know how to group it into 300-second intervals and calculate the averages of the vin1, vin2, vin3 columns for each 5-minute interval, to get something like this:

mac  sn          loc     time     date      vin1_av  vin2_av  vin3_av
1a34 4as11111111 aaaa   7:06:18  1/1/2018  450.0    480.32   600.59
1a34 4as11111111 aaaa   7:11:18  1/1/2018  460.0    490.25   782.59
1a34 4as11111111 aaaa   7:16:18  1/1/2018  470.88   500.2    600.5

Any help would be greatly appreciated.

lugger1
  • What about time intervals without any existing rows? Include them in the result with `0` or `NULL`, or no row at all? Can the range span more than one day? Also, your example data & result do not match. Nor does the query (different date). – Erwin Brandstetter May 10 '18 at 02:32
  • All the rows are guaranteed to have data; there could be 0.0 V but not NULL. There could be time slots with no data, you are right, so those slots should be ignored. The span can be weeks or months. Sorry about the query dates (I fixed it); they don't match the results I want, I just copied one of my experimental queries as an example. – lugger1 May 10 '18 at 16:23
  • Thank you Erwin. My real table has a first column "mac" (see the edited question; sorry I didn't include it in the original post, I didn't think it mattered), and it changes the query so that I can't make it work using your solution. I feel like it is correct, but I still can't figure out the changes. – lugger1 May 21 '18 at 22:29
  • Looks like `mac` and `loc` are always the same for the same `sn`. Is that so? If not, how would that reflect on the desired result? I updated the solution on an educated guess. – Erwin Brandstetter May 21 '18 at 23:49
  • @Erwin: No, mac and loc could be different, but it doesn't matter much as I'm mostly interested in selection by sn. It's weird: I played with your query at dbfiddle (https://dbfiddle.uk/?rdbms=postgres_10&fiddle=99bcc8db71cab3abc05bc039c77d7aac) and it works perfectly there with the example data, but when I try it with real data (as in the dbfiddle now), there are "no rows returned" (see the dbfiddle link). And on my local db I get "ERROR: ORDER/GROUP BY expression not found in targetlist". So again, I missed something simple there. – lugger1 May 22 '18 at 18:38
  • Right, thanks a lot, Erwin. My bad. – lugger1 May 22 '18 at 22:18
  • Is it possible to do it in case some column appears to be empty (vin1 without any data, for example) for the selected date interval? – lugger1 May 30 '18 at 21:54
  • The aggregate function `avg()` ignores NULL values. If all rows have NULL, the result is NULL. You might enclose it in `COALESCE()` to return 0 (or anything else) instead. – Erwin Brandstetter May 31 '18 at 16:27
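
For illustration, a minimal sketch of that `COALESCE()` wrapping (hypothetical; tbl and the filter are taken from the answer below, applied to a single aggregate):

SELECT round(COALESCE(avg(vin1), 0)::numeric, 2) AS vin1_av  -- 0 instead of NULL when no rows qualify
FROM   tbl
WHERE  sn = '4as11111111';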

1 Answer

DB design

While you can work with separate date and time columns, there is really no advantage over a single timestamp column. I would adapt the table:

ALTER TABLE tbl ADD column ts timestamp;
UPDATE tbl SET ts = date + time;  -- assuming actual date and time types
ALTER TABLE tbl DROP column date, DROP column time;

If date and time are not actual date and time data types, use to_timestamp().
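
For example, a sketch assuming date and time are stored as text in the format of the sample data (1/1/2018 and 7:06:18); the format string would have to match your actual data:

UPDATE tbl
SET    ts = to_timestamp(date || ' ' || time, 'MM/DD/YYYY HH24:MI:SS');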

Query

Then the query is a bit simpler:

SELECT *
FROM  (
   SELECT sn, generate_series(min(ts), max(ts), interval '5 min') AS ts
   FROM   tbl
   WHERE  sn = '4as11111111'
   AND    ts >= '2018-01-01'
   AND    ts <  '2018-01-02'
   GROUP  BY 1
   ) grid
CROSS  JOIN LATERAL (
   SELECT round(avg(vin1), 2) AS vin1_av
        , round(avg(vin2), 2) AS vin2_av
        , round(avg(vin3), 2) AS vin3_av
   FROM   tbl
   WHERE  sn =  grid.sn
   AND    ts >= grid.ts
   AND    ts <  grid.ts + interval '5 min'
   ) avg;

Generate a grid of start times in the first subquery grid, running from the first to the last qualifying row in the given time frame.
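
To see what that grid produces on its own, a standalone example using the times from the sample data:

SELECT generate_series(timestamp '2018-01-01 07:06:18'
                     , timestamp '2018-01-01 07:17:00'
                     , interval '5 min') AS ts;

-- ts
-- 2018-01-01 07:06:18
-- 2018-01-01 07:11:18
-- 2018-01-01 07:16:18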

Join to rows that fall in each partition with a LATERAL join and immediately aggregate averages in the subquery avg. Due to the aggregates, it always returns a row even if no entries are found. Averages default to NULL in this case.

The result includes all time slots between the first and last qualifying row in the given time frame. Various other result compositions would make sense, too, like including all time slots in the given time frame, or only time slots with actual values (see the sketch below). All are possible; I had to pick one interpretation.
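
A sketch of that last variant, keeping only time slots with actual rows: it adds a row count to the LATERAL subquery and filters on it (same table and columns as above; the cast to numeric makes round() work for float columns, too):

SELECT grid.sn, grid.ts, a.vin1_av, a.vin2_av, a.vin3_av
FROM  (
   SELECT sn, generate_series(min(ts), max(ts), interval '5 min') AS ts
   FROM   tbl
   WHERE  sn = '4as11111111'
   AND    ts >= '2018-01-01'
   AND    ts <  '2018-01-02'
   GROUP  BY 1
   ) grid
CROSS  JOIN LATERAL (
   SELECT count(*) AS n  -- 0 for empty slots
        , round(avg(vin1)::numeric, 2) AS vin1_av
        , round(avg(vin2)::numeric, 2) AS vin2_av
        , round(avg(vin3)::numeric, 2) AS vin3_av
   FROM   tbl
   WHERE  sn =  grid.sn
   AND    ts >= grid.ts
   AND    ts <  grid.ts + interval '5 min'
   ) a
WHERE  a.n > 0;  -- drop slots without any rows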

Index

At least have this multicolumn index:

CREATE INDEX foo_idx ON tbl (sn, ts);

Or on (sn, ts, vin1, vin2, vin3) to allow index-only scans - if some preconditions are met and especially if table rows are much wider than in the demo.
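
That wider variant would look like this (the index name is just an example):

CREATE INDEX foo_covering_idx ON tbl (sn, ts, vin1, vin2, vin3);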

Based on your original table

This version follows the question as clarified in the comments and later updated again to include the columns mac and loc. I assume you want separate averages per (mac, loc).

Here, date and time are still separate columns, the vin* columns are type float, and time slots without rows are excluded:

The updated query also moves the set-returning function generate_series() to the FROM list, which is cleaner before Postgres 10:

SELECT t.mac, sn.sn, t.loc, ts.ts::time AS time, ts.ts::date AS date
     , t.vin1_av, t.vin2_av, t.vin3_av
FROM  (SELECT text '4as11111111') sn(sn)  -- provide sn here once
CROSS  JOIN LATERAL (
   SELECT min(date+time) AS min_ts, max(date+time) AS max_ts
   FROM   tbl
   WHERE  sn = sn.sn
   AND    date+time >= '2018-01-01 0:0'   -- provide time frame here
   AND    date+time <  '2018-01-02 0:0'
   ) grid
CROSS  JOIN LATERAL generate_series(min_ts, max_ts, interval '5 min') ts(ts)
CROSS  JOIN LATERAL (
   SELECT mac, loc
        , round(avg(vin1)::numeric, 2) AS vin1_av  -- cast to numeric for round()
        , round(avg(vin2)::numeric, 2) AS vin2_av  -- but rounding is optional
        , round(avg(vin3)::numeric, 2) AS vin3_av
   FROM   tbl
   WHERE  sn = sn.sn
   AND    date+time >= ts.ts
   AND    date+time <  ts.ts + interval '5 min'
   GROUP  BY mac, loc
   HAVING count(*) > 0  -- exclude empty slots
   ) t;

Create a multicolumn expression index to support this:

CREATE INDEX bar_idx ON tbl (sn, (date+time));

But I would much rather use timestamp all along.

Erwin Brandstetter
  • Thank you very much, Erwin. Is it possible to do what I want with the existing DB, with separate date and time columns? There are indexes sn_idx and date_idx, as those queries happen frequently, but vin1 etc. are all float and not indexed. And you are right, I don't need empty time slots filled with NULLs (at some parts of the day there could be no actual values). – lugger1 May 10 '18 at 16:20
  • @lugger1: I added another solution. – Erwin Brandstetter May 11 '18 at 01:00
  • Thank you Erwin. When I'm trying this in postgres, I'm getting the "ERROR: ORDER/GROUP BY expression not found in targetlist" for some reason. Am I missing something? – lugger1 May 11 '18 at 18:34
  • @lugger1: I tested with your version 9.6 and it works. See the added fiddle. – Erwin Brandstetter May 12 '18 at 02:39
  • It was working, but now I'm getting the "ERROR: ORDER/GROUP BY expression not found in targetlist" error. I found that a query like "select sn, generate_series(min(date+time), max(date+time), interval '5 min') AS ts from tbl where sn = '4as11111111' and date+time >= '2018-01-01 0:0' and date+time < '2018-01-01 10:0' group by 1 limit 10;" works, but the same query without "limit 10" (even though the output is 8 rows, not 10) gives the "ORDER/GROUP BY expression not found in targetlist" error. It seems that this is related to the sn_idx index that was created to improve speed. Any suggestions how to fix? – lugger1 Jun 04 '18 at 21:30
  • @lugger1: That does not seem to make sense. Neither the existence of an index nor the added `LIMIT` clause should cause or prevent such an error. Are you running the latest point release of pg 9.6 (currently 9.6.9)? Or maybe the SRF in the SELECT list has side effects. (That was sanitized in pg 10.) I added a cleaner version for pg 9.6. – Erwin Brandstetter Jun 05 '18 at 02:49
  • The DB was updated to 10.4 some time ago, and the only change that was made since the query stopped working was the creation of the index (sn_idx). I found here (https://stackoverflow.com/questions/50598053/postgresql-query-without-where-only-order-by-and-limit-doesnt-use-index) some hints about the difference in behavior (index prevents full scan when "limit #" is used), but how to fix this, I don't know. – lugger1 Jun 05 '18 at 03:09
  • Thank you very much, Erwin. This new version works great! – lugger1 Jun 05 '18 at 16:00