
I have sensor data in a table by timestamp with multiple values in an array. E.g.:

CREATE TABLE test_raw (
  ts timestamp without time zone NOT NULL,
  "values" real[] /* quoted: VALUES is a reserved word */
);
INSERT INTO test_raw VALUES
  ('2020-7-14 00:00:00', ARRAY[1, 10]),
  ('2020-7-14 00:01:00', ARRAY[2, 20, 30]),
  ('2020-7-14 00:20:00', ARRAY[3, NULL, 30, 40]),
  ('2020-7-14 00:23:00', ARRAY[9, NULL, 50, 80]),
  ('2020-7-14 00:10:00', ARRAY[3, 30, 40]),
  ('2020-7-14 00:11:00', ARRAY[3, 30, NULL, 50])
;

The array corresponds to different metrics collected by a device, e.g., values[1] might be temperature, values[2] might be humidity, etc. The full schema has additional columns (e.g. device ID) that indicate what the array contains.
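
For illustration, a simplified version of the full schema might look like the following; the device_id column and the key are only a sketch, not the actual schema:

CREATE TABLE test_raw_full (            /* illustrative only */
  device_id integer NOT NULL,           /* which device produced the readings */
  ts        timestamp without time zone NOT NULL,
  "values"  real[],                     /* one element per metric */
  PRIMARY KEY (device_id, ts)
);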

I'd now like to create an aggregate/rollup table that has, say, the average over 10 minutes. If values were a scalar and not an array, I'd write the following view (which I'd use to populate the rollup table):

CREATE VIEW test_raw_10m AS
SELECT
  floor(extract(epoch FROM ts)/600)*600 as ts,
  AVG(value) /* scalar value! */
FROM test_raw
GROUP BY 1; /* group by the bucketed expression; a bare "ts" here would resolve to the raw column */
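
On PostgreSQL 14 or later the same bucketing can also be written with date_bin(), which keeps the bucket as a timestamp instead of an epoch number; a minimal sketch, still assuming a scalar value column:

SELECT
  date_bin('10 minutes', ts, TIMESTAMP '2000-01-01') AS ts,
  AVG(value) /* scalar value! */
FROM test_raw
GROUP BY 1;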

But it's not so simple with a values array. I saw the answer to a very closely related question ("Pairwise array sum aggregate function?"), which led me to the following query. It works, but it seems overly complicated:

WITH test_raw_10m AS (
  SELECT floor(extract(epoch FROM ts)/600)*600 as ts, "values"
  FROM test_raw
)
SELECT
  t.ts,
  ARRAY( SELECT
    AVG(value) as value
    FROM test_raw_10m tt, UNNEST(tt.values) WITH ORDINALITY x(value, rn)
    WHERE tt.ts = t.ts
    GROUP BY x.rn
    ORDER BY x.rn) AS "values"
FROM test_raw_10m AS t
GROUP BY ts
ORDER BY ts
;

My question: Is there a better way to do this?

For completeness, here's the result given the above sample data:

     ts     |     values
------------+----------------
 1594684800 | {1.5,15,30}
 1594685400 | {3,30,40,50}
 1594686000 | {6,NULL,40,60}
(3 rows)

and here's the query plan:

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Group  (cost=119.37..9490.26 rows=200 width=40)
   Group Key: t.ts
   CTE test_raw_10m
     ->  Seq Scan on test_raw  (cost=0.00..34.00 rows=1200 width=40)
   ->  Sort  (cost=85.37..88.37 rows=1200 width=8)
         Sort Key: t.ts
         ->  CTE Scan on test_raw_10m t  (cost=0.00..24.00 rows=1200 width=8)
   SubPlan 2
     ->  Sort  (cost=46.57..46.82 rows=100 width=16)
           Sort Key: x.rn
           ->  HashAggregate  (cost=42.00..43.25 rows=100 width=16)
                 Group Key: x.rn
                 ->  Nested Loop  (cost=0.00..39.00 rows=600 width=12)
                       ->  CTE Scan on test_raw_10m tt  (cost=0.00..27.00 rows=6 width=32)
                             Filter: (ts = t.ts)
                       ->  Function Scan on unnest x  (cost=0.00..1.00 rows=100 width=12)
  • You can introduce your own aggregate function over arrays with the requested behaviour, but that is probably all. For this case I am not sure whether a custom aggregate can be faster than your query (probably not), but it depends on the data. – Pavel Stehule Jul 15 '20 at 04:01
  • Your query looks good to me. – Laurenz Albe Jul 15 '20 at 05:36
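
Following up on the custom-aggregate suggestion in the first comment, here is a minimal sketch of an element-wise average aggregate. All type and function names (avg_each_state, avg_each_accum, avg_each_final, avg_each) are made up for illustration; NULL elements and arrays of differing lengths are handled by keeping a per-position sum and non-NULL count, and the result comes back as double precision[]:

/* running state: per-position sums and non-NULL counts */
CREATE TYPE avg_each_state AS (sums double precision[], counts bigint[]);

/* transition function: fold one input array into the state,
   padding whichever side is shorter and skipping NULL elements */
CREATE FUNCTION avg_each_accum(s avg_each_state, v real[])
RETURNS avg_each_state LANGUAGE sql AS $$
  SELECT
    ARRAY(SELECT coalesce(s.sums[i], 0) + coalesce(v[i], 0)
          FROM generate_series(1, greatest(coalesce(array_length(s.sums, 1), 0),
                                           coalesce(array_length(v, 1), 0))) AS g(i)),
    ARRAY(SELECT coalesce(s.counts[i], 0) + (v[i] IS NOT NULL)::int
          FROM generate_series(1, greatest(coalesce(array_length(s.counts, 1), 0),
                                           coalesce(array_length(v, 1), 0))) AS g(i))
$$;

/* final function: divide sums by counts, yielding NULL where no values were seen */
CREATE FUNCTION avg_each_final(s avg_each_state)
RETURNS double precision[] LANGUAGE sql AS $$
  SELECT ARRAY(SELECT CASE WHEN s.counts[i] > 0 THEN s.sums[i] / s.counts[i] END
               FROM generate_subscripts(s.sums, 1) AS i)
$$;

CREATE AGGREGATE avg_each(real[]) (
  SFUNC     = avg_each_accum,
  STYPE     = avg_each_state,
  FINALFUNC = avg_each_final
);

/* example use against the sample table */
SELECT floor(extract(epoch FROM ts)/600)*600 AS ts, avg_each("values") AS "values"
FROM test_raw
GROUP BY 1
ORDER BY 1;

Whether this is actually faster than the unnest-based queries would have to be measured, as the comment says.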

1 Answer


The following query is significantly faster on my real dataset when I do partial updates, i.e. when I change FROM test_raw to something like FROM test_raw WHERE ts >= <some timestamp> (in both this query and the one in the question):

SELECT bucket as ts, ARRAY_AGG(v)
FROM (
  SELECT to_timestamp(floor(extract(epoch FROM ts)/600)*600) as bucket, AVG("values"[i]) AS v
  FROM (SELECT ts, generate_subscripts("values", 1) AS i, "values" FROM test_raw) AS foo
  GROUP BY bucket, i
  ORDER BY bucket, i
) bar
GROUP BY bucket;

I believe the ORDER BY bucket, i is not necessary, but I'm not sure.

                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=228027.62..241630.12 rows=200 width=40) (actual time=0.948..1.209 rows=3 loops=1)
   Group Key: (to_timestamp((floor((date_part('epoch'::text, foo.ts) / '600'::double precision)) * '600'::double precision)))
   ->  GroupAggregate  (cost=228027.62..241027.62 rows=40000 width=20) (actual time=0.826..1.099 rows=11 loops=1)
         Group Key: (to_timestamp((floor((date_part('epoch'::text, foo.ts) / '600'::double precision)) * '600'::double precision))), foo.i
         ->  Sort  (cost=228027.62..231027.62 rows=1200000 width=44) (actual time=0.773..0.870 rows=20 loops=1)
               Sort Key: (to_timestamp((floor((date_part('epoch'::text, foo.ts) / '600'::double precision)) * '600'::double precision))), foo.i
               Sort Method: quicksort  Memory: 19kB
               ->  Subquery Scan on foo  (cost=0.00..33031.00 rows=1200000 width=44) (actual time=0.165..0.619 rows=20 loops=1)
                     ->  ProjectSet  (cost=0.00..6031.00 rows=1200000 width=44) (actual time=0.131..0.312 rows=20 loops=1)
                           ->  Seq Scan on test_raw  (cost=0.00..22.00 rows=1200 width=40) (actual time=0.034..0.070 rows=6 loops=1)
 Planning Time: 0.525 ms
 Execution Time: 1.504 ms
(12 rows)
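
Regarding whether the ORDER BY bucket, i is necessary: ARRAY_AGG returns elements in whatever order its input rows happen to arrive, and that order is only guaranteed if an ORDER BY is attached to the aggregate itself. A sketch of the same query with the ordering moved into the aggregate (the inner query then has to expose i):

SELECT bucket AS ts, ARRAY_AGG(v ORDER BY i) AS "values"
FROM (
  SELECT to_timestamp(floor(extract(epoch FROM ts)/600)*600) AS bucket,
         i,
         AVG("values"[i]) AS v
  FROM (SELECT ts, generate_subscripts("values", 1) AS i, "values" FROM test_raw) AS foo
  GROUP BY bucket, i
) bar
GROUP BY bucket
ORDER BY bucket;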