How to efficiently compute in step 1 differences between columns and in step 2 aggregate those differences?

Question

This is a follow-up question to this Is there a way to use functions that can take multiple columns as input (e.g. `GREATEST`) without typing out every column name?, where I asked only about the second part of my problem. The feedback was that the data model is most likely not appropriate.

I was thinking again about the data model but still, have trouble figuring out a good way to do what I want.

The complete problem is as follows:

I got time series data for multiple technical devices with columns like energy_consumption and voltage. Furthermore, I got columns with sensitivities towards multiple external factors for each device which I just added as additional columns (denoted with the cc_ in the example).

There are queries where I want to operate on the raw sensitivities. However, there are also queries for which I need to take first some differences such as cc_a - cc_b and cc_b -cc_c and then compute the max of those differences. The combinations for which the differences are to be computed are a predefined subset (around 30) of all possible combinations. The set of combinations that is of interest might change in the future so that for different time intervals different combinations have to be applied (e.g. from 2022-01-01 to 2024-12-31 take combination set A and from 2025-01-01 to ... take combination set B). However, it is very unlikely that the combination change very often.

Here is an example of how I am doing it at the moment

CREATE TEMP TABLE foo (device_id int, voltage int, energy_consumption int, cc_a int, cc_b int, cc_c int);
INSERT INTO foo VALUES (3, 12, 5, '1', '2', '3'), (4, 6, 3, '15', '4', '100');

WITH diff_table AS (
    SELECT 
        id, 
        (cc_a - cc_b) as diff_ab,
        (cc_a - cc_c) as diff_ac,
        (cc_b - cc_c) as diff_bc
    FROM foo
)

SELECT 
    id, 
    GREATEST(diff_ab, diff_ab, diff_bc) as max_cc
FROM diff_table

Since I got more than 100 sensitivities and also differences I am looking for a way how to do this efficiently, both computationally and in terms of typical query length.

What would be a good data model to perform such operations?

for 3 columns there are 3 non-ordered combinations of 2 elements. for 100 columns the number of combinations is **4950**. Are you sure you want to take this path? P.S. There was another feedback, that paraphs you're not using the right product/service :-) — David דודו Markovitz, Aug 28 '22 at 10:14
As for the question itself, you'll have to clarify your scenario. It seems that not all cc_ columns are equal. Is there a distinct subset of cc_ columns that you compute the difference on? What's the pattern of choosing couples of columns within that subset? — David דודו Markovitz, Aug 28 '22 at 10:34
no this isn't a valid question here maybe codereview. finally such question can't be answered, a dynamic approach with information schema, that generates the query, can be done, but only because there are 100 of columns which indicates a bad design. — nbk, Aug 28 '22 at 10:54
This is a reply for @Alex: I fail to see the reason to use CTE here (unless you reuse the diff expressions). It just forces you to define 2 lists. Put the diff expressions inside the function: `greatest(cc_a - cc_b, cc_a - cc_c, ...)` — David דודו Markovitz, Aug 28 '22 at 10:59
@nbk, given the additional information, it actually seems like a reasonable design. The cc_ columns are not the timeseries metrics - they are an enrichment data for each read. The query doesn't try to manipulate all the cc_ columns, just a subset. Even if the data was unpivoted, it would still require specifically referencing each cc_ column (now cc_ value). — David דודו Markovitz, Aug 28 '22 at 11:05
@DavidדודוMarkovitz he wants to know how to make his cte with hundreds of columns without writing them explicitly. sql is requiered to know before hand what you want to see a dynamic approach as suggested in my coment will work, even with sophisticated flowcontrol. Still hundreds of column never are a good design, you would cluster them in their own tables, so that you only need to select all columns from table a when the information is there or even two or three of them. This would indicate a good design — nbk, Aug 28 '22 at 11:15
He has around 100 cc_ columns. There are around 30 diff calculations. The couples of columns which those calculations are applies on are determined by a business logic outside the DB. There in no way to avoid mention them explicitly. Columnar storage addresses the scenario you describe at the end of your comment. — David דודו Markovitz, Aug 28 '22 at 11:25

Johan Buret · Accepted Answer · 2022-09-02T10:15:52.623

The solution I type below assumes all pairings are considered, and the you don't want the points where these are reached.


CREATE TABLE sources (
 source_id int 
,source_name varchar(10)
,PRIMARY KEY(source_id))

CREATE TABLE foo_values(
 device_id int not null --device_id for "foo"
,source_id int -- you may change that with a foreign key
,value int
,CONSTRAINT fk_source_id
      FOREIGN KEY(source_id ) 
      REFERENCES sources(source_id ) )

With the exemple set you gave

INSERT INTO sources ( source_id, source_name ) VALUES
 (1,'cc_a')
,(2,'cc_b')
,(3,'cc_c')
-- and so on
 

INSERT INTO foo_values (device_id,source_id, value ) VALUES
 (3,1,1),(3,2,2),(3,3,3)
,(4,1,15),(4,2,4),(4,2,100)

doing this way, the query will be

SELECT device_id
     , MAX(value)-MIN(value)  as greatest_diff
FROM foo_values 
group by device_id

Bonus : with such a schema, you can even tell where the maximum and minimum are reached

WITH ranked as (
SELECT 
     f.device_id
    ,f.value
    ,f.source_id
    ,RANK() OVER (PARTITION BY f.device_id ORDER BY f.value ) as low_first
    ,RANK() OVER (PARTITION BY f.device_id ORDER BY f.value DESC) as high_first
 FROM foo_values as f)
 SELECT h.device_id
      , hs.source_name as source_high
      , ls.source_name as source_low
      , h.value as value_high
      , l.value as value_low
      , h.value - l.value as greatest_diff
 FROM ranked l 
 INNER JOIN ranked h 
     on l.device_id = h.device_id
 INNER JOIN sources ls
     on ls.source_id = l.source_id
 INNER JOIN sources hs
     on hs.source_id = h.source_id
 WHERE l.low_first =1 AND h.high_first = 1

Here is a fiddle for this solution.

EDIT : since you need to control the pairings, you must add a table that list them:

CREATE TABLE high_low_source
(high_source_id int
,low_source_id int
, constraint fk_low
     FOREIGN KEY(low_source_id ) 
      REFERENCES sources(source_id )
,constraint fk_high
     FOREIGN KEY(high_source_id ) 
      REFERENCES sources(source_id )
);


INSERT INTO high_low_source VALUES
(1,2),(1,3),(2,3)

The query looking for the greatest difference becomes :

 SELECT h.device_id
      , hs.source_name as source_high
      , ls.source_name as source_low
      , h.value as value_high
      , l.value as value_low
      , h.value - l.value as my_diff
      , RANK() OVER (PARTITION BY h.device_id ORDER BY (h.value - l.value) DESC) as greatest_first
 FROM foo_values l 
 INNER JOIN foo_values h 
     on l.device_id = h.device_id
 INNER JOIN high_low_source hl
     on hl.low_source_id = l.source_id
     AND hl.high_source_id = h.source_id
 INNER JOIN sources ls
     on ls.source_id = l.source_id
 INNER JOIN sources hs
     on hs.source_id = h.source_id
  ORDER BY device_id, greatest_first

With the values you have listed, there is a tie for device 3.

Extended fiddle

Thanks! One step that is missing is that I want to compute the differences only between predefined combinations of what in your table is specified in the column `source_name`. Your query would compute this over all possible combinations if I am not mistaken. — Alex, Sep 02 '22 at 09:19

How to efficiently compute in step 1 differences between columns and in step 2 aggregate those differences?

1 Answers1

Linked