2

I'd like to update a certain column in a table based on the difference in a another column value between neighboring rows in PostgreSQL.

Here is a test setup:

CREATE TABLE test(
   main INTEGER,
   sub_id INTEGER,
   value_t INTEGER);

INSERT INTO test (main, sub_id, value_t)
VALUES
    (1,1,8),
    (1,2,7),
    (1,3,3),
    (1,4,85),
    (1,5,40),
    (2,1,3),
    (2,2,1),
    (2,3,1),
    (2,4,8),
    (2,5,41);

My goal is to determine in each group main starting from sub_id 1 which value in diff exceeds a certain threshold (e.g. <10 or >-10) by checking in ascending order by sub_id. Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.

enter image description here

Should I use a loop or are there smarter solutions?

The task description in pseudocode:

FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
    DO until diff > 10 OR diff <-10
        SET newval = 1 AND LEAD(newval) = 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
DrSnuggles
  • 217
  • 1
  • 14

3 Answers3

1

Your question was hard to comprehend, the "value_t" column was irrelevant to the question, and you forgot to define the "diff" column in your SQL.

Anyhow, here's your solution:

WITH data AS (
  SELECT main, sub_id, value_t
       , abs(value_t
             - lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
  FROM test
)
SELECT main, sub_id, value_t
     , CASE max(is_evil::int)
            OVER (PARTITION BY main ORDER BY sub_id
                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       WHEN 1 THEN NULL ELSE 1 END newval
FROM data;

I'm using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function is used to check if there were any "evil" rows before the current one, per partition.

Jonathan Jacobson
  • 1,441
  • 11
  • 20
1

Basic SELECT

As fast as possible:

SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
   SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
   FROM   test
   ) sub;
Fine points
  • Your thought model evolves around the window function lead(). But its counterpart lag() is a bit more efficient for the purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC).

  • To avoid NULL for the first row in the partition, provide value_t as default as 3rd parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.

  • diff BETWEEN -10 AND 10 is slightly faster than @diff < 11 (clearer and more flexible, too). (@ being the "absolute value" operator, equivalent to the abs() function.)

  • bool_or() or bool_and() in the outer window function is probably cheapest to mark all rows up to the big gap.

Your UPDATE

Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.

Again, as fast as possible.

UPDATE test AS t
SET    newval = 1
FROM  (
   SELECT main, sub_id
        , bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
   FROM (
      SELECT main, sub_id
           , value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
      FROM   test
      ) sub
   ) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND    u.flag;
Fine points
  • Computing all values in a single query is typically substantially faster than a correlated subquery.

  • The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
    If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1 See:

  • SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), that's a bit cheaper.

db<>fiddle here

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

EXISTS on an aggregating subquery:


UPDATE test u
SET value_t = NULL
WHERE EXISTS (
        SELECT * FROM (
                SELECT main,sub_id
                , value_t , ABS(value_t - lag(value_t)
                       OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
                FROM test
                ) x
        WHERE x.main = u.main
        AND x.sub_id <= u.sub_id
        AND x.absdiff >= 10
        )
        ;

SELECT * FROM test
ORDER BY main, sub_id;

Result:


UPDATE 3
 main | sub_id | value_t 
------+--------+---------
    1 |      1 |       8
    1 |      2 |       7
    1 |      3 |       3
    1 |      4 |        
    1 |      5 |        
    2 |      1 |       3
    2 |      2 |       1
    2 |      3 |       1
    2 |      4 |       8
    2 |      5 |        
(10 rows)
wildplasser
  • 43,142
  • 8
  • 66
  • 109