How to identify rows per group before a certain value gap?

Question

I'd like to update a certain column in a table based on the difference in a another column value between neighboring rows in PostgreSQL.

Here is a test setup:

CREATE TABLE test(
   main INTEGER,
   sub_id INTEGER,
   value_t INTEGER);

INSERT INTO test (main, sub_id, value_t)
VALUES
    (1,1,8),
    (1,2,7),
    (1,3,3),
    (1,4,85),
    (1,5,40),
    (2,1,3),
    (2,2,1),
    (2,3,1),
    (2,4,8),
    (2,5,41);

My goal is to determine in each group main starting from sub_id 1 which value in diff exceeds a certain threshold (e.g. <10 or >-10) by checking in ascending order by sub_id. Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.

Should I use a loop or are there smarter solutions?

The task description in pseudocode:

FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
    DO until diff > 10 OR diff <-10
        SET newval = 1 AND LEAD(newval) = 1

`After locating this value I want to overwrite all following numbers` <<-- use EXISTS(...) — wildplasser, Nov 23 '20 at 10:40
Your code examples logic does not match the description and expected outcome above. Please clarify this. — clamp, Nov 23 '20 at 12:12

Jonathan Jacobson · Answer 1 · 2020-11-23T20:33:12.387

1

Your question was hard to comprehend, the "value_t" column was irrelevant to the question, and you forgot to define the "diff" column in your SQL.

Anyhow, here's your solution:

WITH data AS (
  SELECT main, sub_id, value_t
       , abs(value_t
             - lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
  FROM test
)
SELECT main, sub_id, value_t
     , CASE max(is_evil::int)
            OVER (PARTITION BY main ORDER BY sub_id
                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       WHEN 1 THEN NULL ELSE 1 END newval
FROM data;

I'm using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function is used to check if there were any "evil" rows before the current one, per partition.

edited Nov 23 '20 at 20:33

answered Nov 23 '20 at 13:03

Jonathan Jacobson

1,441
11
20

`diff` is not in the original data. `abs(value_t - lead(value_t,1,0) OVER (ORDER BY main,sub_id))>10 AS is_evil` – clamp Nov 23 '20 at 16:01
@clamp Thanks for the clarification - updated the query according to what you wrote. – Jonathan Jacobson Nov 23 '20 at 20:33

score 1 · Accepted Answer · answered Nov 25 '20 at 02:39

Basic `SELECT`

As fast as possible:

SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
   SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
   FROM   test
   ) sub;

Fine points

Your thought model evolves around the window function lead(). But its counterpart lag() is a bit more efficient for the purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC).
To avoid NULL for the first row in the partition, provide value_t as default as 3rd parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.
diff BETWEEN -10 AND 10 is slightly faster than @diff < 11 (clearer and more flexible, too). (@ being the "absolute value" operator, equivalent to the abs() function.)
bool_or() or bool_and() in the outer window function is probably cheapest to mark all rows up to the big gap.

Your `UPDATE`

Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.

Again, as fast as possible.

UPDATE test AS t
SET    newval = 1
FROM  (
   SELECT main, sub_id
        , bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
   FROM (
      SELECT main, sub_id
           , value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
      FROM   test
      ) sub
   ) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND    u.flag;

Fine points

Computing all values in a single query is typically substantially faster than a correlated subquery.
The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1 See:
- How do I (or can I) SELECT DISTINCT on multiple columns?
SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), that's a bit cheaper.

db<>fiddle here

score 0 · Answer 3 · answered Nov 23 '20 at 10:53

EXISTS on an aggregating subquery:

UPDATE test u
SET value_t = NULL
WHERE EXISTS (
        SELECT * FROM (
                SELECT main,sub_id
                , value_t , ABS(value_t - lag(value_t)
                       OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
                FROM test
                ) x
        WHERE x.main = u.main
        AND x.sub_id <= u.sub_id
        AND x.absdiff >= 10
        )
        ;

SELECT * FROM test
ORDER BY main, sub_id;

Result:

UPDATE 3
 main | sub_id | value_t 
------+--------+---------
    1 |      1 |       8
    1 |      2 |       7
    1 |      3 |       3
    1 |      4 |        
    1 |      5 |        
    2 |      1 |       3
    2 |      2 |       1
    2 |      3 |       1
    2 |      4 |       8
    2 |      5 |        
(10 rows)

How to identify rows per group before a certain value gap?

3 Answers3

Basic SELECT

Your UPDATE

Basic `SELECT`

Your `UPDATE`