
I have a lot of measurements in a Postgres database table, and I need to split this set into groups: a new group should start whenever a value gets too far away (more than some threshold) from the "starting" point of the current group. Sort order is determined by the id column.

Example: splitting with threshold = 1:

id measurements
---------------
1  1.5
2  1.4
3  1.8
4  2.6
5  3.7
6  3.5
7  3.0
8  2.6
9  2.5
10 2.8

It should be split into groups as follows:

id measurements group
---------------------
1  1.5            0     --- start new group 
2  1.4            0
3  1.8            0

4  2.6            1     --- start new group because it is too far from 1.5

5  3.7            2     --- start new group because it is too far from 2.6
6  3.5            2
7  3.0            2

8  2.6            3     --- start new group because it is too far from 3.7
9  2.5            3
10 2.8            3
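
A minimal setup for the example data (names are illustrative):

CREATE TABLE tbl (id int PRIMARY KEY, measurement numeric);
INSERT INTO tbl (id, measurement) VALUES
  (1, 1.5), (2, 1.4), (3, 1.8), (4, 2.6), (5, 3.7),
  (6, 3.5), (7, 3.0), (8, 2.6), (9, 2.5), (10, 2.8);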

I can do this by writing a function using LOOP, but I'm looking for a more efficient way. Performance is very important as the actual table contains millions of rows.

Is it possible to achieve the goal by using PARTITION OVER, CTE or any other kind of SELECT?

  • Look up "islands and gaps" problems. – xQbert Oct 25 '19 at 16:08
  • You seem to be assuming that there is an ordering to the table. SQL tables represent *unordered* sets. You also need to define "too far". – Gordon Linoff Oct 25 '19 at 16:49
  • @xQbert Maybe it seems similar, but gaps measure the distance from the previous value. That is not my case. – kayman Oct 25 '19 at 16:53
  • @Gordon Linoff Yes, I mean this set is ordered: by id, or timestamp, whatever. About "too far": as I said, in my example the threshold is 1, so a new group starts when a value differs from the starting value of the current group by more than 1. – kayman Oct 25 '19 at 16:56

3 Answers


You seem to be starting a group when the difference between consecutive rows exceeds 0.5. If I assume you have an ordering column, you can use lag() and a cumulative sum to get your groups:

select t.*,
       count(*) filter (where prev_value < value - 0.5) over (order by <ordering col>) as grouping
from (select t.*,
             lag(value) over (order by <ordering col>) as prev_value
      from t
     ) t;
  • No. Starting a group does not depend on the previous value. A new group starts when the current value differs from the starting point of the current group by more than the threshold. In my example the threshold is 1. – kayman Oct 25 '19 at 17:06

One way to attack this problem is with a recursive CTE. This example is written in SQL Server syntax (because I don't work with Postgres). It should be straightforward to translate, however; a Postgres sketch follows below.

--  Table #Test: 
--  sequenceno  measurements
--  ----------- ------------
--  1           1.5
--  2           1.4
--  3           1.8
--  4           2.6
--  5           3.7
--  6           3.5
--  7           3.0
--  8           2.6
--  9           2.5
--  10          2.8

WITH datapoints
AS
(
    SELECT  sequenceno,
            measurements,
            startmeasurement    = measurements,
            groupno             = 0
    FROM    #Test
    WHERE   sequenceno = 1

    UNION ALL

    SELECT  sequenceno          = A.sequenceno + 1,
            measurements        = B.measurements,
            startmeasurement    = 
                CASE 
                WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN B.measurements
                ELSE A.startmeasurement
                END,
            groupno             = 
                A.groupno + 
                CASE 
                WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN 1
                ELSE 0
                END
    FROM    datapoints as A
            INNER JOIN #Test as B
                ON A.sequenceno  + 1 = B.sequenceno
) 
SELECT  sequenceno,
        measurements,
        groupno
FROM    datapoints
ORDER BY
        sequenceno

--  Output:
--  sequenceno  measurements    groupno
--  ----------- --------------- -------
--  1           1.5             0
--  2           1.4             0
--  3           1.8             0
--  4           2.6             1
--  5           3.7             2
--  6           3.5             2
--  7           3.0             2
--  8           2.6             3
--  9           2.5             3
--  10          2.8             3

Note that I added a "sequenceno" column in the starting table because relational tables are considered to be unordered sets. Also, if the number of input values is too large (over 90-100), you may have to adjust the MAXRECURSION value (at least in SQL Server).
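
In SQL Server, that cap can be raised or removed with a query hint on the statement that consumes the CTE; for example, the final SELECT could be written as follows to lift the default 100-level limit:

SELECT  sequenceno,
        measurements,
        groupno
FROM    datapoints
ORDER BY
        sequenceno
OPTION  (MAXRECURSION 0)    -- 0 = no recursion limit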

Additional note: Just noticed that the original question mentions that there are millions of records in the input data sets. The CTE approach would only work if that data could be broken up into manageable chunks.
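
For what it's worth, a rough Postgres translation of the above (an untested sketch, assuming the data lives in a permanent table test(sequenceno, measurements)):

WITH RECURSIVE datapoints AS (
    SELECT  sequenceno,
            measurements,
            measurements AS startmeasurement,
            0 AS groupno
    FROM    test
    WHERE   sequenceno = 1

    UNION ALL

    SELECT  b.sequenceno,
            b.measurements,
            CASE WHEN abs(b.measurements - a.startmeasurement) >= 1
                 THEN b.measurements ELSE a.startmeasurement END,
            a.groupno + CASE WHEN abs(b.measurements - a.startmeasurement) >= 1
                             THEN 1 ELSE 0 END
    FROM    datapoints a
    JOIN    test b ON b.sequenceno = a.sequenceno + 1
)
SELECT  sequenceno, measurements, groupno
FROM    datapoints
ORDER BY sequenceno;

Postgres has no MAXRECURSION setting; the recursion simply stops when the join finds no next row.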

  • Thanks for your answer! Yes, I probably can divide the whole set into "manageable" chunks of 100-120 records. But how can I manage all the chunks without writing a LOOP? Let's assume I have a column **"chunkno"**. How does that help me use your approach? – kayman Oct 25 '19 at 19:33
  • I would imagine that you would still have to loop through the "manageable chunks". But looping through a series of big set-based operations is not always a bad solution. For instance, I often do that when updating or deleting from large tables, to allow for smaller implicit transactions: when operating on 2 million records from a very large table, I will use a loop that works on sets of a few thousand per iteration. That way, if something fails, the implicit rollback happens quickly, on just the set that failed. – Thomas Phaneuf Oct 26 '19 at 10:00
  • Generally, I agree with your point. But if I have to LOOP through the chunks, it means I don't need chunks at all; I can handle what I need in a single LOOP through the records. – kayman Oct 26 '19 at 10:10

Is it possible to achieve the goal by using PARTITION OVER, CTE or any other kind of SELECT?

This is an inherently procedural problem. Depending on where you start, all later rows can end up in a different group and / or with a different group value. Window functions (using the PARTITION clause) are no good for this.

You can use a recursive CTE:

WITH RECURSIVE rcte AS (
   (
   SELECT id
        , measurement
        , measurement - 1 AS grp_min
        , measurement + 1 AS grp_max
        , 1 AS grp
   FROM   tbl
   ORDER  BY id
   LIMIT  1
   )

   UNION ALL
   (
   SELECT t.id
        , t.measurement
        , CASE WHEN t.same_grp THEN r.grp_min ELSE t.measurement - 1 END  -- AS grp_min 
        , CASE WHEN t.same_grp THEN r.grp_max ELSE t.measurement + 1 END  -- AS grp_max
        , CASE WHEN t.same_grp THEN r.grp     ELSE r.grp + 1         END  -- AS grp
   FROM   rcte r 
   CROSS  JOIN LATERAL (
      SELECT *, t.measurement BETWEEN r.grp_min AND r.grp_max AS same_grp
      FROM   tbl t
      WHERE  t.id > r.id
      ORDER  BY t.id
      LIMIT  1
      ) t
   )
   )
SELECT id, measurement, grp
FROM   rcte;
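
With the sample data from the question, this should yield (note that grp starts at 1 here, not 0):

id  measurement  grp
--  -----------  ---
1   1.5          1
2   1.4          1
3   1.8          1
4   2.6          2
5   3.7          3
6   3.5          3
7   3.0          3
8   2.6          4
9   2.5          4
10  2.8          4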

It's elegant. And decently fast. But only about as fast as - or even slower than - a procedural language function with a single loop over the set - when implemented efficiently:

CREATE OR REPLACE FUNCTION f_measurement_groups(_threshold numeric = 1)
  RETURNS TABLE (id int, grp int, measurement numeric) AS
$func$
DECLARE
   _grp_min numeric;
   _grp_max numeric;
BEGIN
   grp := 0;  -- init

   FOR id, measurement IN
      SELECT * FROM tbl t ORDER BY t.id
   LOOP
      IF measurement BETWEEN _grp_min AND _grp_max THEN
         RETURN NEXT;
      ELSE
         SELECT INTO grp    , _grp_min                , _grp_max
                     grp + 1, measurement - _threshold, measurement + _threshold;
         RETURN NEXT;
      END IF;
   END LOOP;
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_measurement_groups();  -- optionally supply different threshold
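
For example, with a tighter threshold (0.5 is just an illustration):

SELECT * FROM f_measurement_groups(0.5);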

db<>fiddle here

My money is on the procedural function.
Typically, set-based solutions are faster. But not when solving an inherently procedural problem.
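
Either way, both queries rely on fast ordered access by id (the recursive CTE fetches exactly one next row per step), so make sure it is indexed; assuming the table from the question:

CREATE INDEX tbl_id_idx ON tbl (id);  -- unnecessary if id is already the PRIMARY KEY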


  • I've checked both methods on a test table with 4 million records. The recursive CTE was extremely slow (20 minutes of waiting; I cancelled the query) while the function took 18 seconds! That means I'll stick with the function. Thanks for your efforts. – kayman Oct 26 '19 at 10:15