
In Postgres 9.2, I'm trying to group consecutive rows. They must have at least one non-null match, and no non-null mismatch. If all values are null then don't group together. Null can be thought of as a wildcard.

(screenshot: table data)
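
The screenshot isn't reproduced here, so below is a minimal sketch of what the data might have looked like, inferred from the relationships described under the expected result. The table and column names follow the answers' queries; the concrete values are assumptions, only the sharing pattern matters:

CREATE TABLE test_table (
    id      int PRIMARY KEY,
    column1 text,
    column2 text,
    column3 text
);

-- hypothetical values chosen to reproduce the described matches
INSERT INTO test_table VALUES
    (1, 'A1', NULL, NULL),   -- matches nothing else
    (2, 'A2', NULL, NULL),   -- shares column1 with 4
    (3, NULL, NULL, NULL),   -- all NULL: never grouped
    (4, 'A2', 'A3', 'A4'),
    (5, NULL, NULL, 'A4'),   -- shares column3 with 4
    (6, 'A2', 'A3', NULL),   -- shares column1 and column2 with 4
    (7, 'A7', 'A3', NULL);   -- shares column2 with 4 and 6, but column1 differs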

This is the expected result:
2, 4, 5 and 6 get grouped together: 2 and 4 share column1 (3 is all-null, so it's skipped over), 4 and 5 share column3, and 4 and 6 share column1 and column2.

(screenshot: desired result)
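
With the assumed sample data above, the desired result would then be (one row per group, non-null values aggregated per column):

ids          column1  column2  column3
{1}          A1       null     null
{2,4,5,6}    A2       A3       A4
{3}          null     null     null
{7}          A7       A3       null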

Here's the SQL fiddle.

Meow
  • id 7 shares column2 with 6 and 4 right? So it should be 2,4,5,6,7 together? – Joakim Danielson Aug 09 '18 at 08:42
  • There could be cycles, like rows 1 and 2 share column 1, rows 2 and 3 share column 2, rows 3 and 4 share column 3, but rows 1 and 4 have different values in column 1. How to group in this case? – Laurenz Albe Aug 09 '18 at 08:48
  • @JoakimDanielson No, because id=7 has a different column1 value. Groups must have at least 1 value in common and the rest can be NULL. Values can't be different otherwise. – Meow Aug 09 '18 at 08:50
  • 1 value in common between the members, not across the whole group if that makes sense. – Meow Aug 09 '18 at 09:05
  • @LaurenzAlbe There could be some iteration involved. Null columns count as a wildcard (i.e, value in common) only when there's at least one non-null column value in common, if that makes sense. If all column values are null then it's not a match. – Meow Aug 09 '18 at 09:17
  • Is the number of columns flexible or fixed to 3? – S-Man Aug 09 '18 at 09:30
  • @S-Man Let's say fixed. There are more columns but I'm only interested in these 3. – Meow Aug 09 '18 at 09:31
  • If I add `8, A2, A3, A8`, what makes us pick row 4 for 2 and 6 to be grouped with? – Damien_The_Unbeliever Aug 09 '18 at 09:51
  • @Damien_The_Unbeliever I'm trying to convert Java code into a faster SQL version. The Java code simply iterated through the result and built groups as follows: `foreach(result) if(groups.contain(result)) group.updateUsing(result) else groups.add(result)` so 2 and 6 get grouped with 4 because it was simply first in the list. When it gets to 8 then it would make a new group. – Meow Aug 10 '18 at 08:43

3 Answers


For a fixed set of three columns, this could be a possible solution.

http://sqlfiddle.com/#!17/45dc7/137

Disclaimer: This will not work if the same value can occur in different columns. E.g. one row (42, NULL, "A42", NULL) and another row (23, "A42", NULL, NULL) would lead to unwanted results. The fix for that is to concatenate a column identifier (with a unique delimiter) onto each value and strip it off again after the operation by splitting the string.
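
A minimal sketch of that workaround, assuming text columns: tag each value with its column name before building the array. Concatenating onto NULL yields NULL, so array_remove still strips the empty slots, and the tag can be split off again after grouping:

SELECT *,
    array_remove(ARRAY['c1:' || column1,
                       'c2:' || column2,
                       'c3:' || column3], null) as arr
FROM test_table;
-- recover a raw value later with e.g. split_part(elem, ':', 2);
-- use a delimiter that cannot occur in the data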

WITH test_table as (
    SELECT *,
           array_remove(ARRAY[column1, column2, column3], null) as arr,                 -- A
           cardinality(array_remove(ARRAY[column1, column2, column3], null)) as arr_len
    FROM test_table
)

SELECT 
    s.array_agg as aggregates,                                 -- G
    MAX(tt.column1) as column1, 
    MAX(tt.column2) as column2, 
    MAX(tt.column3) as column3
FROM (

    SELECT array_agg(id) FROM                                  -- E
        (SELECT DISTINCT ON (t1.id)
        t1.id, CASE WHEN t1.arr_len >= t2.arr_len THEN t1.arr ELSE t2.arr END as arr  -- C
        FROM 
        test_table as t1 
        JOIN                                                   -- B
        test_table as t2
        ON t1.arr @> t2.arr AND COALESCE(t2.column1, t2.column2, t2.column3) IS NOT NULL
        OR t2.arr @> t1.arr AND COALESCE(t1.column1, t1.column2, t1.column3) IS NOT NULL

        ORDER BY t1.id, GREATEST(t1.arr_len, t2.arr_len) DESC -- D
        ) s
    GROUP BY arr  

    UNION

    SELECT 
        ARRAY[id] 
    FROM test_table tt 
    WHERE COALESCE(tt.column1, tt.column2, tt.column3) IS NULL) s -- F

JOIN test_table tt ON tt.id = ANY (s.array_agg)
GROUP BY s.array_agg

A: Aggregate the column values into an array, removing the NULL values. The reason is that I check for subsets later, which would not work with NULLs. This is also the point where you would add the column identifier mentioned in the disclaimer above.
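
For illustration, the assumed row (5, NULL, NULL, 'A4') from above would produce:

SELECT array_remove(ARRAY[NULL, NULL, 'A4'], null) as arr,           -- {A4}
       cardinality(array_remove(ARRAY[NULL, NULL, 'A4'], null));     -- 1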

B: Join the table against itself and check whether one column aggregate is a subset of the other. Rows with only NULL values are ignored (that is what the COALESCE checks do).
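
The @> operator tests array containment; with the assumed sample values, row 5's array is a subset of row 4's, while an unrelated value is not:

SELECT ARRAY['A2','A3','A4'] @> ARRAY['A4'];  -- true
SELECT ARRAY['A2','A3','A4'] @> ARRAY['A9'];  -- false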

C: Take the longer of the two column arrays, either from the first or from the second table, for each id.

D: The ORDER BY (longest array first) together with DISTINCT ON ensures that only the longest array is kept per id.

E: At this point many ids share the same column array set. These array sets are used to aggregate the ids: here the ids of each group are collected together.

F: Add the all-NULL rows as single-member groups.

G: One last JOIN back against the original table: the rows belonging to each id aggregation from (E) are picked up, and the MAX value per column is taken for each group.

Edit: Fiddle for PostgreSQL 9.3 (array_length instead of the cardinality function), with the test case (8, 'A2', 'A3', 'A8') added:

http://sqlfiddle.com/#!15/8800d/2

S-Man
  • Whoa, thanks. It also works for the test case added by Damien `8,A2,A3,A8`. Will adapt this later to my actual code and see how it works. – Meow Aug 10 '18 at 18:56
  • @Meow added Damien's test case and decreased the Postgres version in the fiddle to match the 9.2 you stated in your comment – S-Man Aug 10 '18 at 19:04

Another idea came to my mind which could be more dynamic with regard to the number of columns. It's just an idea and I do not really know if it works, but it's worth a try.

Maybe you could pivot your table so that your columns become your rows:

https://www.postgresql.org/docs/9.1/static/tablefunc.html

http://www.vertabelo.com/blog/technical-articles/creating-pivot-tables-in-postgresql-using-the-crosstab-function

After that it should be easy to do the grouping, or you could partition over the column content with a window function; a rough sketch follows below.
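
A rough sketch of that unpivot step which stays 9.2-compatible (no crosstab extension needed), using the assumed table from the question: two parallel unnest calls of equal length zip the columns into rows:

SELECT id,
       unnest(ARRAY['column1','column2','column3']) AS col_name,
       unnest(ARRAY[column1, column2, column3])     AS col_value
FROM test_table;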

Just a sketch, could try it later.

S-Man

SQL is a powerful declarative language (4GL) - well, mostly. A declarative (set-based) solution is typically fastest.

But some workloads are very "procedural" by definition and hard to implement in set-based SQL. This is one of those rare cases: a procedural solution can make do with a single sequential scan and should beat equivalent pure SQL solutions by a long shot:

CREATE OR REPLACE FUNCTION f_my_grouping()
  RETURNS SETOF int[] AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
   ids int[];
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.id
   LOOP
      IF (r.column1, r.column2, r.column3) IS NULL THEN     -- all NULL
         RETURN NEXT ARRAY[r.id];  -- return and ignore

      ELSIF (r.column1 <> r0.column1 OR                     -- continue
             r.column2 <> r0.column2 OR
             r.column3 <> r0.column3) IS NOT TRUE  -- no mismatch
        AND (r.column1 =  r0.column1 OR
             r.column2 =  r0.column2 OR
             r.column3 =  r0.column3) THEN         -- 1+ match

         ids := ids || r.id;     -- add to array

         IF r0.column1 IS NULL AND r.column1 IS NOT NULL OR
            r0.column2 IS NULL AND r.column2 IS NOT NULL OR
            r0.column3 IS NULL AND r.column3 IS NOT NULL THEN

            SELECT INTO r0.column1, r0.column2, r0.column3 
                   COALESCE(r0.column1, r.column1)
                 , COALESCE(r0.column2, r.column2)
                 , COALESCE(r0.column3, r.column3);
         END IF;

      ELSE                                                  -- new grp
         IF r0 IS NULL THEN      -- skip 1st row
            -- do nothing
         ELSE
            RETURN NEXT ids;
         END IF;
         ids := ARRAY[r.id];     -- start new array
         r0  := r;               -- remember last row
      END IF;
   END LOOP;

   IF ids IS NOT NULL THEN  -- ids stays NULL if the table was empty or held only all-NULL rows
      RETURN NEXT ids;      -- output last group
   END IF;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_my_grouping();

If you require sorted output:

SELECT * FROM f_my_grouping() ORDER BY 1;

db<>fiddle here (running Postgres 9.4)

Compare performance with EXPLAIN ANALYZE.
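
For example, against the function defined above:

EXPLAIN ANALYZE SELECT * FROM f_my_grouping();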


Erwin Brandstetter
  • This works just as well (even when adding `8, 'A2', 'A3', 'A8'`) but it's more readable, thanks! – Meow Aug 11 '18 at 11:16
  • @Meow: Do you happen to have a big table to test and compare performance? I would be interested in actual times with `EXPLAIN ANALYZE` - as compared to S-Man's pure-SQL solution. – Erwin Brandstetter Aug 12 '18 at 01:43