Aggregate rows to get unique arrays without subsets

Question

Initial data (actual table contains more than 2,000,000 rows):

+--------+--------+-------+
| note   | factor | label |
+--------+--------+-------+
| note_1 | 1      | 2     |
+--------+--------+-------+
| note_1 | 1      | 3     |
+--------+--------+-------+
| note_1 | 2      | 4     |
+--------+--------+-------+
| note_2 | 123    | 2     |
+--------+--------+-------+
| note_2 | 123    | 3     |
+--------+--------+-------+
| note_2 | 2      | 4     |
+--------+--------+-------+
| note_3 | 456    | 4     |
+--------+--------+-------+
| note_4 | 434    | 5     |
+--------+--------+-------+
| note_5 | 456    | 3     |
+--------+--------+-------+
| note_5 | 456    | 4     |
+--------+--------+-------+

What I want to get (further final table):

+----+-----------------+
| id | notes           |
+----+-----------------+
| 1  | {note_1,note_2} |
+----+-----------------+
| 2  | {note_4}        |
+----+-----------------+
| 3  | {note_3,note_5} |
+----+-----------------+

More clearly:

I need to group notes by factor and label columns. Note can be in the result table only once. Result table should contains two columns: id - row number, notes - array of notes.

I have written a query to group by factor and label:

select row_number() over (order by factor) as id
     , array_agg(note order by note) as notes
from test_brand
group by factor, label

It gives these results:

+---+-----------------+
| 1 | {note_1}        |
+---+-----------------+
| 2 | {note_1}        |
+---+-----------------+
| 3 | {note_2}        |
+---+-----------------+
| 4 | {note_2}        |
+---+-----------------+
| 5 | {note_1,note_2} |
+---+-----------------+
| 6 | {note_4}        |
+---+-----------------+
| 7 | {note_5}        |
+---+-----------------+
| 8 | {note_3,note_5} |
+---+-----------------+

But I do not know how to get the final table proceeding from here.

If we omit identifiers and return to ordinary numbers, then this task looks like a union of sets (which in fact it is).
Let's say we have 8 sets: {1}, {1}, {2}, {2}, {1,2}, {4}, {5}, {3,5}. We need to get three sets: {1,2}, {4}, {3,5}.

How it happened in my opinion:
Sets {1}, {1}, {2}, {2}, {1,2} merged into one set {1,2}, because there is intersection between {1} and {2} with {1,2}.
Sets {3,5}, {5} merged into one set {3,5}, because there is intersection between {5} and {3,5}.
Set {4} does not intersect with anyone, so it remains as it is.

Erwin Brandstetter · Answer 1 · 2021-04-16T04:12:30.927

2

There may be more efficient ways, but this does it:

WITH cte AS (
   SELECT min(rn) AS rn, notes  -- to remove dupes cheaply
   FROM  (
      SELECT row_number() OVER (ORDER BY factor, label) AS rn  -- ORDER BY factor, label?!
           , array_agg(note ORDER BY note) AS notes
      FROM   test_brand
      GROUP  BY factor, label
      ) sub
   GROUP  BY notes
   )
SELECT row_number() OVER (ORDER BY rn) AS rn, notes
FROM   cte c
WHERE  NOT EXISTS (
   SELECT FROM cte c1
   WHERE c1.notes @> c.notes
   AND   c1.rn <> c.rn
   )
ORDER  BY 1;

db<>fiddle here

After your initial query, remove duplicates in the CTE and remember minimum row number.

In the final SELECT, eliminate all rows where the set is contained in another set (except itself). Compact row numbers with another instance of row_number().
Voilá.

Optimize performance

more than 2,000,000 rows.

If note can be an integer instead of the string type, computation will be substantially faster, even more so after installing the additional module intarray, which provides a faster implementation of @> operator for integer arrays.

If the derived table from the CTE is still big, it might pay to create a temporary table, add an index (and ANALYZE!), and run the outer SELECT based on that temp table:


CREATE TEMP TABLE tmp AS (
   SELECT min(rn) AS rn, notes  -- to remove dupes cheaply
   FROM  (
      SELECT row_number() OVER (ORDER BY factor, label) AS rn
           , array_agg(note ORDER BY note) AS notes
      FROM   test_brand
      GROUP  BY factor, label
      ) sub
   GROUP  BY notes
   );

CREATE INDEX ON tmp USING gin (notes gin__int_ops);
ANALYZE tmp;

SELECT row_number() OVER (ORDER BY rn) AS rn, notes
FROM   tmp c
WHERE  NOT EXISTS (
   SELECT FROM tmp c1
   WHERE c1.notes @> c.notes
   AND   c1.rn <> c.rn
   )
ORDER  BY 1;

See:

edited Apr 16 '21 at 04:12

answered Apr 16 '21 at 03:37

Erwin Brandstetter

605,456
145
1,078
1,228

thanks for the answer! I'm not sure, but something went wrong. You can see run example here: [fiddle](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=f8ca7e2ee429e346dfcde8360fd72215) I just add two new rows: `note_7, 123, 2` and `note_8, 656, 8`. Correct result will be: {note_1,note_2,note_7} {note_4} {note_3,note_5} {note_8} But running the script gave the following result: {note_1,note_2} {note_2,note_7} {note_4} {note_3,note_5} {note_8} – Moon Apr 16 '21 at 11:32
@ErwinBrandstetter . . . Somehow your starting an answer with "there may be more efficient ways" is cognitively dissonant. Almost by definition, I expect your answers to be the most efficient way of doing something in Postgres. – Gordon Linoff Apr 16 '21 at 12:24
@Moon But `note_1` and `note_7` do not share `(factor, label)`? – Erwin Brandstetter Apr 16 '21 at 13:49
@Gordon: Wasn't so sure of it this time. `@>` is an expensive operation without index. The added bit about optimization made it better for me. Still, I feel like there may be more potential to speed it up. (And, thank you.) – Erwin Brandstetter Apr 16 '21 at 13:57
@ErwinBrandstetter, yes, they are not. But `note_1` share `(factor, label)` with `note_2`, and `note_2` share `(factor, label)` with `note_7`. And than we need to group them all together. In other words: {1}, {1,2}, {2,7} should create set of {1,2,7}. – Moon Apr 16 '21 at 14:22
@Moon: Well, that's not obvious from your question. Then you need to do more. I am out of time ATM. – Erwin Brandstetter Apr 16 '21 at 14:43
@ErwinBrandstetter, I'll add more rows to the initial data. I will try my best to find something that can help solve the current problem, but if you have any ideas later, I will be glad to see them. Thank you! – Moon Apr 16 '21 at 14:49
I think I answered your original question as given. Now you *changed* it into a different (much harder) question. Changing the nature of a question after answers have been given is not the way to go. Start a new question instead please. You can always reference this one for context, and drop a comment here to link forward. I've rolled back your last edit. You can find everything in the edit history. – Erwin Brandstetter Apr 16 '21 at 23:51

Aggregate rows to get unique arrays without subsets

1 Answers1

Optimize performance