
I have a table of the following format:

Table "my_table"
       Column          |            Type             | Collation | Nullable |
-----------------------+-----------------------------+-----------+----------+
 foreign_key_column_id | integer                     |           | not null |
 column1_value_1       | character varying           |           | not null |
 column1_value_2       | character varying           |           |          |
 column1_value_3       | character varying           |           |          |

There can be multiple rows with the same foreign_key_column_id. My goal is to merge all rows with the same foreign_key_column_id into the following format:

foreign_key_column_id
column1::jsonb

So that the resulting array is a combination of the distinct non-null values of the three columns. (I will then insert the resulting dataset into a new table.)

I have achieved it with the following query, but I am sure it is resource-inefficient (too many nested function calls), and since the table I will run it on has close to 100 million rows, I have to care about performance.

SELECT
  foreign_key_column_id,
  array_to_json(
    array_remove(
      array(
        SELECT DISTINCT unnest(
          array_agg(
            array[
              column1_value_1,
              column1_value_2,
              column1_value_3
            ]
          )
        )
      ),
      NULL
    )
  ) AS column1
FROM my_table
GROUP BY foreign_key_column_id;

My actual, complete situation:

-- initial table
TABLE public.my_table (
  id integer NOT NULL,
  foreign_key_column_id integer NOT NULL,
  column1_value_1 character varying NOT NULL,
  column1_value_2 character varying,
  column1_value_3 character varying,
  column2_value_1 character varying NOT NULL,
  column2_value_2 character varying,
  column2_value_3 character varying,
  column3_value_1 character varying(90) NOT NULL,
  column3_value_2 character varying(90),
  column3_value_3 character varying(90)
)

-- desired table into which I will insert the aggregated select

TABLE public.my_new_table (
  id bigint NOT NULL,
  foreign_key_column_id bigint NOT NULL,
  headlines jsonb NOT NULL,
  descriptions jsonb NOT NULL
);

-- sample data
-[ RECORD 1 ]-------+--
id                    | 111
foreign_key_column_id | 12
column1_value_1       | Foo
column1_value_2       |
column1_value_3       |
column2_value_1       | Foo %{1}
column2_value_2       | 
column2_value_3       | Foo %{3}
column3_value_1       | Foo : Bar 
column3_value_2       |
column3_value_3       |
-[ RECORD 2 ]-------+--
id                    | 222
foreign_key_column_id | 12
column1_value_1       | Bar %{1}
column1_value_3       | 
column1_value_2       | Bar %{2}
column2_value_1       | Bar
column2_value_2       |
column2_value_3       |
column3_value_1       | Qux
column3_value_2       |
column3_value_3       |
-[ RECORD 3 ]-------+--
id                    | 33
foreign_key_column_id | 1234
column1_value_1       | Foo
column1_value_2       |
column1_value_3       |
column2_value_1       | Baz %{1}
column2_value_3       | 
column2_value_2       | Baz %{2}
column3_value_1       | 
column3_value_2       | Hello world
column3_value_3       |

-- expected output
-[ RECORD 1 ]-------+--
id                    | 1
foreign_key_column_id | 12
column1               | ["Foo", "Bar %{1}", "Bar %{2}"]
column2               | ["Foo %{1}", "Foo %{3}", "Bar"]
column3               | ["Foo : Bar ", "Qux"]

-[ RECORD 2 ]-------+--
id                    | 2
foreign_key_column_id | 1234
column1               | ["Foo"]
column2               | ["Baz %{1}", "Baz %{2}"]
column3               | ["Hello world"]
  • If you want all distinct values of the three columns for each foreign key, you will have to process 300 million values at one point or the other. So I don't think there is a more efficient way. –  Apr 12 '22 at 12:25
  • @a_horse_with_no_name I've added sample data and output, thanks for looking into it – Andrey Deineko Apr 12 '22 at 12:37
  • @a_horse_with_no_name also, I was trying to understand whether for my new table a jsonb is the proper datastructure to use. I need to store a limited amount of string entries in array, so need array column. – Andrey Deineko Apr 12 '22 at 12:44
  • 1
    I am not sure, but it could be that `text[]` is a bit more efficient than `jsonb` –  Apr 12 '22 at 12:45
  • Consider the fiddle in my answer: That's a much more forthcoming way to provide test data. – Erwin Brandstetter Apr 12 '22 at 17:21

1 Answer

I believe this to be substantially faster: a LATERAL JOIN to a VALUES expression, effectively unpivoting the columns to rows:

SELECT foreign_key_column_id
     , jsonb_agg(DISTINCT c.col1) AS column1
FROM   my_table t
LEFT   JOIN LATERAL (
   VALUES
     (column1_value_1)
   , (column1_value_2)
   , (column1_value_3)
   ) c(col1) ON col1 IS NOT NULL
GROUP  BY 1;

db<>fiddle here

Most importantly, the big table is only scanned once.
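As a minimal, self-contained illustration of the unpivot technique (the inline sample rows and the name `my_sample` are made-up stand-ins for `my_table`):

```sql
-- Self-contained demo of the LATERAL VALUES unpivot.
WITH my_sample(fk, v1, v2, v3) AS (
   VALUES (12, 'Foo',      NULL,       NULL::text)
        , (12, 'Bar %{1}', 'Bar %{2}', NULL)
)
SELECT fk
     , jsonb_agg(DISTINCT c.col1) AS column1
FROM   my_sample t
LEFT   JOIN LATERAL (
   VALUES (t.v1), (t.v2), (t.v3)         -- three columns become three rows
   ) c(col1) ON c.col1 IS NOT NULL       -- NULL values dropped before aggregation
GROUP  BY 1;
-- → fk = 12, column1 = ["Bar %{1}", "Bar %{2}", "Foo"]
```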

NULL values are removed early. Since column1_value_1 is defined NOT NULL there is always at least one row to join, and we never re-introduce NULL values by way of "no row". (Else we'd add a FILTER clause like below.)

Plus some other simplifications.


To process all at once (your added table with 3x3 columns):

SELECT foreign_key_column_id
     , jsonb_agg(DISTINCT c.col1) AS column1
     , jsonb_agg(DISTINCT c.col2) FILTER (WHERE c.col2 IS NOT NULL) AS column2 -- !
     , jsonb_agg(DISTINCT c.col3) FILTER (WHERE c.col3 IS NOT NULL) AS column3
FROM   my_table3 t
LEFT   JOIN LATERAL (
   VALUES
     (column1_value_1, column2_value_1, column3_value_1)
   , (column1_value_2, column2_value_2, column3_value_2)
   , (column1_value_3, column2_value_3, column3_value_3)
   ) c(col1, col2, col3) ON NOT (col1, col2, col3) IS NULL  -- !
GROUP  BY 1;

db<>fiddle here

Still a single scan over the big table.

Removing NULL values is not as simple here, as we cannot drop a whole row for a single NULL column. But we can still drop rows containing only NULL values right away. Testing ROW values for NULL is tricky. Note the syntax: NOT (col1, col2, col3) IS NULL.
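To illustrate why the exact form matters (a standalone sketch, not part of the query above):

```sql
SELECT (NULL, NULL)    IS NULL      AS all_null       -- true: every field is NULL
     , ('a', NULL)     IS NOT NULL  AS all_not_null   -- false: NOT every field is non-NULL
     , NOT ('a', NULL) IS NULL      AS keep_row;      -- true: at least one field is non-NULL
```

This is why `NOT (col1, col2, col3) IS NULL` is not the same test as `(col1, col2, col3) IS NOT NULL`.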

Remove the remaining NULL values with the FILTER clause in the outer aggregate.
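A tiny standalone example of FILTER stripping NULL values from an aggregate:

```sql
SELECT jsonb_agg(DISTINCT x) FILTER (WHERE x IS NOT NULL) AS vals
FROM  (VALUES ('a'), (NULL), ('b'), ('a')) v(x);
-- → ["a", "b"]
```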

We don't need FILTER for column1 as source columns are defined NOT NULL.
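For the final load into my_new_table, the INSERT could look like the sketch below. The mapping of the aggregated columns to headlines and descriptions is my assumption, and it presumes id is filled automatically (e.g. by an identity column or default); adjust to the actual schema. FILTER is kept on both aggregates here for safety:

```sql
-- Hedged sketch: col1 → headlines, col2 → descriptions is a guess.
INSERT INTO my_new_table (foreign_key_column_id, headlines, descriptions)
SELECT foreign_key_column_id
     , jsonb_agg(DISTINCT c.col1) FILTER (WHERE c.col1 IS NOT NULL) AS headlines
     , jsonb_agg(DISTINCT c.col2) FILTER (WHERE c.col2 IS NOT NULL) AS descriptions
FROM   my_table t
LEFT   JOIN LATERAL (
   VALUES (column1_value_1, column2_value_1)
        , (column1_value_2, column2_value_2)
        , (column1_value_3, column2_value_3)
   ) c(col1, col2) ON NOT (c.col1, c.col2) IS NULL  -- drop all-NULL rows early
GROUP  BY 1;
```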

Erwin Brandstetter