1

Consider a custom aggregate intended to take the set union of a bunch of arrays:

CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
   AS $$ SELECT s || n; $$
   LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
  AS $$
    SELECT array_agg(i ORDER BY i) FROM (
      SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
    ) AS w WHERE i IS NOT NULL;
  $$
  LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;

CREATE AGGREGATE array_union (ANYARRAY) (
  SFUNC = array_union_step,
  STYPE = ANYARRAY,
  FINALFUNC = array_union_final,
  INITCOND = '{}',
  PARALLEL = SAFE
);

As I understand it, array concatenation in PostgreSQL copies all the elements of both inputs into a new array, so this is quadratic in the total number of elements (before deduplication). Is there a more efficient alternative without writing extension code in C? (Specifically, using either LANGUAGE SQL or LANGUAGE plpgsql.) For instance, maybe it's possible for the step function to take and return a set of rows somehow?


An example of the kind of data this needs to be able to process:

create temp table demo (tag int, values text[]);
insert into demo values
   (1, '{"a", "b"}'),
   (2, '{"c", "d"}'),
   (1, '{"a"}'),
   (2, '{"c", "e", "f"}');

select tag, array_union(values) from demo group by tag;
 tag | array_union 
-----+-------------
   2 | {c,d,e,f}
   1 | {a,b}

Note in particular that the built-in array_agg cannot be used with this data, because the arrays are not all the same length:

select tag, array_agg(values) from demo group by tag;
ERROR:  cannot accumulate arrays of different dimensionality
zwol
  • 135,547
  • 38
  • 252
  • 361

1 Answers1

2

Array concatenation is expensive. That's why build-in array_agg() uses the internal array builder structure. Unfortunately, you cannot use this API on the SQL level.

I don't think using temp tables for custom aggregation is correct. Creating and dropping tables is expensive (temp table too) or very expensive (with high frequency), INSERT and SELECT is expensive, too. (Tables have a much more complex format than arrays.) If you need really fast aggregation, then write a C functions and use the array builder.

If you cannot use your own C extension, then use the built-in array_agg() function with deduplication and sort on already aggregated data.

CREATE OR REPLACE FUNCTION array_distinct(anyarray)
RETURNS anyarray AS $$
BEGIN
  RETURN ARRAY(SELECT DISTINCT v FROM unnest($1) u(v) WHERE v IS NOT NULL ORDER BY v);
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

Call:

SELECT ..., array_distinct(array_agg(somecolumn)) FROM tab;
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Pavel Stehule
  • 42,331
  • 5
  • 91
  • 94
  • 1
    Or just `LANGUAGE sql`? With `PARALLEL SAFE`? – Erwin Brandstetter Jul 30 '20 at 23:02
  • I'm curious, is it only array concatenation or also array indexed assignment that is slow? Could [this aggregate](https://stackoverflow.com/a/58098627/1048572) be improved by an `array_builder` as well? – Bergi Jul 31 '20 at 01:01
  • Unfortunately, this doesn't work for me, because `array_agg(somecolumn)` is only valid when `somecolumn` has a *scalar* type (`anyelement` in pg-ese) or when all the arrays in `somecolumn` have exactly the same dimensions. I have a column of 1D arrays of *variable* length. – zwol Jul 31 '20 at 02:14
  • @ErwinBrandstetter - yes, it is possible too, and it is faster – Pavel Stehule Jul 31 '20 at 04:40
  • @zwol - array_agg can aggregate arrays too. Just you need Postgres 9.5 and higher. But if cannot to use array_agg, then you need to write own C function. – Pavel Stehule Jul 31 '20 at 04:43
  • @Bergi - string concatenation is similar - string_agg use own variant of string builder. – Pavel Stehule Jul 31 '20 at 04:46
  • @PavelStehule I'm using Postgres 12.2 and `array_agg` does _not_ accept arrays that aren't all the same length. See the example data I just added to the question. – zwol Jul 31 '20 at 13:25
  • @zwol - aha -yes, there is this limit. Then you need to write own function in C – Pavel Stehule Jul 31 '20 at 14:30
  • Let me be clear that the question is specifically how do I do this _without_ dropping down to C. I appreciate your answer, though, it has clarified my understanding of the problem and may be useful to other people even if it isn't useful to me. – zwol Jul 31 '20 at 14:38
  • @zwol - I am sorry, there is not any other way – Pavel Stehule Jul 31 '20 at 18:12