Consider a custom aggregate intended to take the set union of a bunch of arrays:
CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
AS $$ SELECT s || n; $$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;
CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
AS $$
SELECT array_agg(i ORDER BY i) FROM (
SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
) AS w WHERE i IS NOT NULL;
$$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;
CREATE AGGREGATE array_union (ANYARRAY) (
SFUNC = array_union_step,
STYPE = ANYARRAY,
FINALFUNC = array_union_final,
INITCOND = '{}',
PARALLEL = SAFE
);
As I understand it, array concatenation in PostgreSQL copies all the elements of both inputs into a new array, so this is quadratic in the total number of elements (before deduplication). Is there a more efficient alternative without writing extension code in C? (Specifically, using either LANGUAGE SQL or LANGUAGE plpgsql.) For instance, maybe it's possible for the step function to take and return a set of rows somehow?
An example of the kind of data this needs to be able to process:
create temp table demo (tag int, values text[]);
insert into demo values
(1, '{"a", "b"}'),
(2, '{"c", "d"}'),
(1, '{"a"}'),
(2, '{"c", "e", "f"}');
select tag, array_union(values) from demo group by tag;
tag | array_union
-----+-------------
2 | {c,d,e,f}
1 | {a,b}
Note in particular that the built-in array_agg
cannot be used with this data, because the arrays are not all the same length:
select tag, array_agg(values) from demo group by tag;
ERROR: cannot accumulate arrays of different dimensionality