Aggregate single array of distinct elements from array column, excluding NULL

Question

I'm trying to roll up the distinct non-null values of timestamps stored in a PostgreSQL 9.6 database column.

So given a table containing the following:

date_array
------------------------
{2019-10-21 00:00:00.0}
{2019-08-06 00:00:00.0,2019-08-05 00:00:00.0}
{2019-08-05 00:00:00.0}
(null)
{2019-08-01 00:00:00.0,2019-08-06 00:00:00.0,null}

The desired result would be:

{2019-10-21 00:00:00.0, 2019-08-06 00:00:00.0, 2019-08-05 00:00:00.0, 2019-08-01 00:00:00.0}

The arrays can be different sizes so most solutions I've tried end up running into a Code 0:

SQL State: 2202E  
ERROR: cannot accumulate arrays of different dimensionality.

Some other caveats:

The arrays can be null, the arrays can contain a null. They happen to be timestamps of just dates (eg without time or timezone). But in trying to simplify the problem, I've had no luck in changing the sample data to strings (e.g {foo, bar, (null)}, {foo,baz}) - just to focus on the problem and eliminate any issues I miss/don't understand about timestamps w/o timezone.

This following SQL is the closest I've come (it resolves all but the different dimensionality issues):

SELECT 
   ARRAY_REMOVE ( ARRAY ( SELECT DISTINCT UNNEST ( ARRAY_AGG ( CASE WHEN ARRAY_NDIMS(example.date_array) > 0 AND example.date_array IS NOT NULL THEN example.date_array ELSE '{null}' END ) ) ), NULL) as actualDates
FROM example;

I created the following DB fiddle with sample data that illustrates the problem if the above is lacking: https://www.db-fiddle.com/f/8m469XTDmnt4iRkc5Si1eS/0

Additionally, I've perused stackoverflow on the issue (as well as PostgreSQL documentation) and there are similar questions with answers, but I've found none that are articulating the same problem I'm having.

klin · Answer 1 · 2021-08-06T16:01:12.147

Use unnest() in FROM clause (in a lateral join):

select array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null

Test it in DB Fiddle.

A general note. An alternative solution using an array constructor is more efficient, especially in cases as simple as described. Personally, I prefer to use aggregate functions because this query structure is more general and flexible, easy to extend to handle more complex problems (e.g. having to aggregate more than one column, grouping by another column, etc). In these non-trivial cases, the performance differences tend to decrease, but the code using aggregates remains cleaner and more readable. It's an extremely important factor when you have to maintain really large and complex projects.

See also In Postgres select, return a column subquery as an array?

Erwin Brandstetter · Accepted Answer · 2021-08-06T16:46:21.300

1

Plain array_agg() does this with arrays:

Concatenates all the input arrays into an array of one higher dimension. (The inputs must all have the same dimensionality, and cannot be empty or null.)

Not what you need. See:

Is there something like a zip() function in PostgreSQL that combines two arrays?

You need something like this: unnest(), process and sort elements an feed the resulting set to an ARRAY constructor:

SELECT ARRAY(
   SELECT DISTINCT elem::date
   FROM  (SELECT unnest(date_array) FROM example) AS e(elem)
   WHERE  elem IS NOT NULL
   ORDER  BY elem DESC
   );

db<>fiddle here

To be clear: we could use array_agg() (taking non-array input, different from your incorrect use) instead of the final ARRAY constructor. But the latter is faster (and simpler, too, IMO).

They happen to be timestamps of just dates (eg without time or timezone)

So cast to date and trim the noise.

Should be the fastest way:

A correlated subquery is a bit faster than a LATERAL one (and does the simple job).
An ARRAY constructor is a bit faster than the aggregate function array_agg() (and does the simple job).
Most importantly, sorting and applying DISTINCT in a subquery is typically faster than inline ORDER BY and DISTINCT in an aggregate function (and does the simple job).

See:

Performance comparison:

db<>fiddle here

edited Aug 06 '21 at 16:46

answered Aug 06 '21 at 02:56

Erwin Brandstetter

605,456
145
1,078
1,228

`array_agg ()` has been widely used for years for any non-array parameters. The variant with array parameters was introduced later as an additional functionality. Your perception of this is somewhat extravagant, to put it mildly. – klin Aug 06 '21 at 09:10
@klin I added a performance comparison to provide some numbers for your "extravagant perception". – Erwin Brandstetter Aug 06 '21 at 14:06
I did not mention performance in my comment. So let me put it straight: the first paragraph of the answer is untrue (or at least incomplete). There is no *standard* and *non-standard* `array_agg()`. The function is commonly used to aggregate non-array values and has an option to work on arrays as well (in a limited manner). I allowed myself to point it out so that a potential reader would not be misled. – klin Aug 06 '21 at 15:58
@klin: Nothing is untrue here. I replaced the word "standard" with "plain" for your ease of mind. To be precise, since Postgres 9.5 there are two distinct variants of `array_agg()`, one taking non-array input, one taking array input. – Erwin Brandstetter Aug 06 '21 at 16:05
1

Yes, there are two variants of the function. It remains a mystery why you are providing a description of one that is unrelated to the problem. – klin Aug 06 '21 at 16:22

Aggregate single array of distinct elements from array column, excluding NULL

2 Answers2