Is it wasteful to use ARRAY_AGG to get the first non-NULL value in a column?

Question

I use PostgreSQL 14 to manage a table which stores updates to a table of medics: users can update the first name, last name, and or/ the age of the medic. A field which was not touched by an update operation has a NULL value.

Here's an example of four edits touching two separate medics. The medic with ID 3 received three updates: the first two are updating the age field, the third one touches the first_name:

SELECT * FROM medic_edits;

id	medic_id	first_name	age
1	1	Indy
2	3		59
3	3		63
4	3	Bob

I would like to merge this table such that in the resulting table there's one row per medic, giving the cumulative edits. This is my current query and the output it produces:

SELECT
  medic_id,
  (ARRAY_REMOVE(ARRAY_AGG(first_name ORDER BY id DESC), NULL))[1] AS first_name,
  (ARRAY_REMOVE(ARRAY_AGG(last_name ORDER BY id DESC), NULL))[1] AS last_name,
  (ARRAY_REMOVE(ARRAY_AGG(age ORDER BY id DESC), NULL))[1] AS last_name
FROM medic_edits
GROUP BY medic_id
;

medic_id	first_name	last_name	last_name
1	Indy
3	Bob		63

This is exactly the output I expected, but I suspect that the ARRAY_REMOVE/ARRAY_AGG logic is a bit wasteful. I wonder if there is a way to use partitions for good profit here, the FIRST_VALUE function looks very relevant.

Erwin Brandstetter · Accepted Answer · 2022-09-17T14:30:33.180

4

Yes, it's wasteful. I expect this to be faster:

SELECT DISTINCT ON (medic_id)
       medic_id
     , first_value(first_name) OVER (PARTITION BY medic_id ORDER BY CASE WHEN first_name IS NOT NULL THEN id END) AS first_name
     , first_value(last_name)  OVER (PARTITION BY medic_id ORDER BY CASE WHEN last_name  IS NOT NULL THEN id END) AS last_name
     , first_value(age)        OVER (PARTITION BY medic_id ORDER BY CASE WHEN age        IS NOT NULL THEN id END) AS age
FROM   medic_edits;

For descending id value, use instead:

       first_value(first_name) OVER (PARTITION BY medic_id ORDER BY CASE WHEN first_name IS NOT NULL THEN id END DESC NULLS LAST) AS first_name

See:

Sort by column ASC, but NULL values first?

But there are probably faster ways, yet. Also depends on the exact table definition, cardinalities, and data distribution.

See:

Fetch a row that contains the set of last non-NULL values for each column

About DISTINCT ON:

Select first row in each GROUP BY group?

Works in a single SELECT because DISTINCT or DISTINCT ON are applied after window functions. See:

Best way to get result count before LIMIT was applied

Aside: "age" is going to bit-rot rapidly. It's typically superior to store a birthday.

edited Sep 17 '22 at 14:30

answered Sep 17 '22 at 14:02

Erwin Brandstetter

605,456
145
1,078
1,228

Thanks a lot for referencing that other question - it seems very relevant to what I'm trying to achieve! – Frerich Raabe Sep 17 '22 at 14:07
You'll find a variety of matching solutions there. The winner depends on undisclosed details. But note the important difference: you are looking for one row per group, while the other question is after a single row. – Erwin Brandstetter Sep 17 '22 at 14:18
1

Never seen bit-rot used with this meaning, but I find it a very apt extension of the standard meaning – Jonathan Willcock Sep 17 '22 at 14:20
I readily confess that `age` wasn't the best example datum I could have picked. :-) I needed to adjust the expression in this answer a little bit to account for the fact that each group should be sorted in _descending_ order (such that the most recent edit is used), e.g. `first_value(age) OVER (PARTITION BY medic_id ORDER BY CASE WHEN age IS NULL THEN 0 ELSE id END DESC)`, taking advantage of `id` values starting at 1. – Frerich Raabe Sep 17 '22 at 14:27
@FrerichRaabe: I added a faster & cleaner option for descending order. – Erwin Brandstetter Sep 17 '22 at 14:31
@ErwinBrandstetter Huh, interesting, let me study that! I'll have to read up to see if the "repeated" partitioning would be an issue if I had more than just three columns to deal with. Thanks for that version based on `NULLS LAST`, I totally forgot that this is a thing! :-) – Frerich Raabe Sep 17 '22 at 14:32

score 0 · Answer 2 · answered Sep 17 '22 at 14:58

I have written two example queries for you that give the same results:

select
    medic_id,
    max(first_name) as first_name,
    max(last_name) as last_name,
    max(age) as age  
from medic_edits
group by medic_id 


select distinct 
    medic_id,  
    first_value(first_name) over (partition by medic_id order by first_name desc) as first_name,
    first_value(last_name) over (partition by medic_id  order by last_name desc) as last_name,
    max(age) over (PARTITION BY medic_id) as age
from medic_edits


Results:
medic_id    first_name  last_name   age
1           Indy        
3           Bob                     63

Neither does the right thing. `max()` is only applicable if there is a single non-null value in the group. — Erwin Brandstetter, Sep 17 '22 at 15:01

score 0 · Answer 3 · answered Sep 17 '22 at 15:36

Sorry I don't understand your question correctly, I wrote new query, this is right.

select distinct 
    medic_id,  
    first_value(coalesce(first_name)) over (partition by medic_id order by case when first_name is not null then 0 else 1 end ASC, id desc) as first_name,
    first_value(coalesce(last_name)) over (partition by medic_id  order by case when last_name is not null then 0 else 1 end ASC, id desc) as last_name,
    first_value(coalesce(age)) over (partition by medic_id  order by case when age is not null then 0 else 1 end ASC, id desc) as age
from medic_edits

Is it wasteful to use ARRAY_AGG to get the first non-NULL value in a column?

3 Answers3