5

I am attempting to update multiple columns on a table with values from another row in the same table:

CREATE TEMP TABLE person (
  pid INT
, name VARCHAR(40)
, dob DATE
, younger_sibling_name VARCHAR(40)
, younger_sibling_dob DATE
);

INSERT INTO person VALUES (pid, name, dob)
  (1, 'John' , '1980-01-05')
, (2, 'Jimmy', '1975-04-25')
, (3, 'Sarah', '2004-02-10')
, (4, 'Frank', '1934-12-12')
;

The task is to populate younger_sibling_name and younger_sibling_dob with the name and birthday of the person that is closest to them in age, but not older or the same age.

I can set the younger sibling dob easily because this is the value that determines the record to use with a correlated subquery (I think this is an example of that?):

UPDATE person SET younger_sibling_dob = (
SELECT MAX(dob)
FROM person AS sibling
WHERE sibling.dob < person.dob);

I just can't see any way to get the name?
The real query of this will run over about 1M rows in groups of 100-500 for each MAX selection so performance is a concern.

Edit

After trying many different approaches, I've decided on this one which I think is a good balance of being able to verify the data with the intermediate result, shows the intention of what the logic is, and performs adequately:

WITH sibling AS (
  SELECT person.pid, sibling.dob, sibling.name,
         row_number() OVER (PARTITION BY person.pid
                            ORDER BY sibling.dob DESC) AS age_closeness
  FROM person
  JOIN person AS sibling ON sibling.dob < person.dob
)
UPDATE person
  SET younger_sibling_name = sibling.name
     ,younger_sibling_dob  = sibling.dob
FROM sibling
WHERE person.pid = sibling.pid
   AND sibling.age_closeness = 1;

SELECT * FROM person ORDER BY dob;
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Jeremy
  • 1,397
  • 2
  • 13
  • 20
  • BTW: if you are so obsessed with performance, you could store a `sibling_id REFERENCES person(pid)` instead of the duplicated {sibling_name,sibling_dob} – wildplasser Mar 20 '13 at 17:40

3 Answers3

8

Rewrite 2022

I expect your added solution to perform poorly, as it's doing a of of unnecessary work. The following should be much faster.

The question and the added solution do not define which row to pick when there are multiple with the same dob. Typically you'll want a deterministic pick. This query picks the alphabetically first name from each group of peers with the same dob. Adapt to your needs.

UPDATE person p
SET    younger_sibling_name = y.name
     , younger_sibling_dob  = y.dob
FROM  (
   SELECT dob, name, lead(dob) OVER (ORDER BY dob) AS next_dob
   FROM  (
      SELECT DISTINCT ON (dob)
             dob, name
      FROM   person p
      ORDER  BY dob, name  -- ①
      ) sub
   ) y
WHERE  p.dob = y.next_dob;

db<>fiddle here - with extended test case

Works since at least Postgres 8.4.

Needs an index on dob to be fast, ideally a multicolumn index on (dob, name).

Subquery sub passes over the whole table once and distills distinct rows per dob.

① I added name to ORDER BY as tiebreaker to pick the row with the alphabetically first name. Adapt to our needs.

In the outer SELECT add the next later dob (next_dob) to each row with lead() - simple now with distinct dob. Then join to that next_dob and the rest is simple.

If no younger person exists, no UPDATE happens and the columns stay NULL.

About DISTINCT ON and possibly faster query techniques for many duplicates:

Taking dob and name from the same row guarantees we stay in sync. Multiple correlated subqueries would not offer this guarantee, and would be more expensive anyway.

Original answer

Still valid.

Old query 1

WITH cte AS (
   SELECT *, dense_rank() OVER (ORDER BY dob) AS drk
   FROM   person
    )
UPDATE person p
SET    younger_sibling_name = y.name
     , younger_sibling_dob  = y.dob
FROM   cte x
JOIN   (SELECT DISTINCT ON (drk) * FROM cte) y ON y.drk = x.drk - 1
WHERE  x.pid = p.pid;

Old sqlfiddle

In the CTE cte use the window function dense_rank() to get a rank without gaps according to the dop for every person.

Join cte to itself, but remove duplicates on dob from the second instance. Thereby everybody gets exactly one UPDATE. If more than one person share the same dop, the same one is selected as younger sibling for all persons on the next dob. I do this with:

   (SELECT DISTINCT ON (rnk) * FROM cte)

Add ORDER BY rnk, ... to this subquery to pick a particular person for every dob.

Old query 2

WITH cte AS (
   SELECT dob, min(name) AS name
        , row_number() OVER (ORDER BY dob) rn
   FROM   person p
   GROUP  BY dob
   )
UPDATE person p
SET    younger_sibling_name = y.name
     , younger_sibling_dob  = y.dob
FROM   cte x
JOIN   cte y ON y.rn = x.rn - 1
WHERE  x.dob = p.dob;

Old sqlfiddle

This works, because aggregate functions are applied before window functions. And it should be very fast since both operations agree on the sort order.

Obviates the need for a later DISTINCT like in query 1.

Result is the same as query 1, exactly.
Again, you can add more columns to ORDER BY to pick a particular person for every dob.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks for this alternative implementation. My lack of knowledge about window functions makes this approach very interesting. I'm going to do some sample queries over a real data set to get a feel for this. – Jeremy Mar 20 '13 at 03:25
  • @Jeremy: On revisiting, I think I found an even faster solution. – Erwin Brandstetter Mar 20 '13 at 04:19
  • Although I am not done with what I'm working on, this has shown me the right direction to take. To help maintenance, I have opted on doing simple window function calculations inside of SELECT CTE's and then deleting out the records that don't match the criteria (ranks, minimum value over range, etc) I'm looking for. Although not as fast as your method, I'll have a better shot at understanding it 6 months from now and it's fast enough for the time being. – Jeremy Mar 20 '13 at 05:41
  • BTW: CTEs form a planning barrier (they will never be combined/merged with the main query) This may cost performance if better plans can be achieved by merging the main query with its subqueries. (such as my `max() as NOT EXISTS()` or joined subselect.) Both would (given the right indexing) only need one indexed/ordered scan over the table instead of two. YMMV. – wildplasser Mar 20 '13 at 23:31
2

1) Finding the MAX() can alway be rewritten in terms of NOT EXISTS (...)

UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM person src
WHERE src.dob < dst.dob
   OR src.dob = dst.dob AND src.pid < dst.pid
AND NOT EXISTS (
        SELECT * FROM person nx
        WHERE nx.dob < dst.dob
           OR nx.dob = dst.dob AND nx.pid < dst.pid
        AND nx.dob > src.dob
           OR nx.dob = src.dob AND nx.pid > src.pid
        );

2) Instead of rank() / row_number(), you could also use a LAG() function over the WINDOW:

UPDATE person dst
SET younger_sibling_name = src.name
        ,younger_sibling_dob = src.dob
FROM    (
        SELECT pid
        , LAG(name) OVER win AS name
        , LAG(dob) OVER win AS dob 
        FROM person
        WINDOW win AS (ORDER BY dob, pid)
        ) src
WHERE src.pid = dst.pid
        ;

Both versions require a self-joined subquery (or CTE) because UPDATE does not allow window functions.

wildplasser
  • 43,142
  • 8
  • 66
  • 109
1

To get the dob and name, you can do:

update person
    set younger_sibling_dob = (select dob
                               from person p2
                               where s.dob < person.dob
                               order by dob desc
                               limit 1),
       younger_sibling_name = (select name
                               from person p2
                               where s.dob < person.dob
                               order by dob desc
                               limit 1)

If you have an index on dob, then the query will run faster.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • 2
    By using an `update person from (...) ` both sub-selects can be combined into a single one which is probably more efficient. –  Mar 19 '13 at 23:27
  • @a_horse_with_no_name: Right. Just a little more complex than that, since the correlated subqueries are functionally dependent on the current row. – Erwin Brandstetter Mar 20 '13 at 02:00
  • Thanks for the approach. I actually did try this in testing before the post. Unfortunately the cost of the query was so high that it never finished executing for me to verify the results. – Jeremy Mar 20 '13 at 03:27