1

I am trying to update every row in order_item. Status is a newly created column, and must have the latest value from the order_update table. One item can have several updates.

I am using PostgreSQL 9.1

I have this update sql.
The table order_item has 800K records.
The table order_update has 5Mil records.

update order_item
set status = (
    select production_stage
    from order_update
    where id = (
        select max(id)
        from order_update
        where order_item_id = order_item.id
    )
);

How can I make this sql perform the best way. I know the update will take some time, just want to have it as fast as possible.

I found that when doing just this sql on 5Mil records.

select max(id) from order_update where order_item_id = 100;

Explain:

Result  (cost=784.10..784.11 rows=1 width=0)"   InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..784.10 rows=1 width=8)
          ->  Index Scan Backward using order_update_pkey on order_update  (cost=0.00..104694554.13 rows=133522 width=8)
                Index Cond: (id IS NOT NULL)
                Filter: (order_item_id = 100)

it takes about 6 seconds.

When I do the same sql in 1Mil records:
Explain:

Aggregate  (cost=13.43..13.44 rows=1 width=8)   ->  Index Scan using
order_update_order_item_id_idx on order_update  (cost=0.00..13.40
rows=11 width=8)
        Index Cond: (order_item_id = 100)

it takes around 11 ms.
11 ms vs. 6 sec. Why the HUGE diff?

To narrow it down a little I try this :

select id from order_update where order_item_id = 100 order by id asc
limit 1 
Total query runtime: 41 ms.

and then this :

select id from order_update where order_item_id = 100 order by id desc
limit 1 
Total query runtime: 5310 ms.

so a huge diff in asc and desc.

Solution : Create index :

CREATE INDEX order_update_mult_idx ON order_update (order_item_id, id DESC);

Update :

UPDATE order_item i
SET    test_print_provider_id = u.test_print_provider_id
FROM  (
   SELECT DISTINCT ON (1)
          test_print_provider_id
   FROM   orders
   ORDER  BY 1, id DESC
   ) u
WHERE  i.order_id = u.id
AND    i.test_print_provider_id IS DISTINCT FROM u.test_print_provider_id;
klind
  • 855
  • 2
  • 21
  • 33
  • Without intimate knowledge of PostgreSQL, I'd probably try putting the results of your aggregate query in a temp table and then referencing it in the `UPDATE` statement. – Yuck Nov 27 '13 at 00:45
  • Can you confirm, it looks like you want to update every single record in the `order_items` table (all 800k). Is that correct? I am asking because it feels like a "status" column would only be updated on certain records when needed. – Paul Richter Nov 27 '13 at 01:53
  • 1
    You should explain (in plain English) what you are trying to do. Also, as always, your version of Postgres and the relevant parts of the table definitions (`\d tbl` in psql) ... – Erwin Brandstetter Nov 27 '13 at 03:55

3 Answers3

3

My educated guess: this will be substantially faster.

UPDATE order_item i
SET    status = u.production_stage
FROM  (
   SELECT DISTINCT ON (1)
          order_item_id, production_stage
   FROM   order_update
   ORDER  BY 1, id DESC
   ) u
WHERE  i.id = u.order_item_id
AND    i.status IS DISTINCT FROM u.production_stage;   -- avoid empty updates
  • There is a subtle difference to the query in the question. The original one updates every row of order_item. If no matching rows in order_update are found, this result in status being set to NULL. This query leaves those rows alone (original value kept, no update).

  • Detailed explanation for the subquery with DISTINCT ON in this closely related answer:
    Select first row in each GROUP BY group?

  • Generally, a single subquery should outperform your approach with correlated subqueries easily. Even more so with an optimized query.

  • If order_item.status should be defined NOT NULL, the last line can be simplified with <>.

  • A multicolumn index like this might help:

    CREATE INDEX order_update_mult_idx ON order_update(order_item_id, id DESC);
    

    The descending order on the second column is essential.
    However, since you are using all or most of both tables in a single scan, an index is probably not going to help. Except for a covering index, maybe, in Postgres 9.2 or later:

    CREATE INDEX order_update_mult_idx
    ON order_update(order_item_id, id DESC, production_stage);
    

EXPLAIN only gives you the plan Postgres came up with. These numbers can be way off if the planner estimates and cost parameters are not set accurately. To get actual performance data, you would have to run EXPLAIN ANALYZE - which will take a long time for big tables, of course, since it test-executes the query.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

It will help if you have an index in order_update on the Id that includes order_item_id and production_stage. Other than that, this is fairly straightforward. Using a temporary table instead of a subquery might be an option, but I don't see much else that can be improved.

David Garrison
  • 2,546
  • 15
  • 25
0

What about the following reconstruct?

update order_item
set status = (
    select a.production_stage from (
        select ou.id, ou.production_stage
        from order_update ou
        where ou.order_item_id = order_item.id
        order by ou.id desc
    ) a limit 1
);

EDIT: Since the above is slower, what about the following reconstruct?

update order_item
set status = (
    select a.production_stage from (
/********************************************** INNER QUERY START **/
        select ou.order_item_id, ou.production_stage
        from order_update ou
        INNER JOIN (
            select order_item_id, max(id) as max_id
            from order_update
            group by order_item_id
        ) ou_max ON (ou.order_item_id = ou_max.order_item_id
                     AND ou.id = ou_max.max_id)
/********************************************** INNER QUERY END **/
    ) a where a.order_item_id = order_item.id
);

In this, your DBMS will execute the inner query only once to create the temporary table A. After this, it will simply act like: update order_item set status = (select a.production_stage from a where a.order_item_id = order_item.id);. This will be very fast because A is already created and available as a fixed table for the entire update - it is not re-created for each order_item_id.

ADTC
  • 8,999
  • 5
  • 68
  • 93
  • This one is actually slower than : update order_item set status = (select production_stage from order_update where id = ( select max(id) from order_update where order_item_id = order_item.id)); – klind Dec 02 '13 at 17:39