
I have some data migration that has to occur between a parent and child table. For the sake of simplicity, the schemas are as follows:

 -------    -----------
| event |  | parameter |
 -------    -----------
| id    |  | id        |
| order |  | eventId   |
 -------   | order     |
            -----------

Because of an oversight in some business logic, we now need to update parameter.order to match the parent event.order. I have come up with the following SQL to do that:

UPDATE "parameter"
SET "order" = e."order"
FROM "event" e
WHERE "eventId" = e.id

The problem is that this query still hadn't finished after more than 4 hours, and since I had to clock out, I cancelled it.

There are 11 million rows in parameter and 4 million rows in event. I've run EXPLAIN on the query and it tells me this:

Update on parameter (cost=706691.80..1706622.39 rows=11217313 width=155)
  ->  Hash Join  (cost=706691.80..1706622.39 rows=11217313 width=155)
        Hash Cond: (parameter."eventId" = e.id)
        ->  Seq Scan on parameter  (cost=0.00..435684.13 rows=11217313 width=145)
        ->  Hash  (cost=557324.91..557324.91 rows=7724791 width=26)
              ->  Seq Scan on event e  (cost=0.00..557324.91 rows=7724791 width=26)

According to this article, the "cost" reported by EXPLAIN is measured in an arbitrary unit of computation.

Ultimately, this update needs to be performed, but I would accept it happening in one of two ways:

  1. I am advised of a better way to do this query that executes in a timely manner (I'm open to all suggestions, including updating schemas, indexing, etc.)

  2. The query remains the same but I can somehow get an accurate prediction of its execution time (even if it's hours long). That way, at least, I can manage the team's expectations. I understand that the exact time can't be known without actually running the query, but is there an easy way to "convert" these arbitrary units into some millisecond execution time?
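
One way I could get real numbers without performing any writes, as jjanes later suggests in the comments, is to run EXPLAIN (ANALYZE) on the equivalent SELECT; that times the join itself in actual milliseconds, though the UPDATE's write phase would add substantially on top:

-- runs the join for real but writes nothing
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.id, e."order"
FROM "parameter" p
JOIN "event" e ON p."eventId" = e.id;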

Edit for Jim Jones' comment:

I executed the following query:

SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid

I got 9 identical rows like the following:

pid   | locktype | mode            | query       | query_start         | state
------+----------+-----------------+-------------+---------------------+--------
23192 | relation | AccessShareLock | <see below> | 2021-10-26 14:10:01 | active

query column:

--update parameter
--set "order" = e."order"
--from "event" e
--where "eventId" = e.id
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid

Edit 2: I think I've been stupid here... The query those lock rows point at is just my own lock-check query, with the cancelled UPDATE appearing only as a comment above it. I think that means there's actually nothing to report.
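
If I want to double-check for a genuinely blocking lock rather than listing my own session's locks, pg_locks has a granted column, so something like this (a sketch, untested) should show only sessions that are actually waiting:

-- only ungranted locks represent a session actually waiting
SELECT pl.pid, pl.locktype, pl.mode, psa.query, psa.query_start, psa.state
FROM pg_locks pl
LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
WHERE NOT pl.granted;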

Ollie
  • Did you check for a row lock? If any other process locks a record, your update has to wait. – Frank Heikens Oct 26 '21 at 12:43
  • anything on `SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid ` ? – Jim Jones Oct 26 '21 at 12:47
  • *For the sake of simplicity,* provide `CREATE TABLE` scripts showing data types and constraints. And always your version of Postgres. – Erwin Brandstetter Oct 26 '21 at 12:58
  • @JimJones Yes actually! I see a lot of `AccessShareLock` rows that reference the update query. They all say `active`. Any advice? – Ollie Oct 26 '21 at 13:10
  • @JimJones I've just updated my question with the results of the query you suggested – Ollie Oct 26 '21 at 13:16
  • As Erwin asked, what version is this? Also, if you change the update to a `SELECT *` from the same join, what do you get from an `EXPLAIN (ANALYZE)`? Specifically, how many batches does the hash join use? – jjanes Oct 26 '21 at 21:04

2 Answers


The query will be slow because, for each row it updates, it has to look up the matching event by id. Even with an index, on a large table this is a per-row read/write, so it is slow.

I'm not sure how to get a good estimate; maybe update 1% of the table, time it, and multiply?
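
A rough sketch of that idea, assuming Postgres 9.5+ for TABLESAMPLE (note that even a rolled-back UPDATE writes dead tuples, so run it off-peak):

BEGIN;
\timing on
-- touch roughly 1% of parameter's pages, then extrapolate x100
UPDATE "parameter" p
SET    "order" = e."order"
FROM   "event" e
WHERE  p."eventId" = e.id
AND    p.id IN (SELECT id FROM "parameter" TABLESAMPLE SYSTEM (1));
ROLLBACK;  -- discard the writes; we only wanted the elapsed time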

I suggest creating a new table from the join, renaming it into place, and dropping the old one later:

CREATE TABLE parameter_new AS
SELECT
  parameter.id,
  parameter."eventId",
  e."order"
FROM
  parameter
JOIN event AS e ON
  e.id = parameter."eventId";

Later, once you verify things:

ALTER TABLE parameter RENAME TO parameter_old;
ALTER TABLE parameter_new RENAME TO parameter;

Later, once you're completely certain:

DROP TABLE parameter_old;
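
Note that CREATE TABLE ... AS copies none of the old table's indexes, constraints, or defaults, so recreate those on parameter_new before the rename (check what exists with \d parameter). The names below are hypothetical examples, not your actual schema:

ALTER TABLE parameter_new ADD PRIMARY KEY (id);
ALTER TABLE parameter_new
  ADD CONSTRAINT parameter_eventid_fkey
  FOREIGN KEY ("eventId") REFERENCES event (id);
CREATE INDEX parameter_eventid_idx ON parameter_new ("eventId");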
mike.k
  • The query plan reveals that no index is used at all. But if there are any indexes, the OP has to do more: recreate those indexes on the new table ... – Erwin Brandstetter Oct 26 '21 at 12:59
  • Thanks for your suggestion Mike. I realise now I said I was open to all suggestions; however, I think creating a new table isn't going to be possible in this instance because of having to correctly copy all the constraints and indexes over - not to mention we'll be jumping ship from this broken schema soon. – Ollie Oct 26 '21 at 13:08

If some rows already have the target value, you can skip those empty updates (which are otherwise performed at full cost). Like:

UPDATE parameter p
SET    "order" = e."order"
FROM   event e
WHERE  p."eventId" = e.id
AND    p."order" IS DISTINCT FROM e."order";  -- this

If both "order" columns are defined NOT NULL, simplify to:

...
AND    p."order" <> e."order";


If you have to update all or most rows - and can afford it! - writing a new table may be cheaper overall, like Mike already mentioned. But concurrency and dependent objects may stand in the way.

Aside: use legal, lower-case identifiers, so you don't have to double-quote. Makes your life with Postgres easier.
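
A sketch of what that could look like here; note that "order" is a reserved word, so it needs a different name, not just lower-casing (and application code has to follow suit):

ALTER TABLE parameter RENAME COLUMN "eventId" TO event_id;
ALTER TABLE parameter RENAME COLUMN "order"   TO sort_order;
ALTER TABLE event     RENAME COLUMN "order"   TO sort_order;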

Erwin Brandstetter
  • Thanks for your response. Annoyingly, none of them will share the value since they both use the same sequence as the default. And agreed - the constant double quoting is the bane of my existence! Sadly I wasn't the one who originally designed the schema. – Ollie Oct 26 '21 at 13:06