
I have a table with a primary id column (automatically indexed), two sub-id columns (also indexed), and 12 columns of type geometry(Polygon). If I am going to update ~2 million rows on this table, is it faster for me to run ~2 million individual update statements like

update TABLE_NAME set ( COLUMNS ) = ( VALUES ) where ID_COLUMN = NEXT_ID

or is it faster to do some smaller number of larger update statements like in this answer

update TABLE_NAME as update_t set
COLUMNS = new_vals.COLUMNS
from (values
(id, polygon1val, polygon2val, ... polygon12val),   /* row 1 */
(id, polygon1val, polygon2val, ... polygon12val),   /* row 2 */
...                                                 /* ...   */
(id, polygon1val, polygon2val, ... polygon12val)    /* row N */
) as new_vals( COLUMNS )
where new_vals.id = update_t.id

If the latter, do you have any suggestions on what a good N might be? Is N = ~2 million, or some smaller batch size (with the statement repeated until all the rows are done)?

EDIT: Obviously, in the former case I would use a prepared statement. But I also wonder, in the latter case, is there any benefit in trying to use a prepared statement?
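For concreteness, the per-row prepared statement might look roughly like the sketch below (update_one_row is just an illustrative name, the polygon literals are dummy values, and only two of the twelve geometry columns are shown):

prepare update_one_row (bigint, geometry, geometry /* ..., one parameter per polygon column */) as
  update TABLE_NAME
  set    polygon1 = $2,
         polygon2 = $3           /* ..., through $13 for all 12 columns */
  where  ID_COLUMN = $1;

/* executed once per row */
execute update_one_row (42, 'POLYGON((0 0,1 0,1 1,0 0))', 'POLYGON((0 0,2 0,2 2,0 0))');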

I'm using PostgreSQL 9.2.

caps
  • Re a good value of `N`: There's usually almost no limit below which individual statements are better, but above a certain workload-dependent point it can be worth creating a temporary table, `COPY`ing the data into it, and creating any useful indexes on the table, then doing your joined update against the temp table. PostgreSQL doesn't know how to materialize and index a VALUES-list when it's faster to do so. – Craig Ringer Dec 09 '13 at 04:45
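A rough sketch of that temp-table approach, with placeholder names (new_vals, the CSV path) and only two of the twelve polygon columns shown:

create temporary table new_vals (
    id        bigint primary key,     /* the primary key doubles as the useful index */
    polygon1  geometry(Polygon),
    polygon2  geometry(Polygon)       /* ..., through polygon12 */
);

/* bulk-load the replacement values; COPY ... FROM STDIN via your client works too */
copy new_vals from '/path/to/new_vals.csv' with (format csv);

analyze new_vals;

update TABLE_NAME as update_t
set    polygon1 = new_vals.polygon1,
       polygon2 = new_vals.polygon2   /* ..., through polygon12 */
from   new_vals
where  new_vals.id = update_t.id;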

1 Answer


In general, the better you can batch operations into sets, the more options the database has for making things fast. If you run the updates individually, its only option is something like "locate the one affected row, delete it, insert the new one".

If you can batch the updates then the planner gets to decide whether a sequential scan may be faster than a bunch of index scans (and it may well be, since you get to leverage read-ahead caching). In other words, one command updating a lot of rows almost always performs better than a lot of commands updating a single row, even aside from planning overhead.
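If you want to see which plan you actually get for your data, you can wrap the batched form in EXPLAIN (ANALYZE, BUFFERS) inside a transaction and roll it back afterwards, since EXPLAIN ANALYZE really performs the update. The table and column names below are placeholders, and a real run would have the full ~2 million rows in the VALUES list rather than two:

begin;

explain (analyze, buffers)
update TABLE_NAME as update_t
set    polygon1 = new_vals.polygon1
from   (values
         (1, 'POLYGON((0 0,1 0,1 1,0 0))'::geometry),
         (2, 'POLYGON((0 0,2 0,2 2,0 0))'::geometry)
       ) as new_vals (id, polygon1)
where  new_vals.id = update_t.id;

rollback;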

Chris Travers
  • Yep, and there's also query setup/teardown time, client/server round trip latencies (even on localhost you've got context switch and re-scheduling costs), etc. Most of which don't go away with the use of prepared statements. – Craig Ringer Dec 09 '13 at 04:43
  • Interesting. My experience was that the individual updates (or batches of 3-5) were faster. I wonder if I was doing something wrong. – caps Dec 09 '13 at 16:48
  • I'm not sure, but I think batch updates hit a record count beyond which they start to get slower. In other words, batch updates are good, but if your batch is so large that it exhausts the database server's resources (such as RAM), you lose the benefit of batching. – deFreitas Apr 11 '18 at 00:53