
Let's say I have a list of primary keys, and for each row one value needs updating. Is it better to run:

-- run 10,000 of these queries
UPDATE mytable SET myflag = 1 WHERE id = [each_id]

Or combine updates into batch queries like this:

-- run 100 of these queries, where the IN () list contains about 100 elements
UPDATE mytable SET myflag = 1 WHERE id IN (3, 4, 5, 9, 99, ... 7887 )

How about 100 queries with 100 IN () items?

Erwin Brandstetter
Michael Butler
  • Do some benchmarking? If you can test it on _your_ system it will be the most accurate for _you_ – TheZ Jun 29 '12 at 16:56
  • It can heavily depend on if you are using transactions or not / transaction around each `UPDATE` or around every 100, etc. – biziclop Jun 29 '12 at 16:57
  • Well, it is frightfully easy to test this.... But, as long as your `IN()` < `max_allowed_packet`, and especially if there's a key on `myflag`, it is _likely_ faster. – Wrikken Jun 29 '12 at 16:59
  • nitpick: the first case updates 10k rows, the second 100k rows, not sure if that's intentional? – fvu Jun 29 '12 at 16:59
  • If separate transactions are used for each update, multiple records at a time will win by a landslide in PostgreSQL simply because of the reduced number of transactions. It seems like this long list of numbers might be contained in a table somewhere; use a select on that table in the WHERE clause if possible. – Matt Jun 29 '12 at 17:05
  • @fvu: I fixed the wrong numbers. – Erwin Brandstetter Jun 29 '12 at 21:55

5 Answers


Neither. In PostgreSQL I would instead:

WITH x AS (
   SELECT unnest('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
                 ,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40
                 ,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
                 ,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
                 ,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
                 }'::int[]) AS id
   )
UPDATE mytable t
SET    myflag = 1
FROM   x
WHERE  t.id = x.id;

I put that many IDs in the example to give a visual clue that 10,000 IDs is a lot. The two ideas presented in the question would either:

  1. have to parse the list and put together 10000 statements and send them to the server, which may very well take longer than the UPDATEs themselves.

  2. have to search a list (array) of 10,000 items for every individual id in mytable to find a match. Standard indexes can't be used, so this will be very slow, and performance degrades with the size of mytable.

An index on mytable.id is all the alternative presented above needs to outperform both variants by an order of magnitude.

The CTE parses the array only once (a subquery works, too - see the sketch below; MySQL has no CTEs) - and unnest() is rather fast at that. Doing it all in one statement beats 10,000 statements by an order of magnitude. Add another order of magnitude if those statements run in individual transactions, and yet another if each runs in its own session.
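
The same statement can be written with a derived table instead of the CTE - a minimal sketch, using the same hypothetical mytable / myflag names and a shortened ID list:

UPDATE mytable t
SET    myflag = 1
FROM  (SELECT unnest('{1,2,3,4,5}'::int[]) AS id) x   -- derived table instead of a CTE
WHERE  t.id = x.id;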

Rare exceptions apply for databases with locking issues under heavy write load. Just benchmark as has been advised. EXPLAIN ANALYZE is your friend in PostgreSQL.
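
To make that concrete: since EXPLAIN ANALYZE actually executes the statement, one way to time a candidate UPDATE in PostgreSQL without keeping its effect is to wrap it in a transaction and roll back (ID values are just the question's examples):

BEGIN;
EXPLAIN ANALYZE
UPDATE mytable SET myflag = 1
WHERE  id IN (3, 4, 5, 9, 99);   -- EXPLAIN ANALYZE runs the UPDATE for real
ROLLBACK;                        -- so roll it back after reading the timings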

If the operation grows huge, and most of the table is updated and/or you are running low on disk space or RAM, it may still be a good idea to split the operation into several logical chunks - just not too many; find the sweet spot. That is mostly to let HOT updates recycle table bloat from previous UPDATE runs. Consider this related question.
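
One way to chunk, sketched under the assumption that the IDs are passed as one array: slice the array and run one UPDATE per slice (each slice ideally in its own transaction), for example:

-- first chunk: array elements 1 .. 4
UPDATE mytable t
SET    myflag = 1
FROM   unnest(('{1,2,3,4,5,6,7,8}'::int[])[1:4]) AS x(id)
WHERE  t.id = x.id;

-- second chunk: array elements 5 .. 8
UPDATE mytable t
SET    myflag = 1
FROM   unnest(('{1,2,3,4,5,6,7,8}'::int[])[5:8]) AS x(id)
WHERE  t.id = x.id;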

Erwin Brandstetter

I have found the second way to be orders of magnitude faster when doing inserts for very large data sets. This is very dependent on your system, however, as the IN part of the query will be more or less efficient depending on table size, indexes, etc.

Doing your own simple benchmarks is really the only way to go.

Matt Gibson

In the normal case it would be most efficient to run one update statement. E.g.,

UPDATE mytable SET myflag = 1 WHERE id IN (SELECT id FROM someothertable WHERE stuff);

It's possible that might be slower based on your schema. You should benchmark and find out.

Note: what is almost sure to be slower is running 10,000 statements from your client to the database server. Running 10,000 updates in a stored procedure and running 10,000 updates from your client are two VERY different things. If you're going to go the 10,000-updates route, make sure to do it in a stored procedure.
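
If you do go that route in PostgreSQL, a server-side function keeps all the single-row updates inside the database and avoids the per-statement round-trips - a minimal PL/pgSQL sketch (the function name is made up; FOREACH ... IN ARRAY needs PostgreSQL 9.1+):

CREATE OR REPLACE FUNCTION set_flags(ids int[]) RETURNS void AS $$
DECLARE
   i int;
BEGIN
   FOREACH i IN ARRAY ids LOOP
      -- one UPDATE per id, but no client/server round-trip per statement
      UPDATE mytable SET myflag = 1 WHERE id = i;
   END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT set_flags('{3, 4, 5, 9, 99}'::int[]);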

aquinas

Typically, RDBMS round-trips are a dominating factor, but in this case the cost of parsing the IN list may get high, too. If you parameterize your queries, however, the second solution is very likely going to be faster, because parsing is done only once.

Sergey Kalinichenko
  • May be off topic, but how does one parameterize a variable-length list argument? In other words, something like IN (?) and then bind a variable-length list to the ?. – Michael Butler Jun 29 '12 at 22:22
  • @MichaelButler Unfortunately, no: you create a list of 100 individual `?`s; then you bind each variable separately. When you do the last "odd" batch, you generate another SQL with as many `?` as you've got remaining. – Sergey Kalinichenko Jun 30 '12 at 00:23
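
In PostgreSQL specifically, one way around the variable-length placeholder problem is to bind the whole ID list as a single array parameter - a sketch using a prepared statement (the statement name is illustrative):

PREPARE flag_ids (int[]) AS
UPDATE mytable SET myflag = 1
WHERE  id = ANY ($1);          -- one parameter, no matter how many IDs it holds

EXECUTE flag_ids ('{3, 4, 5, 9, 99}'::int[]);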

It depends mostly on the number of fsyncs to the hard disk: that's the slowest part of your system.

For PostgreSQL: do it in a small number of transactions, if possible in just one transaction. But keep an eye on row locking: two transactions can't update the same row at the same time.
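
A sketch of that one-transaction approach, reusing the question's batch form (the ID values are illustrative):

BEGIN;
UPDATE mytable SET myflag = 1 WHERE id IN (3, 4, 5, 9, 99);
UPDATE mytable SET myflag = 1 WHERE id IN (102, 103, 7887);
COMMIT;   -- one commit (and one WAL flush) for the whole batch instead of one per statement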

Frank Heikens
  • What if the table isn't indexed? Then it would have to do 10,000 table scans as opposed to 1, correct? – aquinas Jun 29 '12 at 17:21
  • A table scan in itself isn't bad, it all depends. That's also why you have to take a look at EXPLAIN to see how a query is executed and whether it has an index that can be used. – Frank Heikens Jun 29 '12 at 17:34