
It is conventional wisdom that set-based processing of tables should always be preferred over RBAR, especially when the tables grow larger and/or you need to update many rows.

But does that always hold? I have experienced quite a few situations - on different hardware - where set-based processing shows exponential growth in time consumption, while splitting the same workload into smaller chunks gives linear growth.

I think it would be interesting either to be proven totally wrong - if I'm missing something obvious - or, if not, to learn when splitting the workload is worth the effort, and what indicators help decide which approach to use. I'm personally expecting the following factors to be interesting:

  • Size of workload
  • Size and growth of the log file
  • Amount of RAM
  • Speed of the disk system

Any others? Number of CPUs/CPU cores?

Example 1: I have a 12 million row table and I have to update one or two fields in each row with data from another table. If I do this in one simple UPDATE, it takes ~30 minutes on my test box. But I'll be done in ~24 minutes if I split it into twelve chunks, i.e.:

WHERE <key> BETWEEN 0 AND 1000000
WHERE <key> BETWEEN 1000001 AND 2000000
...
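
For concreteness, each chunk is just the same UPDATE with a range predicate appended; a minimal sketch, assuming hypothetical dbo.BigTable and dbo.SourceTable tables joined on a hypothetical Id key:

UPDATE t
SET    t.Field1 = s.Field1,
       t.Field2 = s.Field2
FROM   dbo.BigTable AS t
INNER JOIN dbo.SourceTable AS s
        ON s.Id = t.Id
WHERE  t.Id BETWEEN 1 AND 1000000;   -- next chunk: BETWEEN 1000001 AND 2000000, and so on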

Example 2: A 200+ million row table that also needs several calculations done to practically all rows. If I do the full set in one go, my box will run for three days and not even be done by then. If I write a simple C# program to execute the exact same SQL, but with WHERE clauses appended to limit transaction size to 100k rows at a time, it'll be done in ~14 hours.
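
The same 100k-row batching can also be sketched in pure T-SQL rather than driven from C# (the table, the column names, and the NULL-means-unprocessed convention below are assumptions, not my actual schema):

DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    -- each iteration is its own small, separately committed transaction
    UPDATE TOP (100000) t
    SET    t.CalcResult = t.Amount * t.Rate      -- placeholder calculation
    FROM   dbo.HugeTable AS t
    WHERE  t.CalcResult IS NULL;                 -- only rows not yet processed

    SET @rows = @@ROWCOUNT;
END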

For the record: My results are from the same databases, resting on the same physical hardware, with statistics updated, no changes in indexes, Simple recovery model, etc.

And no, I haven't tried 'true' RBAR, although I probably should - even though it would only be to see how long that would really take.

  • Are the 12 chunks in one transaction? – gbn Sep 03 '11 at 16:30
  • Nope, separately committed transactions. In fact, the 24 minutes includes me monitoring Management Studio, editing the WHERE-clause and hitting F5 to execute next chunk. –  Sep 03 '11 at 16:32
  • I'll comment rather than answer; I have no method of quantification. But I do know that performance can degrade as transactions grow in size (log file use), and that performance can degrade when data is being spooled to and from disk rather than staying in memory (joining massive tables without being able to use a merge join). So, there are certainly cases where scale makes breaking a problem into pieces a positive step. It does mean that you can create a "dirty state"; part of your DB is updated, and part is not. This can often be managed by carefully choosing how to break the problem up. – MatBailie Sep 03 '11 at 17:08

1 Answer


No, there is no rule that set-based is always faster. We have cursors for a reason (and don't be fooled into believing that a while loop or some other type of looping is really all that different from a cursor). Itzik Ben-Gan has demonstrated a few cases where cursors are much better, particularly for running totals problems. There are also cases like the ones you describe, where you're trying to update 12 million rows and, due to memory constraints, log usage or other reasons, it's just too much for SQL Server to handle as a single operation without spilling to tempdb, or settling on a sub-optimal plan because the optimizer terminated early without finding a better one quickly enough.

One of the reasons cursors get a bad rap is that people are lazy and just say:

DECLARE c CURSOR FOR SELECT ...

When they almost always should be saying:

DECLARE c CURSOR 
    LOCAL FORWARD_ONLY STATIC READ_ONLY 
    FOR SELECT ...

This is because those extra keywords make the cursor more efficient for various reasons. Based on the documentation you would expect some of those options to be redundant, but in my testing this is not the case. See this blog post of mine and this blog post from fellow SQL Server MVP Hugo Kornelis for more details.
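
To show where those options sit in practice, here is a minimal sketch of a full cursor loop built on that declaration (the table and column names are hypothetical):

DECLARE @Id int, @Amount decimal(18, 2);

DECLARE c CURSOR
    LOCAL FORWARD_ONLY STATIC READ_ONLY
    FOR SELECT Id, Amount FROM dbo.SomeTable ORDER BY Id;

OPEN c;
FETCH NEXT FROM c INTO @Id, @Amount;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row work goes here (e.g. accumulating a running total)
    FETCH NEXT FROM c INTO @Id, @Amount;
END

CLOSE c;
DEALLOCATE c;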

That all said, in most cases your best bet is going to be set-based (or at least chunky set-based as you described above). But for one-off admin tasks (which I hope your 12-million row update is), it is sometimes easier / more efficient to just write a cursor than to spend a lot of effort constructing an optimal query that produces an adequate plan. For queries that will be run a lot as normal operations within the scope of the application, those are worth more effort to try to optimize as set-based (keeping in mind that you may still end up with a cursor).

– Aaron Bertrand
  • You will note that I used the term "Conventional Wisdom" - meant as described on Wikipedia, i.e. not necessarily true and often an obstacle to the acceptance of new information. Unfortunately, this isn't really a one-off admin task, but something that happens once or twice per month. –  Sep 07 '11 at 10:10