59

I have an UPDATE statement that can update more than a million records. I want to update them in batches of 1,000 or 10,000. I tried using @@ROWCOUNT, but I am unable to get the desired result.

Just for testing purposes, I selected a table with 14 records and set a row count of 5. The query is supposed to update the records in batches of 5, 5, and 4, but it only updates the first 5 records.

Query - 1:

SET ROWCOUNT 5

UPDATE TableName 
SET Value = 'abc1' 
WHERE Parameter1 = 'abc' AND Parameter2 = 123

WHILE @@ROWCOUNT > 0
BEGIN
    SET rowcount 5

    UPDATE TableName 
    SET Value = 'abc1' 
    WHERE Parameter1 = 'abc' AND Parameter2 = 123

    PRINT (@@ROWCOUNT)
END

SET rowcount 0

Query - 2:

SET ROWCOUNT  5

WHILE (@@ROWCOUNT > 0)
BEGIN
    BEGIN TRANSACTION

    UPDATE TableName 
    SET Value = 'abc1' 
    WHERE Parameter1 = 'abc' AND Parameter2 = 123

    PRINT (@@ROWCOUNT)

    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT TRANSACTION

        BREAK
    END

    COMMIT TRANSACTION
END

SET ROWCOUNT  0

What am I missing here?

marc_s
GThree

7 Answers

91
  1. You should not be updating 10k rows in a set unless you are certain that the operation is getting Page Locks (due to multiple rows per page being part of the UPDATE operation). The issue is that Lock Escalation (from either Row or Page to Table locks) occurs at 5000 locks. So it is safest to keep it just below 5000, just in case the operation is using Row Locks.

  2. You should not be using SET ROWCOUNT to limit the number of rows that will be modified. There are two issues here:

    1. It has been deprecated since SQL Server 2005 was released (11 years ago):

      Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax

    2. It can affect more than just the statement you are dealing with:

      Setting the SET ROWCOUNT option causes most Transact-SQL statements to stop processing when they have been affected by the specified number of rows. This includes triggers. The ROWCOUNT option does not affect dynamic cursors, but it does limit the rowset of keyset and insensitive cursors. This option should be used with caution.

    Instead, use the TOP () clause.

  3. There is no purpose in having an explicit transaction here. It complicates the code and you have no handling for a ROLLBACK, which isn't even needed since each statement is its own transaction (i.e. auto-commit).

  4. Assuming you find a reason to keep the explicit transaction, then you do not have a TRY / CATCH structure. Please see my answer on DBA.StackExchange for a TRY / CATCH template that handles transactions:

    Are we required to handle Transaction in C# Code as well as in Store procedure
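
    The linked answer has the full template; purely as a minimal illustration (not the template from that answer), a TRY / CATCH wrapper around an explicit transaction for this UPDATE might look roughly like this:

    BEGIN TRY
        BEGIN TRANSACTION;

        UPDATE TableName
        SET    Value = 'abc1'
        WHERE  Parameter1 = 'abc'
        AND    Parameter2 = 123;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- undo the partial work and re-raise the error to the caller
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;

        DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
        RAISERROR(@ErrorMessage, 16, 1);
    END CATCH;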

I suspect that the real WHERE clause is not being shown in the example code in the Question, so simply relying upon what has been shown, a better model (please see note below regarding performance) would be:

DECLARE @Rows INT,
        @BatchSize INT; -- keep below 5000 to be safe
    
SET @BatchSize = 2000;

SET @Rows = @BatchSize; -- initialize just to enter the loop

BEGIN TRY    
  WHILE (@Rows = @BatchSize)
  BEGIN
      UPDATE TOP (@BatchSize) tab
      SET    tab.Value = 'abc1'
      FROM  TableName tab
      WHERE tab.Parameter1 = 'abc'
      AND   tab.Parameter2 = 123
      AND   tab.Value <> 'abc1' COLLATE Latin1_General_100_BIN2;
      -- Use a binary Collation (ending in _BIN2, not _BIN) to make sure
      -- that you don't skip differences that compare the same due to
      -- insensitivity of case, accent, etc, or linguistic equivalence.

      SET @Rows = @@ROWCOUNT;
  END;
END TRY
BEGIN CATCH
  -- placeholder error handling: re-raise whatever error was caught
  DECLARE @ErrMsg NVARCHAR(4000) = ERROR_MESSAGE();
  RAISERROR(@ErrMsg, 16, 1);
  RETURN;
END CATCH;

By testing @Rows against @BatchSize, you can avoid that final UPDATE query (in most cases) because the final set is typically some number of rows less than @BatchSize, in which case we know that there are no more to process (which is what you see in the output shown in your answer). Only in those cases where the final set of rows is equal to @BatchSize will this code run a final UPDATE affecting 0 rows.

I also added a condition to the WHERE clause to prevent rows that have already been updated from being updated again.

NOTE REGARDING PERFORMANCE

I emphasized "better" above (as in, "this is a better model") because this has several improvements over the O.P.'s original code, and works fine in many cases, but is not perfect for all cases. For tables of at least a certain size (which varies due to several factors so I can't be more specific), performance will degrade as there are fewer rows to fix if either:

  1. there is no index to support the query, or
  2. there is an index, but at least one column in the WHERE clause is a string data type that does not use a binary collation, hence a COLLATE clause is added to the query here to force the binary collation, and doing so invalidates the index (for this particular query).

This is the situation that @mikesigs encountered, thus requiring a different approach. The updated method copies the IDs for all rows to be updated into a temporary table, then uses that temp table to INNER JOIN to the table being updated on the clustered index key column(s). (It's important to capture and join on the clustered index columns, whether or not those are the primary key columns!).

Please see @mikesigs answer below for details. The approach shown in that answer is a very effective pattern that I have used myself on many occasions. The only changes I would make are:

  1. Explicitly create the #targetIds table rather than using SELECT INTO...
  2. For the #targetIds table, declare a clustered primary key on the column(s).
  3. For the #batchIds table, declare a clustered primary key on the column(s).
  4. For inserting into #targetIds, use INSERT INTO #targetIds (column_name(s)) SELECT and remove the ORDER BY as it's unnecessary.

So, if you don't have an index that can be used for this operation, and can't temporarily create one that will actually work (a filtered index might work, depending on your WHERE clause for the UPDATE query), then try the approach shown in @mikesigs answer (and if you use that solution, please up-vote it).
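
For illustration only (using the TheTable / Id / Foo names from @mikesigs answer below), those suggested changes amount to something like:

-- explicitly create both temp tables with clustered primary keys
CREATE TABLE #targetIds (Id UNIQUEIDENTIFIER NOT NULL PRIMARY KEY CLUSTERED);
CREATE TABLE #batchIds  (Id UNIQUEIDENTIFIER NOT NULL PRIMARY KEY CLUSTERED);

-- populate #targetIds without SELECT INTO and without an ORDER BY
INSERT INTO #targetIds (Id)
SELECT Id
FROM   TheTable
WHERE  Foo IS NULL;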

Solomon Rutzky
  • What if I am not updating one single column but let's say 10 columns? Do I have to compare the values for all the columns? What would be the most performant approach? – asemprini87 Feb 14 '18 at 20:22
  • @asemprini87 Compare as much as possible to reduce unnecessary updates as they take longer and grow the log file more. I just updated my answer to include forcing a binary Collation for the `Value` filter, but you can use that `COLLATE Latin1_General_100_BIN2` on any other string column to speed up string matching, assuming that you are only looking for exact matches and don't need to allow for casing differences, etc. I was thinking of mentioning creating a filtered index for this operation, but it might not be faster if it has to be updated for each batch. Might be worth testing, though. – Solomon Rutzky Feb 14 '18 at 20:34
  • @SolomonRutzky: Say that while updating millions of records, a few records fail. How would you log the records that fail to update? For example, out of a million records, some update successfully, then the very next five records fail to update, and the rest update successfully. How would you log those 5 failed records out of the million? – MiniSu Mar 31 '19 at 08:54
  • @RockingDev To detect individual error rows, you would need to add additional logic within the loop to process smaller batch sizes until you got to the row that errored, then I guess throw that key value into a temp table, then reset the batch size back to the original value. This requires creating the temp table first, and then adding it to the `WHERE` clause via something like: `AND NOT EXISTS (SELECT * FROM #ErrorRows err WHERE err.KeyVal = tab.ClusteredIndexKey)`. Something like that. When the process is done, any rows in #ErrorRows need to be dealt with. – Solomon Rutzky Apr 05 '19 at 21:37
  • @SolomonRutzky and @mikesigs, thanks for the answers; they do indeed speed up the update queries, so much so that I can't help wondering: why can't SQL Server make these optimizations internally when we use a simple UPDATE query? Maybe not the batch optimization, but the temporary table optimization seems easy to apply at first glance. Do you have any idea, or am I missing something? – kitsiosk Sep 16 '22 at 08:28
  • 1
    @kitsiosk My guess is that breaking the operation into separate steps to include a temp table to do the main operation by clustered key is a bit too complex and could introduce anomalies as they aren't exactly the same operation, even if very similar. And would be even more different if then batching it as that would require creating and populating the temp table for each `UPDATE` instead of once, which diminishes some of the benefit of that approach. – Solomon Rutzky Sep 16 '22 at 16:47
  • @SolomonRutzky "could introduce anomalies as they aren't exactly the same operation" can you please mention one such example? I'm really curious to see where the proposed approach fails but the plain UPDATE succeeds – kitsiosk Sep 19 '22 at 05:56
  • @kitsiosk Well, I guess it depends on whether a regular `UPDATE` is a sensitive operation or not. Meaning, if rows that qualify based on `WHERE` / `JOIN` clauses at the beginning get updated such that they no longer qualify by the time the row is physically being updated, will it still update? If the keys are copied to a temp table then the operation is insensitive. I don't have a test for this but assumed that the operation would be sensitive by default, just like with cursors. – Solomon Rutzky Sep 27 '22 at 20:35
32
WHILE EXISTS (SELECT * FROM TableName WHERE Value <> 'abc1' AND Parameter1 = 'abc' AND Parameter2 = 123)
BEGIN
    UPDATE TOP (1000) TableName
    SET Value = 'abc1'
    WHERE Parameter1 = 'abc' AND Parameter2 = 123 AND Value <> 'abc1'
END
msitt
Kramb
  • I've updated this from 1000 to 4000 and it seems to be working fine so far. In one table I'm updating 5 million records (it seems to be updating about 744,000 records every 10 mins). I'm running this on a dev server and will be trying to update 26 million records next. Currently looking to see if there is a way to speed the process up by "multi-threading". – PHBeagle Apr 09 '17 at 19:09
  • 2
    This is inefficient as the exists check is redundant. Instead you can retrieve the @@ROWCOUNT after running UPDATE and if the @@ROWCOUNT <> BatchSize then you are done and can exit the loop. – Shiv Mar 07 '19 at 23:05
  • @Shiv `@@ROWCOUNT` is a global variable. Statements such as USE, SET – Kramb Mar 08 '19 at 14:13
  • 2
    @Kramb I know - you can cache the rowcount as per answer https://stackoverflow.com/a/55054293/1519839 – Shiv Mar 10 '19 at 23:44
19

I encountered this thread yesterday and wrote a script based on the accepted answer. It turned out to perform very slowly, taking 12 hours to process 25M of 33M rows. I wound up cancelling it this morning and working with a DBA to improve it.

The DBA pointed out that the IS NULL check in my UPDATE query was causing a Clustered Index Scan on the PK, and it was the scan that was slowing the query down. Basically, the longer the query runs, the further it needs to look through the index for the right rows.

The approach he came up with was obvious in hindsight. Essentially, you load the IDs of the rows you want to update into a temp table, then join that onto the target table in the update statement. This uses an Index Seek instead of a Scan. And boy, does it speed things up! It took 2 minutes to update the last 8M records.

Batching Using a Temp Table

SET NOCOUNT ON

DECLARE @Rows INT,
        @BatchSize INT,
        @Completed INT,
        @Total INT,
        @Message nvarchar(max)

SET @BatchSize = 4000
SET @Rows = @BatchSize
SET @Completed = 0

-- #targetIds table holds the IDs of ALL the rows you want to update
SELECT Id into #targetIds 
FROM TheTable 
WHERE Foo IS NULL 
ORDER BY Id

-- Used for printing out the progress
SELECT @Total = @@ROWCOUNT

-- #batchIds table holds just the records updated in the current batch
CREATE TABLE #batchIds (Id UNIQUEIDENTIFIER);

-- Loop until #targetIds is empty
WHILE EXISTS (SELECT 1 FROM #targetIds)
BEGIN
    -- Remove a batch of rows from the top of #targetIds and put them into #batchIds
    DELETE TOP (@BatchSize)
    FROM #targetIds
    OUTPUT deleted.Id INTO #batchIds  

    -- Update TheTable data
    UPDATE t
    SET Foo = 'bar'
    FROM TheTable t
    JOIN #batchIds tmp ON t.Id = tmp.Id
    WHERE t.Foo IS NULL
    
    -- Get the # of rows updated
    SET @Rows = @@ROWCOUNT

    -- Increment our @Completed counter, for progress display purposes
    SET @Completed = @Completed + @Rows

    -- Print progress using RAISERROR to avoid SQL buffering issue
    SELECT @Message = 'Completed ' + cast(@Completed as varchar(10)) + '/' + cast(@Total as varchar(10))
    RAISERROR(@Message, 0, 1) WITH NOWAIT    

    -- Quick operation to delete all the rows from our batch table
    TRUNCATE TABLE #batchIds;
END

-- Clean up
DROP TABLE IF EXISTS #batchIds;
DROP TABLE IF EXISTS #targetIds;

Batching the slow way, do not use!

For reference, here is the original slower performing query:

SET NOCOUNT ON

DECLARE @Rows INT,
        @BatchSize INT,
        @Completed INT,
        @Total INT

SET @BatchSize = 4000
SET @Rows = @BatchSize
SET @Completed = 0
SELECT @Total = COUNT(*) FROM TheTable WHERE Foo IS NULL

WHILE (@Rows = @BatchSize)
BEGIN

    UPDATE t
    SET Foo = 'bar'
    FROM TheTable t
    JOIN #batchIds tmp ON t.Id = tmp.Id
    WHERE t.Foo IS NULL

    SET @Rows = @@ROWCOUNT
    SET @Completed = @Completed + @Rows
    PRINT 'Completed ' + cast(@Completed as varchar(10)) + '/' + cast(@Total as varchar(10))

END

m0nhawk
mikesigs
  • 2
    +1 I agree that this is a very effective approach and that people should try it out. I updated my answer to direct readers here if they encounter performance issues like you did with my approach. I've used this pattern before, so maybe I was focused on addressing the flaws in the O.P.'s approach. I did have a few recommendations on minor changes to your approach which I detail at the end of my answer. One thing I didn't mention was simplifying the output, which you can do via: `DECLARE @Completed INT = 5, @Total INT = 37; RAISERROR('Completed %d / %d', 10, 1, @Completed, @Total) WITH NOWAIT;` – Solomon Rutzky Sep 17 '21 at 23:24
  • the update statements in the original slow version and the newer version are identical + they both contain the "IS NULL" check which you said was the problem. why is one slower than the other? Also, i don't see your #batchIds table being created in the second (slower) version. Was that supposed to be a reference to some other table? – Tyler Jones Jul 11 '22 at 23:18
  • I've been working with SQL for only a few months now, so I'm not very experienced yet. However, instead of doing a join on the temp table, we could do a subquery against the temp table on the indexed id. Then I believe the solution should be O(nlogm) instead of O(2*nlogm) because you wouldn't need to rescan the table after the join occurs. So instead of a join it would look something like: `UPDATE t SET Foo = 'bar' FROM TheTable t WHERE t.Foo IS NULL AND EXISTS (SELECT id FROM #batchIds tmp WHERE t.id = tmp.id)` – Anon Aug 05 '22 at 20:33
6

I want to share my experience. A few days ago I had to update 21 million records in a table with 76 million records. My colleague suggested the following approach. For example, suppose we have the following table, 'Persons':

Id | FirstName | LastName | Email            | JobTitle
1  | John      |  Doe     | abc1@abc.com     | Software Developer
2  | John1     |  Doe1    | abc2@abc.com     | Software Developer
3  | John2     |  Doe2    | abc3@abc.com     | Web Designer

Task: Update persons to the new Job Title: 'Software Developer' -> 'Web Developer'.

1. Create a temporary table 'Persons_SoftwareDeveloper_To_WebDeveloper (Id INT PRIMARY KEY)'.

2. Insert into the temporary table the persons that you want to update with the new job title:

INSERT INTO Persons_SoftwareDeveloper_To_WebDeveloper SELECT Id FROM
Persons WITH(NOLOCK) --avoid lock 
WHERE JobTitle = 'Software Developer' 
OPTION(MAXDOP 1) -- use only one core

Depending on the row count, this statement will take some time to fill your temporary table, but it avoids locks. In my situation it took about 5 minutes (21 million rows).

3. The main idea is to generate small SQL statements to update the database. So, let's print them:

DECLARE @i INT, @pagesize INT, @totalPersons INT
SET @i = 0
SET @pagesize = 2000
SELECT @totalPersons = MAX(Id) FROM Persons

WHILE @i <= @totalPersons
BEGIN
    PRINT '
    UPDATE persons
      SET persons.JobTitle = ''Web Developer''
      FROM  Persons_SoftwareDeveloper_To_WebDeveloper tmp
      JOIN Persons persons ON tmp.Id = persons.Id
      WHERE persons.Id BETWEEN ' + cast(@i as varchar(20)) + ' AND ' + cast(@i + @pagesize as varchar(20)) + '
    PRINT ''Page ' + cast((@i / @pagesize) as varchar(20)) + ' of ' + cast(@totalPersons / @pagesize as varchar(20)) + '''
    GO
    '
    SET @i = @i + @pagesize
END

After executing this script you will receive hundreds of batches which you can execute in one tab of MS SQL Management Studio.
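
For illustration only, the first printed batch would look roughly like this (the '38000' page count is an assumed value, corresponding to a MAX(Id) of about 76 million with a page size of 2000):

UPDATE persons
  SET persons.JobTitle = 'Web Developer'
  FROM  Persons_SoftwareDeveloper_To_WebDeveloper tmp
  JOIN Persons persons ON tmp.Id = persons.Id
  WHERE persons.Id BETWEEN 0 AND 2000
PRINT 'Page 0 of 38000'
GO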

4. Run the printed SQL statements and check for locks on the table. You can always stop the process and play with @pagesize to speed up or slow down the updates (don't forget to change @i if you pause the script).

5. Drop Persons_SoftwareDeveloper_To_WebDeveloper, i.e. remove the temporary table.

Minor note: This migration could take some time, and new rows with invalid data could be inserted during the migration. So, first fix the places where your rows are added. In my situation I fixed the UI: 'Software Developer' -> 'Web Developer'. More about this method on my blog: https://yarkul.com/how-smoothly-insert-millions-of-rows-in-sql-server/

Yara
5

This is a more efficient version of the solution from @Kramb. The existence check is redundant, as the UPDATE's WHERE clause already handles it. Instead, you just grab the row count and compare it to the batch size.

Also note that @Kramb's solution didn't filter already-updated rows out of the next iteration, hence it would be an infinite loop.

It also uses the modern TOP (n) batch-size syntax instead of SET ROWCOUNT.

DECLARE @batchSize INT, @rowsUpdated INT
SET @batchSize = 1000;
SET @rowsUpdated = @batchSize; -- Initialise for the while loop entry

WHILE (@batchSize = @rowsUpdated)
BEGIN
    UPDATE TOP (@batchSize) TableName
    SET Value = 'abc1'
    WHERE Parameter1 = 'abc' AND Parameter2 = 123 and Value <> 'abc1';

    SET @rowsUpdated = @@ROWCOUNT;
END
Shiv
  • The loop wouldn't be infinite because it is updating the result set with the same parameters it is filtering with. Therefore, the following result sets would not include rows that were previously updated. – Kramb Mar 12 '19 at 19:21
  • @Kramb You were updating Value and looking at Parameter1 and Parameter2. So no you were not filtering on the field you were actually updating. I added the third filter term to check for Value <> 'abc1' which you were missing in your answer. – Shiv Mar 13 '19 at 03:58
  • Try again...My answer clearly shows that the very first condition in my `Exists` query is, in fact, the check to determine if `Value` equals the value the inner query is setting `Value` to. – Kramb Mar 15 '19 at 20:54
  • 2
    @Kramb if the top 1000 rows in your update query have the value already set but rows outside the top 1000 do not have value set, your solution will INFINITE LOOP. Your solution is flawed because you are missing the where check on the actual UPDATE call. The exists check is not the problem. – Shiv Mar 18 '19 at 05:34
2

Your PRINT is messing things up because it resets @@ROWCOUNT. Whenever you use @@ROWCOUNT, my advice is to set it to a variable immediately. So:

DECLARE @RC int;
WHILE @RC > 0 or @RC IS NULL
    BEGIN
        SET rowcount 5;

        UPDATE TableName
            SET Value  = 'abc1'
            WHERE Parameter1  = 'abc' AND Parameter2  = 123 AND Value <> 'abc1';

        SET @RC = @@ROWCOUNT;
        PRINT(@@ROWCOUNT)
    END;

SET rowcount = 0;

And, another nice feature is that you don't need to repeat the update code.

Gordon Linoff
  • 1
    See the link I posted above about using ROWCOUNT to control rows for updates. – Sean Lange Mar 09 '16 at 22:00
  • 1
    @Gordon I used this logic and the query has been running for 2 mins (for only 14 records!!!). It goes into an infinite loop. – GThree Mar 09 '16 at 22:05
  • @CSharper . . . Hmmm, if no rows are updated, then `@@ROWCOUNT` should be 0, not NULL. The reason for an infinite loop is not obvious. What is the "print" producing? If the `update` is producing `NULL`, that can be fixed by setting `@RC` to some arbitrary value and then removing the `@RC IS NULL` condition from the `WHILE`. – Gordon Linoff Mar 09 '16 at 22:25
  • @GordonLinoff print produces `(5 row(s) affected) 1` infinitely. – GThree Mar 10 '16 at 14:13
  • 1
    @CSharper . . . The `where` clause needs to exclude already updated rows. – Gordon Linoff Mar 11 '16 at 02:08
  • The last line should be `SET rowcount 0` without the equals sign – NibblyPig Nov 21 '17 at 15:13
-2

First of all, thank you all for your input. I tweaked my Query - 1 and got my desired result. Gordon Linoff is right: PRINT was messing up my query, so I modified it as follows:

Modified Query - 1:

SET ROWCOUNT 5
WHILE (1 = 1)
  BEGIN
    BEGIN TRANSACTION

        UPDATE TableName 
        SET Value = 'abc1' 
        WHERE Parameter1 = 'abc' AND Parameter2 = 123

        IF @@ROWCOUNT = 0
          BEGIN
                COMMIT TRANSACTION
                BREAK
          END
    COMMIT TRANSACTION
  END
SET ROWCOUNT  0

Output:

(5 row(s) affected)

(5 row(s) affected)

(4 row(s) affected)

(0 row(s) affected)
GThree