I am writing a query to update a large table (over 6 billion rows). Unfortunately, inserting into a new table is not an option due to disk storage constraints.
I have done performance testing with both of the answers to the question asked here, and the second answer seems to be the quickest option.
I've tested the query performance with 1, 2, 3, 5 and 10 million rows. Performance degrades drastically the more rows I add to the update.
Here's a sample of my update query:
DECLARE @Rows INT, @BatchSize INT;
SET @BatchSize = 4500;
SET @Rows = @BatchSize;

-- Keep updating in batches until a batch affects fewer rows than the
-- batch size, i.e. no qualifying rows remain.
WHILE (@Rows = @BatchSize)
BEGIN
    UPDATE TOP (@BatchSize) [dbo].[TestTable]
    SET MyID1 = MyID2
    FROM [dbo].[TestTable]
    WHERE MyID1 <> MyID2;

    SET @Rows = @@ROWCOUNT;
END
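
For context, here's a minimal sketch of the kind of test table I've been running this against. It's only a simplified stand-in: the INT column types, the lack of a clustered index, and the row generator are assumptions made just so the test is reproducible.

-- Simplified test table: MyID1 starts out different from MyID2 for every row.
CREATE TABLE [dbo].[TestTable]
(
    MyID1 INT NOT NULL,
    MyID2 INT NOT NULL
);

-- Load the test rows (change TOP to 2, 3, 5 or 10 million as needed;
-- add another CROSS JOIN if the catalog views don't yield enough rows).
INSERT INTO [dbo].[TestTable] (MyID1, MyID2)
SELECT TOP (1000000)
       0,
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;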
Here's the time to complete and rows updated per second:
- 1 million rows: 43 seconds to complete, 23,255 rows/second
- 2 million rows: 168 seconds to complete, 11,904 rows/second
- 3 million rows: 366 seconds to complete, 8,196 rows/second
- 5 million rows: 1,098 seconds to complete, 4,553 rows/second
- 10 million rows: 3,922 seconds to complete, 2,549 rows/second
I assumed I could average out the rows per second and get an estimate of the time to completion, but the time to complete grows much faster than linearly. For example, doubling the time it took to complete 5 million rows suggests that 10 million should finish in around 2,200 seconds, yet it actually takes 3,922 seconds.
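To make the non-linearity concrete, here's the same extrapolation written out as a throwaway query. The test numbers are the ones from my results above, the 6 billion figure is just the full table size, and the last column is the naive "average rows per second" estimate, which clearly isn't stable across test sizes.

-- Naive extrapolation: hours to update 6 billion rows if the observed
-- rows/second from each test held for the whole table.
SELECT TestRows,
       Seconds,
       CAST(TestRows AS float) / Seconds                AS RowsPerSecond,
       6e9 / (CAST(TestRows AS float) / Seconds) / 3600 AS NaiveHoursFor6Billion
FROM (VALUES
        (1000000,    43),
        (2000000,   168),
        (3000000,   366),
        (5000000,  1098),
        (10000000, 3922)
     ) AS t (TestRows, Seconds);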
Testing multiple times returns similar results, so I don't think I'm dealing with a contention issue. Plus, this is on a test server that isn't used for anything else. What am I missing here, and how can I accurately estimate the time to complete this update?