
I am writing a query to update a large table (over 6 billion rows). Unfortunately, inserting into a new table is not an option due to disk storage constraints.

I have done performance testing with both of the answers to the question asked here, and the second answer seems to be the quickest option.

I've tested the query performance with 1, 2, 3, 5 and 10 million rows. Performance degrades drastically as I add more rows to the update.

Here's a sample of my update query:

DECLARE @Rows INT, @BatchSize INT;
SET @BatchSize = 4500;

SET @Rows = @BatchSize; 

WHILE (@Rows = @BatchSize)
BEGIN
    UPDATE TOP (@BatchSize) [dbo].[TestTable]
    SET MyID1 = MyID2
    FROM  [dbo].[TestTable]
    WHERE MyID1 <> MyID2

    SET @Rows = @@ROWCOUNT;
END

Here's the time to complete and rows updated per second:

  • 1 Million Rows: 43 seconds to complete, 23255 rows/second
  • 2 Million Rows: 168 seconds to complete, 11904 rows/second
  • 3 Million Rows: 366 seconds to complete, 8196 rows/second
  • 5 Million Rows: 1098 seconds to complete, 4553 rows/second
  • 10 Million Rows: 3922 seconds to complete, 2549 rows/second

I would assume that I could average out the rows per second and get an estimate of the time to completion, but the time to complete seems to grow exponentially. For example, doubling the time it took to complete 5 million rows suggests 10 million rows should finish in around 2,200 seconds, but it actually takes 3,922 seconds.

Testing multiple times returns similar results, so I don't think I'm dealing with a contention issue. Plus, this is on a test server that isn't otherwise in use. What am I missing here, and how do I accurately estimate the time to complete this update?

tonyd
  • what is the recovery mode of your database? what is the table schema? – Hiten004 Jun 07 '18 at 16:12
  • Maybe stupid comment, but have you tried an approach with a CTE, adding an ORDER BY on an indexed column to your SELECT? https://stackoverflow.com/a/11562724/1236044 – jbl Jun 07 '18 at 16:20
  • This is going to be hard to estimate. Indexes clearly play a large role here. You can speed up the update by dropping them and then recreating them after. Now, the rebuild is going to take some time, so you'll want to plan accordingly for that. Every time your batch loops it has to account for the where condition... hence your exponential growth. – S3S Jun 07 '18 at 16:29
  • Batching performance will be progressively slower due to only the `<>` predicate. Consider updating in clustered index key ranges for the batching instead of TOP. – Dan Guzman Jun 07 '18 at 16:50
  • @Hiten004 Simple Recovery. Table schema has 12 columns. MyID1 is a varchar(24), MyID2 is a varchar(15). – tonyd Jun 07 '18 at 17:03
  • @scsimon Yes, I had done some previous testing with and without indexes and no indexes was definitely faster. I hadn't thought about the predicate causing exponential growth. I'll see if I can find another way without it. – tonyd Jun 07 '18 at 17:07
  • @DanGuzman Okay, I will give that a try. – tonyd Jun 07 '18 at 17:16
  • Does the table have a unique key of any kind? – pmbAustin Jun 07 '18 at 17:19
  • @jbl So I've done testing with non-clustered indexes on MyID1 and MyID2. This actually slows down performance because it is updating the index each time. Going to try to use a clustered index and batch it with this column. – tonyd Jun 07 '18 at 17:24
  • @pmbAustin Yes, it does. Per Dan Guzman's suggestion, I'm working on a new query to use this unique column with a clustered index. – tonyd Jun 07 '18 at 17:31
  • One thing to consider is to perhaps use a filtered index (index where MyID1 <> MyID2) and then fill a temp table with the rows to be updated, and then update by joining with the temp table (after indexing the temp table). Breaking it up this way means that the indexes don't interfere with the updates as the query is being executed. This can typically make things go much faster. – pmbAustin Jun 07 '18 at 19:40
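
A rough sketch of the temp-table part of that last suggestion, assuming the table's unique key is the BatchKey column used in the answer below; the temp-table and index names are illustrative. (A SQL Server filtered index can't use a column-to-column predicate such as MyID1 <> MyID2, so only the temp-table step is sketched.)

-- Capture the keys of the rows that still need updating, once
SELECT BatchKey
INTO #ToUpdate
FROM dbo.TestTable
WHERE MyID1 <> MyID2;

CREATE CLUSTERED INDEX IX_ToUpdate ON #ToUpdate (BatchKey);

-- Update by joining to the temp table so the scan for changed rows
-- is not repeated on every batch iteration
UPDATE t
SET t.MyID1 = t.MyID2
FROM dbo.TestTable AS t
INNER JOIN #ToUpdate AS u
    ON u.BatchKey = t.BatchKey;

This could still be combined with batching by restricting the join to a BatchKey range on each iteration.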

1 Answer


I was able to come up with an efficient update query. The query below updates 5 million rows in 15 seconds and 10 million rows in 30 seconds.

I created a clustered index on the BatchKey column, which is an integer identity column, and then wrote the query to loop through the rows based on the minimum and maximum BatchKey.
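
For reference, creating such an index looks roughly like this (the index name is illustrative; since BatchKey is an identity column it could also be declared UNIQUE):

CREATE CLUSTERED INDEX IX_TestTable_BatchKey
    ON dbo.TestTable (BatchKey);

The batching loop: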

DECLARE @MaxKey BIGINT, @MinKey BIGINT, @iMax BIGINT, @iMin BIGINT

SELECT 
    @MinKey = MIN(BatchKey), 
    @MaxKey = MAX(BatchKey)
FROM dbo.TestTable

SET @iMin = @MinKey         -- starting minimum key
SET @iMax = @MinKey + 4500  -- starting maximum key for batch

WHILE (@iMin < @MaxKey) 
BEGIN
    UPDATE dbo.TestTable
    SET MyID1 = MyID2
    FROM  dbo.TestTable
    WHERE BatchKey >= @iMin
    AND BatchKey <= @iMax

    SET @iMin = @iMax + 1       -- get next minimum key
    SET @iMax = @iMin + 4500    -- get next maximum batch key

END
tonyd