51

I need to initialize a new field with the value -1 in a 120 Million record table.

Update table
       set int_field = -1;

I let it run for 5 hours before canceling it.

I tried running it with the transaction isolation level set to READ UNCOMMITTED, with the same results.

Recovery Model = Simple.
MS SQL Server 2005

Any advice on getting this done faster?

Ale
Bob Probst
  • The problem is IO based. Remus addressed this issue here: http://stackoverflow.com/questions/3523831/update-statement-running-for-too-long-or-not/3523903#3523903 – OMG Ponies Sep 14 '10 at 17:17
  • I was worried that it may be IO based. Sadly, not much I can do about that. Would smaller incremental batches run faster? – Bob Probst Sep 14 '10 at 17:19
  • Nope. The overall data volume will be the same: updated pages and log pages. It's still 120M rows, and those rows effectively get written twice (once to the log, and once to the backing store). So the time should be proportional to the overall size of your table. If it's a table with a single 4 byte int, that's roughly 1 GB of data to write (120M * 4 bytes = 480 MB, * 2 = 960 MB). So, how big is the overall table? – Will Hartung Sep 14 '10 at 17:58
  • @Will Hartung, I strongly disagree. – Denis Valeev Sep 14 '10 at 18:00
  • Can you post the schema of the table? I'm most curious about the number of columns, data type, and size of each. – 8kb Sep 14 '10 at 20:01
  • Re: `Processing times began to tank (10 minutes per batch) after the first couple dozen.` Just follow my advice. – Denis Valeev Sep 15 '10 at 08:33

11 Answers

41

The only sane way to update a table of 120M records is with a SELECT statement that populates a second table. You have to take care when doing this. Instructions below.


Simple Case

For a table w/out a clustered index, during a time w/out concurrent DML:

  • SELECT *, new_col = -1 INTO clone.BaseTable FROM dbo.BaseTable
  • recreate indexes, constraints, etc on new table
  • switch old and new w/ ALTER SCHEMA ... TRANSFER.
  • drop old table

If you can't create a clone schema, a different table name in the same schema will do. Remember to rename all your constraints and triggers (if applicable) after the switch.
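
A minimal sketch of those four steps, assuming clone and backup_20100914 schemas that already exist (the schema names and new_col mirror the bullets above; adjust to your own names):

SELECT *, new_col = CAST(-1 AS INT)
INTO clone.BaseTable
FROM dbo.BaseTable
GO
-- recreate indexes, constraints, etc. on clone.BaseTable here
GO
-- swap old and new by transferring them between schemas
ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
GO
ALTER SCHEMA dbo TRANSFER clone.BaseTable
GO
-- once the new table checks out, drop the old one
DROP TABLE backup_20100914.BaseTable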


Non-simple Case

First, recreate your BaseTable with the same name under a different schema, eg clone.BaseTable. Using a separate schema will simplify the rename process later.

  • Include the clustered index, if applicable. Remember that primary keys and unique constraints may be clustered, but not necessarily so.
  • Include identity columns and computed columns, if applicable.
  • Include your new INT column, wherever it belongs.
  • Do not include any of the following:
    • triggers
    • foreign key constraints
    • non-clustered indexes/primary keys/unique constraints
    • check constraints or default constraints. Defaults don't make much of a difference, but we're trying to keep things minimal.
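
As a rough illustration only (the clone schema and the Col1/Col2/Col3 columns are assumptions chosen to match the test insert below; mirror your real table definition, including its clustered index):

CREATE SCHEMA clone AUTHORIZATION dbo
GO
CREATE TABLE clone.BaseTable
(
    Col1 INT IDENTITY(1,1) NOT NULL,  -- identity column carried over from dbo.BaseTable
    Col2 VARCHAR(50)       NOT NULL,  -- existing column; use the real data type
    Col3 INT               NOT NULL,  -- the new INT column, wherever it belongs
    CONSTRAINT PK_clone_BaseTable PRIMARY KEY CLUSTERED (Col1)  -- clustered index only, nothing else yet
)
GO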

Then, test your insert w/ 1000 rows:

-- assuming an IDENTITY column in BaseTable
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT TOP 1000 Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF

Examine the results. If everything appears in order:

  • truncate the clone table
  • make sure the database is in bulk-logged or simple recovery model
  • perform the full insert.
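
A sketch of that full load, reusing the hypothetical Col1/Col2/Col3 names from the test batch above and assuming the database is called MyDb (the question already uses the simple recovery model, so the recovery-model step may be unnecessary):

TRUNCATE TABLE clone.BaseTable
GO
-- only needed if the database is currently in the FULL recovery model
-- ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED
-- GO
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF
GO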

This will take a while, but not nearly as long as an update. Once it completes, check the data in the clone table to make sure everything is correct.

Then, recreate all non-clustered primary keys/unique constraints/indexes and foreign key constraints (in that order). Recreate default and check constraints, if applicable. Recreate all triggers. Recreate each constraint, index or trigger in a separate batch. eg:

ALTER TABLE clone.BaseTable ADD CONSTRAINT UQ_BaseTable UNIQUE (Col2)
GO
-- next constraint/index/trigger definition here

Finally, move dbo.BaseTable to a backup schema and clone.BaseTable to the dbo schema (or wherever your table is supposed to live).

-- -- perform first true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
-- GO
-- -- create a backup schema, if necessary
-- CREATE SCHEMA backup_20100914
-- GO
BEGIN TRY
  BEGIN TRANSACTION
  ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
  -- -- perform second true-up operation here, if necessary
  -- EXEC clone.BaseTable_TrueUp
  ALTER SCHEMA dbo TRANSFER clone.BaseTable
  COMMIT TRANSACTION
END TRY
BEGIN CATCH
  SELECT ERROR_MESSAGE() -- add more info here if necessary
  ROLLBACK TRANSACTION
END CATCH
GO

If you need to free up disk space, you may drop your original table at this time, though it may be prudent to keep it around a while longer.

Needless to say, this is ideally an offline operation. If you have people modifying data while you perform this operation, you will have to perform a true-up operation with the schema switch. I recommend creating a trigger on dbo.BaseTable to log all DML to a separate table. Enable this trigger before you start the insert. Then in the same transaction that you perform the schema transfer, use the log table to perform a true-up. Test this first on a subset of the data! Deltas are easy to screw up.
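
For example, a bare-bones version of that logging trigger might look like the sketch below. The Id key column, the log table and the trigger name are all hypothetical; capture whatever columns your true-up procedure actually needs.

CREATE TABLE dbo.BaseTable_DmlLog
(
    LogId     INT IDENTITY(1,1) PRIMARY KEY,
    Operation CHAR(1)  NOT NULL,               -- 'I', 'U' or 'D'
    KeyValue  INT      NOT NULL,               -- the table's key; assumed to be an INT named Id
    LoggedAt  DATETIME NOT NULL DEFAULT GETDATE()
)
GO
CREATE TRIGGER trg_BaseTable_LogDml ON dbo.BaseTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON
    -- rows in inserted are inserts or updates; rows only in deleted are deletes
    INSERT dbo.BaseTable_DmlLog (Operation, KeyValue)
    SELECT CASE WHEN d.Id IS NULL THEN 'I' ELSE 'U' END, i.Id
    FROM inserted i
    LEFT JOIN deleted d ON d.Id = i.Id
    UNION ALL
    SELECT 'D', d.Id
    FROM deleted d
    LEFT JOIN inserted i ON i.Id = d.Id
    WHERE i.Id IS NULL
END
GO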

Peter Radocchia
15

If you have the disk space, you could use SELECT INTO and create a new table. It's minimally logged, so it would go much faster.

select t.*, int_field = CAST(-1 as int)
into mytable_new 
from mytable t

-- create your indexes and constraints

GO

exec sp_rename 'mytable', 'mytable_old'
exec sp_rename 'mytable_new', 'mytable'

drop table mytable_old
Mike Forman
9

I would break the task up into smaller units. Test with different batch sizes for your table until you find an interval that performs optimally. Here is a sample that I have used in the past.

declare @counter int
declare @numOfRecords int
declare @batchsize int

set @numOfRecords = (SELECT COUNT(*) AS NumberOfRecords FROM <TABLE> with(nolock))
set @counter = 0
set @batchsize = 2500

set rowcount @batchsize
while @counter < (@numOfRecords/@batchsize) + 1
begin
    set @counter = @counter + 1
    -- the OR IS NULL check matters if the new column was added as nullable
    update <TABLE> set int_field = -1 where int_field <> -1 or int_field is null;
end
set rowcount 0
K.C.
4

If your int_field is indexed, remove the index before running the update. Then create your index again...

5 hours seems like a lot for 120 million recs.
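
If you'd rather not re-script the index definition, disabling and rebuilding it is an alternative to dropping and recreating (the index and table names below are made up):

-- disable the nonclustered index on int_field before the update
-- (do not disable the clustered index; that makes the table inaccessible)
ALTER INDEX IX_MyTable_int_field ON dbo.MyTable DISABLE
GO
-- ... run the UPDATE here ...
GO
-- rebuilding the index re-enables it
ALTER INDEX IX_MyTable_int_field ON dbo.MyTable REBUILD
GO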

Pablo Santa Cruz
3
declare @cnt bigint
set @cnt = 1

while @cnt * 100 < 10000000
begin

    UPDATE top(100) [Imp].[dbo].[tablename]
    SET [col1] = xxxx
    WHERE [col1] is null

    print '@cnt: ' + convert(varchar, @cnt)
    set @cnt = @cnt + 1
end
SQL Police
aads
3

What I'd try first is to drop all constraints, indexes, triggers and full text indexes before you update.

If the above isn't performant enough, my next move would be to create a CSV file with the 120 million records and bulk import it using bcp.

Lastly, I'd create a new heap table (meaning a table with no primary key) with no indexes on a different filegroup, populate it with -1, partition the old table, and add the new partition using "switch".
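
bcp itself runs from the command line; as a rough T-SQL-side illustration of the same load, the exported file could be pulled in with BULK INSERT (the file path, target table and terminators below are assumptions):

BULK INSERT dbo.BaseTable_new
FROM 'D:\export\basetable.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,                 -- table lock helps keep logging to a minimum
    BATCHSIZE       = 50000  -- commit in chunks rather than one giant transaction
)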

dance2die
3
set rowcount 1000000
Update table set int_field = -1 where int_field <> -1 or int_field is null

see how long that takes, adjust and repeat as necessary
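
One way to wire up the repeat, sticking with SET ROWCOUNT since this is SQL Server 2005 (MyTable stands in for your table; the IS NULL check assumes the new column was added as nullable):

set rowcount 1000000
while 1 = 1
begin
    update MyTable set int_field = -1 where int_field <> -1 or int_field is null
    if @@rowcount = 0 break   -- nothing left to update
end
set rowcount 0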

BlackTigerX
3

When adding a new column ("initialize a new field") and setting a single value to each existing row, I use the following tactic:

ALTER TABLE MyTable
 add NewColumn  int  not null
  constraint MyTable_TemporaryDefault
   default -1

ALTER TABLE MyTable
 drop constraint MyTable_TemporaryDefault

If the column is nullable and you don't include a declared default constraint, the column will be set to NULL for all rows.

Philip Kelley
  • do you know if this has the performance profile of an `UPDATE` statement, a bulk logged `INSERT` statement, or something else? – Peter Radocchia Sep 15 '10 at 15:13
  • I suspect "something else". I've never done this on truly large tables, but it tends to run *very* quickly on moderate ones. Regardless, when you have to add a column to an existing table, you *have* to pay an "all or nothing" price, as you can't add it to just one batch of rows at a time. (Flip side is INSERT....SELECT... for N rows at a time, but if you're doing that you'll probably have to schedule some downtime, at which point locking, blocking, and timing shouldn't be quite so critical.) – Philip Kelley Sep 15 '10 at 15:39
  • I've tried this on a table with > 380 million rows and it looks like it is updating the table in the same way and taking a similar amount of time. – Otake Jun 24 '15 at 09:52
1

Sounds like an indexing problem, like Pablo Santa Cruz mentioned. Since your update is not conditional, you can DROP the column and RE-ADD it with a DEFAULT value.

Brad
  • Maybe I'm doing something wrong but my experience with Defaults is that they are not set when a column is created but only when a record is inserted. – Bob Probst Sep 14 '10 at 17:33
  • You're probably right (I've never actually done this). Perhaps if you disallowed NULLs on the column, it would get the value from the default value constraint (since it's added inline, it would have an auto-generated name). – Brad Sep 14 '10 at 17:49
  • @TrickyNixon: in fact, Brad's suggestion will work, if you recreate the field as not null and declare a default constraint in the same statement: `ALTER TABLE TableName ADD int_field INT NOT NULL CONSTRAINT DF_TableName_int_field DEFAULT (1)`. Whether this has the performance profile of an UPDATE statement, an INSERT statement, or something else, I do not know. When in doubt, create a test table w/ a subset of the data and try it there. – Peter Radocchia Sep 15 '10 at 15:11
1

In general, the recommendations are:

  1. Remove or just Disable all INDEXES, TRIGGERS, CONSTRAINTS on the table;
  2. Perform COMMIT more often (e.g. after each 1000 records that were updated);
  3. Use select ... into.

But in your particular case you should choose the most appropriate solution, or a combination of them.

Also bear in mind that sometimes an index can be useful, e.g. when you update a non-indexed column filtered by some condition.

alexber
0

If the table has an index which you can iterate over, I would put an UPDATE TOP(10000) statement in a while loop moving over the data. That would keep the transaction log slim and won't have such a huge impact on the disk system. Also, I would recommend playing with the maxdop option (setting it closer to 1).
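
A sketch of that idea, walking the clustered key in fixed-size ranges instead of relying on TOP (the integer key column Id and the table name are assumptions):

declare @minId bigint, @maxId bigint, @batch bigint
select @minId = min(Id), @maxId = max(Id) from dbo.MyTable
set @batch = 10000

while @minId <= @maxId
begin
    update dbo.MyTable
    set int_field = -1
    where Id >= @minId and Id < @minId + @batch
    option (maxdop 1)        -- limit parallelism, as suggested above

    set @minId = @minId + @batch
end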

Denis Valeev