
I have a table named tb_big with a size of 50 GB. If I create an additional column:

alter TABLE UAT_DM.ai_SUBS_MONTH_CLR ADD segment CHAR(5) not casespecific;

Everything works fine. All values are NULL, so the update runs pretty fast (a bit over a minute):

update tb_big
set segment = case when LT_month <= 4 then '0'
     when LT_month <= 8 then '1'
     when LT_month <= 12 then '2'
     when LT_month <= 17 then '3'
     when LT_month <= 24 then '4'
     when LT_month <= 36 then '5'
     when LT_month <= 56 then '6'
     when LT_month <= 83 then '7'
     when LT_month <= 96 then '8'
     else  '9' end;

Now let's say I want to update the same column a second time:

update tb_big
    set segment = case when LT_month <= 4 then '0'
         when LT_month <= 8 then '1'
         when LT_month <= 12 then '2'
         when LT_month <= 17 then '3'
         when LT_month <= 27 then '42'
         when LT_month <= 36 then '52'
         when LT_month <= 56 then '6'
         when LT_month <= 83 then '7'
         when LT_month <= 96 then '08'
         else  '9' end;

Because of the big table size and some (to me) unexpected TD behaviour, such an update runs as a single transaction, so every updated row is logged to the transient journal, which for reasons I don't understand becomes highly skewed (99.9%+) and takes terabytes of spool. I almost brought the production server to its knees (the TD administrator didn't kill the query and preferred to let it finish, falling back to backups if needed, because rolling it back could have taken a very long time. Is that true?).
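(For reference, per-AMP spool usage can be checked via the standard DBC.DiskSpaceV dictionary view; the query below is only a sketch, with 'UAT_DM' standing in for whichever user the spool is charged to.)

-- Sketch: see how spool is spread across AMPs for one database/user.
SELECT Vproc,
       CurrentSpool,
       PeakSpool
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'UAT_DM'   -- replace with the user running the update
ORDER BY CurrentSpool DESC;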

My question is: how do I update big tables properly? My idea is to drop that column and repeat from the beginning. However, I'm afraid I will get a 'no more spool space' error (see ). Another possible solution might be to create a new empty table and copy everything from the first table except the column to be modified (see the sketch below). But taking 2x the space isn't good practice.
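A rough sketch of one variant of that copy idea, recomputing segment during the copy so no separate UPDATE is needed afterwards (tb_big_new is a made-up name, and the select list follows the DDL shown further down):

-- Sketch only: copy into a fresh table, recomputing segment on the way.
-- This temporarily needs roughly twice the perm space.
CREATE TABLE tb_big_new AS tb_big WITH NO DATA;

INSERT INTO tb_big_new
SELECT CUST_ID, LT_month, days_to_LF, REV_COM, device_type,
       usg_qq, usg_dd, report_mnth, MACN_ID,
       -- if the real table also has a SUBS_ID column (the UPI references one), include it too
       CASE WHEN LT_month <= 4  THEN '0'
            WHEN LT_month <= 8  THEN '1'
            WHEN LT_month <= 12 THEN '2'
            WHEN LT_month <= 17 THEN '3'
            WHEN LT_month <= 27 THEN '42'
            WHEN LT_month <= 36 THEN '52'
            WHEN LT_month <= 56 THEN '6'
            WHEN LT_month <= 83 THEN '7'
            WHEN LT_month <= 96 THEN '08'
            ELSE '9' END
FROM tb_big;

DROP TABLE tb_big;
RENAME TABLE tb_big_new TO tb_big;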

The recommendation from the TD administrators not to update more than 200k rows at a time sounds ridiculous to me.

DDL of the table, which has 500 million rows and a size of 50 GB:

CREATE MULTISET TABLE UAT_DM.ai_SUBS_MONTH_CLR ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      CUST_ID DECIMAL(12,0),
      LT_month DECIMAL(15,2),
      days_to_LF DECIMAL(15,0),
      REV_COM DECIMAL(18,6),
      device_type VARCHAR(50) CHARACTER SET UNICODE CASESPECIFIC,
      usg_qq DECIMAL(18,0),
      usg_dd DECIMAL(18,6),
      report_mnth CHAR(7) CHARACTER SET UNICODE NOT CASESPECIFIC,
      MACN_ID DECIMAL(15,0),
      segment CHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC)
UNIQUE PRIMARY INDEX ( SUBS_ID ,report_mnth )
INDEX ( CUST_ID )
INDEX ( segment );
Rocketq
  • Maybe you should check this: https://stackoverflow.com/questions/3711217/fastest-way-to-update-120-million-records – sorineatza May 25 '18 at 13:15
  • @sorineatza I think that post is a bit dated, 8 years old and tagged SQL Server 2005. I don't think it 100% applies today. I think OP could certainly break this update into separate statements (as the admin seems to suggest) rather than a giant, monolithic case update. – Jacob H May 25 '18 at 13:17
  • Of course this Update will be treated as a transaction, but 50GB is not a big table in Teradata and updating it should not result in TBs of Transient Journal. And both Updates should run quite similar. Updating in tranches of 200k records sounds like SQL Server or Oracle to me. Can you share the DDL of that table and how many rows it has? – dnoeth May 25 '18 at 13:43
  • What about: CREATE SET TABLE ,NO FALLBACK, NO BEFORE JOURNAL,NO AFTER JOURNAL, ... ? – access_granted May 25 '18 at 22:41
  • @access_granted It's already in that format, except that creating a SET table isn't a good idea – Rocketq May 28 '18 at 06:25

1 Answer


You didn't mention a very important detail.

First you added the new column segment and then updated it.

Then you created a Secondary Index on segment.

Then you ran the 2nd Update.

This NUSI is quite small due to the low number of distinct values. Maintenance is also quite costly due to the high number of repeated values (I never tested it, but this should be the reason for the skewed Transient Journal, too).

There's a rule of thumb (and it's similar in most (all?) DBMSes): Don't update a large percentage of rows when it's an indexed column.

You usually do this only once and then you get a call from the DBA that you will never forget :-) (and of course the fix shouldn't be a "you'd better update in tranches").

Drop the index before the Update and then recreate it.
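For example (a sketch using the names from your DDL; the UPDATE itself stays exactly as in the question):

DROP INDEX (segment) ON UAT_DM.ai_SUBS_MONTH_CLR;

-- run the big UPDATE from the question here, unchanged

CREATE INDEX (segment) ON UAT_DM.ai_SUBS_MONTH_CLR;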

But why do you put a Secondary Index on segment at all? It will probably not be used by the optimizer due to its low selectivity. And then you could simply do the CASE as a calculation instead of a stored column, as sketched below.
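A sketch of that idea (the view name v_SUBS_MONTH_CLR is made up; the stored segment column and its index could then be dropped entirely):

-- Sketch: derive segment at query time instead of storing and indexing it.
REPLACE VIEW UAT_DM.v_SUBS_MONTH_CLR AS
SELECT CUST_ID, LT_month, days_to_LF, REV_COM, device_type,
       usg_qq, usg_dd, report_mnth, MACN_ID,
       CASE WHEN LT_month <= 4  THEN '0'
            WHEN LT_month <= 8  THEN '1'
            WHEN LT_month <= 12 THEN '2'
            WHEN LT_month <= 17 THEN '3'
            WHEN LT_month <= 27 THEN '42'
            WHEN LT_month <= 36 THEN '52'
            WHEN LT_month <= 56 THEN '6'
            WHEN LT_month <= 83 THEN '7'
            WHEN LT_month <= 96 THEN '08'
            ELSE '9' END AS segment
FROM UAT_DM.ai_SUBS_MONTH_CLR;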

And assuming that SUBS_ID is the same as CUST_ID, you can get rid of the 2nd NUSI by changing the Primary Index to PRIMARY INDEX ( SUBS_ID ); there's probably not a large number of rows per SUBS_ID. If you access the table a lot via report_mnth, you could partition by it (then you'd better change it to an INT, 201805 instead of '2018-05').
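A sketch of such a definition (assuming SUBS_ID exists with the same type as CUST_ID, switching report_mnth to an INTEGER yyyymm, and using a made-up table name and partition range):

CREATE MULTISET TABLE UAT_DM.ai_SUBS_MONTH_CLR_new ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL
     (
      SUBS_ID DECIMAL(12,0),   -- assumed type, not shown in the question's DDL
      CUST_ID DECIMAL(12,0),
      LT_month DECIMAL(15,2),
      days_to_LF DECIMAL(15,0),
      REV_COM DECIMAL(18,6),
      device_type VARCHAR(50) CHARACTER SET UNICODE CASESPECIFIC,
      usg_qq DECIMAL(18,0),
      usg_dd DECIMAL(18,6),
      report_mnth INTEGER,     -- 201805 instead of '2018-05'
      MACN_ID DECIMAL(15,0),
      segment CHAR(5) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( SUBS_ID )
PARTITION BY RANGE_N(report_mnth BETWEEN 201501 AND 203512 EACH 1,
                     NO RANGE, UNKNOWN);

The NUSIs on CUST_ID and segment are intentionally left out, per the reasoning above.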

dnoeth
  • Thank you, now everything is clear to me. You put much more effort into explaining this than my DBAs ever have ;) – Rocketq May 28 '18 at 19:48