
I have a SQL Server database with a table called searches_tweets that has hundreds of millions of rows. The table has search_id and datetime_created_at columns (it also has datetime_modified_at and tweet_id columns).

I need to update the search_id of 270 million rows. These rows all share a certain search_id value; I need to change the search_id of every such row created after a certain date.

This is the first query I tried:

update searches_tweets 
set search_id = 123 
where search_id = 456 
  and datetime_created_at > '2020-01-07'

It was really slow, and after a few reconnection attempts (with none of the changes committed) I decided to do a batched update that commits after every 4,000 updated rows:

DECLARE @Rows INT,
        @BatchSize INT;

SET @BatchSize = 4000;
SET @Rows = @BatchSize;

BEGIN TRY
  -- Keep looping until a batch updates fewer rows than the batch size.
  WHILE (@Rows = @BatchSize)
  BEGIN
      UPDATE TOP (@BatchSize) tab
      SET tab.search_id = 123
      FROM searches_tweets tab
      WHERE tab.search_id = 456
        AND tab.datetime_created_at > '2020-01-07'
        AND tab.search_id <> 123;

      SET @Rows = @@ROWCOUNT;
  END;
END TRY
BEGIN CATCH
  RAISERROR(N'Error message', 10, 1);
  RETURN;
END CATCH;
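Strictly speaking, each UPDATE in that loop runs as its own autocommit transaction, which is where the per-batch commit comes from. A minimal sketch of the same loop with explicit per-batch transactions (same table and values) would be:

DECLARE @BatchSize INT = 4000,
        @Rows INT;

SET @Rows = @BatchSize;

WHILE (@Rows = @BatchSize)
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (@BatchSize) tab
    SET tab.search_id = 123
    FROM searches_tweets tab
    WHERE tab.search_id = 456
      AND tab.datetime_created_at > '2020-01-07';

    SET @Rows = @@ROWCOUNT;

    -- Committing per batch releases locks and, under SIMPLE recovery,
    -- lets the transaction log space be reused between batches.
    COMMIT TRANSACTION;
END;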

The table has 3 indexes:

  • One is on columns search_id and tweet_id
  • One is on column search_id
  • One is on columns datetime_modified_at and datetime_created_at

The table has a composite primary key that includes both search_id and tweet_id columns.
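For reference, the structure described above corresponds to roughly the following DDL (a sketch: the column types and the index/constraint names are assumptions; only the columns, indexes, and key shape come from the description):

CREATE TABLE searches_tweets (
    search_id            INT      NOT NULL,  -- types are assumptions
    tweet_id             BIGINT   NOT NULL,
    datetime_created_at  DATETIME NOT NULL,
    datetime_modified_at DATETIME NOT NULL,
    CONSTRAINT PK_searches_tweets PRIMARY KEY (search_id, tweet_id)
);

-- Index names are illustrative.
CREATE INDEX IX_search_tweet     ON searches_tweets (search_id, tweet_id);
CREATE INDEX IX_search           ON searches_tweets (search_id);
CREATE INDEX IX_modified_created ON searches_tweets (datetime_modified_at, datetime_created_at);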

This works faster, but it still takes too long given the number of rows I have to update. Can you think of any way to do this faster?

Thank you!

Mark Johnson
    How slow are we talking here? Is the table indexed? – Radagast Nov 19 '21 at 15:05
  • It updates about 1.5 million rows per hour. This means it would take 180 hours (almost 8 days) to finish. Yes, it is indexed; I just edited my question to mention the indexes – Mark Johnson Nov 19 '21 at 15:13
  • More importantly, is search_id included in many indexes? The cost of the update may be in maintaining the indexes. Depending on how heavily this table is used, and if you can, you may want to consider locking the table, disabling the indexes, performing the update, and rebuilding the indexes (see the disable/rebuild sketch after these comments). And what about triggers? Would any fire as a result of the update? – xQbert Nov 19 '21 at 15:15
  • Yes, search_id is included in 2 indexes. Most importantly, I think a performance issue may come from the fact that there is no index covering both search_id and datetime_created_at (see the index sketch after these comments). What do you think? – Mark Johnson Nov 19 '21 at 15:20
  • Is there a PK on the table beyond search_id? Could you do something like this: https://stackoverflow.com/questions/35903375/how-to-update-large-table-with-millions-of-rows-in-sql-server/35931214? Kinda looks like you already are to some extent... but what about filtering by a temp table and joining on PKs, if possible (see the temp-table sketch after these comments)? But yes, a missing index could be causing the issue; you'd have to look at the plan to see where the performance hit comes from. – xQbert Nov 19 '21 at 15:22
  • I have a composite primary key that includes search_id and tweet_id columns. I will add this to the original question – Mark Johnson Nov 19 '21 at 15:31
  • You have `WHERE tab.search_id = 456`, so there is no need for the `AND tab.search_id <> 123` in your batch solution. Or am I missing something? – GuidoG Nov 19 '21 at 15:37
  • "Efficient" can be interpreted in many ways. Does it matter if the table is not accessible during the update? What percentage of rows are you updating? Post DDL so we know exactly how the table is structured - the clustered index is important. Does the table have triggers? Is it used in indexed views? That are many considerations here. – SMor Nov 19 '21 at 15:50
  • Might be worth cloning that table and experimenting with a delete + insert method (see the sketch after these comments). – Radagast Nov 19 '21 at 16:43
  • The update shouldn't take that long, so it's got to be missing indexes to find the records, plus the cost of triggers, index updates, and cascade updates, that is taking so long. Check dependencies to see if you have triggers/indexes/relationships that would need to be updated along with the table data. It's all this subordinate data/metadata that is causing your delay. So then ask what you can do to mitigate the costs associated with those dependencies. There's more here we don't know yet. – xQbert Nov 19 '21 at 19:56
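Regarding xQbert's disable/rebuild suggestion, a minimal sketch (index names are taken from the illustrative DDL sketch earlier; the clustered PK cannot be disabled, since disabling a clustered index makes the table inaccessible, so only the nonclustered indexes are candidates):

-- Disable the nonclustered indexes so the UPDATE does not maintain them.
ALTER INDEX IX_search       ON searches_tweets DISABLE;
ALTER INDEX IX_search_tweet ON searches_tweets DISABLE;

-- ...run the batched UPDATE here...

-- Rebuilding re-enables the indexes.
ALTER INDEX IX_search       ON searches_tweets REBUILD;
ALTER INDEX IX_search_tweet ON searches_tweets REBUILD;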
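On the missing-index point, a nonclustered index covering both filter columns would let each batch seek directly to the qualifying rows (the name is illustrative). Note the trade-off: because it leads on search_id, the UPDATE would then have one more index to maintain:

CREATE NONCLUSTERED INDEX IX_searches_tweets_search_created
    ON searches_tweets (search_id, datetime_created_at);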
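A sketch of the temp-table approach xQbert links to: capture the primary-key values once, then batch-update by joining on the key (tweet_id plus the old search_id, since search_id itself is the value being changed):

-- Capture the keys of the rows to change (one pass, no updates yet).
SELECT tweet_id
INTO #to_update
FROM searches_tweets
WHERE search_id = 456
  AND datetime_created_at > '2020-01-07';

CREATE CLUSTERED INDEX IX_tmp ON #to_update (tweet_id);

DECLARE @Rows INT = 1;

WHILE (@Rows > 0)
BEGIN
    UPDATE TOP (4000) tab
    SET tab.search_id = 123
    FROM searches_tweets tab
    INNER JOIN #to_update t
        ON t.tweet_id = tab.tweet_id
    WHERE tab.search_id = 456;  -- already-updated rows no longer match

    SET @Rows = @@ROWCOUNT;
END;

DROP TABLE #to_update;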
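And Radagast's delete + insert experiment on a clone might look like the sketch below (searches_tweets_clone is a hypothetical name, and it assumes no (123, tweet_id) key already exists):

-- Insert copies of the affected rows with the new search_id...
INSERT INTO searches_tweets_clone (search_id, tweet_id, datetime_created_at, datetime_modified_at)
SELECT 123, tweet_id, datetime_created_at, datetime_modified_at
FROM searches_tweets_clone
WHERE search_id = 456
  AND datetime_created_at > '2020-01-07';

-- ...then delete the originals.
DELETE FROM searches_tweets_clone
WHERE search_id = 456
  AND datetime_created_at > '2020-01-07';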

0 Answers